Commit Graph

506 Commits

Author SHA1 Message Date
Chaitanya Kulkarni b13c6393be nvme-pci: use max of PRP or SGL for iod size
From the initial implementation of NVMe SGL kernel support in
commit a7a7cbe353 ("nvme-pci: add SGL support"), and with the addition of
commit 943e942e62 ("nvme-pci: limit max IO size and segments to avoid
high order allocations"), there is now only one caller left for
nvme_pci_iod_alloc_size(). It statically passes true for the last
parameter, which calculates the allocation size based on SGL, since we
need the size of the biggest supported command for the mempool allocation.

This patch modifies the helper function nvme_pci_iod_alloc_size() so
that it now uses the maximum of the PRP and SGL sizes for the iod
allocation size calculation.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:19 +02:00
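A minimal userspace sketch of the sizing idea in b13c6393be above: work out the bytes needed for the largest PRP list and the largest SGL list, then size the allocation for whichever is bigger. The limits and function names are assumptions made for illustration, not the driver's actual values or helpers.

```c
#include <stddef.h>

/* Hypothetical limits, chosen only for illustration. */
#define MAX_IO_BYTES   (4u * 1024u * 1024u) /* assumed largest transfer */
#define MAX_SEGS       127u                 /* assumed segment limit */
#define CTRL_PAGE_SIZE 4096u                /* assumed controller page size */
#define PRP_ENTRY_SIZE 8u                   /* a PRP entry is a 64-bit address */
#define SGL_DESC_SIZE  16u                  /* an SGL descriptor is 16 bytes */

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Bytes of PRP entries for the largest transfer; +1 covers an unaligned start. */
static size_t prp_list_bytes(void)
{
	size_t nprps = DIV_ROUND_UP(MAX_IO_BYTES, CTRL_PAGE_SIZE) + 1;

	return nprps * PRP_ENTRY_SIZE;
}

/* Bytes of SGL descriptors when every segment needs its own descriptor. */
static size_t sgl_list_bytes(void)
{
	return (size_t)MAX_SEGS * SGL_DESC_SIZE;
}

/* Size for the worst case so one mempool element fits either format. */
static size_t iod_alloc_size(void)
{
	size_t prp = prp_list_bytes();
	size_t sgl = sgl_list_bytes();

	return prp > sgl ? prp : sgl;
}
```

With these assumed limits the PRP list dominates, but taking the maximum keeps the result correct if the limits change so that the SGL list becomes the larger one.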
Chaitanya Kulkarni 6c3c05b087 nvme-core: replace ctrl page size with a macro
Saving the nvme controller's page size was from a time when the driver
tried to use different sized pages, but this value is always set to
a constant, and has been this way for some time. Remove the 'page_size'
field and replace its usage with the constant value.

This also lets the compiler make some micro-optimizations in the io
path, and that's always a good thing.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:18 +02:00
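A small sketch of why a constant helps in 6c3c05b087 above: when the page size is a compile-time power of two, page arithmetic reduces to masks and shifts with no load from a structure field. The macro names are recalled from the driver and should be treated as assumptions; the helper function is hypothetical.

```c
#include <stddef.h>

/* Names recalled from the driver; treat them as assumptions. */
#define NVME_CTRL_PAGE_SHIFT 12
#define NVME_CTRL_PAGE_SIZE  (1u << NVME_CTRL_PAGE_SHIFT)

/* Number of controller pages touched by len bytes starting at offset. */
static size_t ctrl_pages_spanned(size_t offset, size_t len)
{
	/* Offset of the first byte within its controller page. */
	size_t first = offset & (NVME_CTRL_PAGE_SIZE - 1);

	/* Round up to whole pages using the constant shift. */
	return (first + len + NVME_CTRL_PAGE_SIZE - 1) >> NVME_CTRL_PAGE_SHIFT;
}
```

For example, a 4096-byte buffer that starts 100 bytes into a page spans two controller pages.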
Baolin Wang 359c1f88ab nvme-pci: use standard block status symbolic names
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:22 +02:00
Baolin Wang 9056fc9fc5 nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size()
The nvme_pci_iod_alloc_size() should return 'size_t' type to be
consistent with the sizeof return value.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:21 +02:00
Baolin Wang 4e523547e2 nvme-pci: add a blank line after declarations
Add a blank line after declarations to make code more readable.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:21 +02:00
Baolin Wang ee0d96d322 nvme-pci: fix some comments issues
Fix comment typos and remove whitespace before tabs to clean up
checkpatch errors.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:21 +02:00
Baolin Wang c25c853ef6 nvme-pci: remove redundant segment validation
We've validated the segment counts before calling nvme_map_data(),
so there is no need to validate again in nvme_pci_use_sgls(), which is
only called from nvme_map_data().

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:21 +02:00
David Fugate 972b13e29d nvme: document quirked Intel models
Documented model names of Intel SSDs requiring quirks.

Signed-off-by: David Fugate <david.fugate@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:20 +02:00
Dongli Zhang ad50999643 nvme-pci: remove the empty line at the beginning of nvme_should_reset()
Just cleanup by removing the empty line.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:16 +02:00
Chaitanya Kulkarni 9dc54a0d15 nvme-pci: code cleanup for nvme_alloc_host_mem()
Although use of a for loop is preferred, it is not common practice to
have an 80-character-long for loop initialization and comparison section.

Use temporary variables to calculate the values and use them in the
for loop, with all of these variables typed as u64 since the preferred
variable is declared as u64.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:16 +02:00
Chaitanya Kulkarni 61f3b89630 nvme-pci: use unsigned for io queue depth
The NVMe PCIe driver declares the module parameter io_queue_depth as int.
Change this to u16, as the queue depth can never be negative. To reflect
this, update the module parameter getter from param_get_int() to
param_get_uint(), and the setter from param_set_int() to
param_set_ushort() with the type of n changed from int to u16. Finally,
change the q_depth member of struct nvme_dev to u16, use u16 in min_t()
when calculating dev->q_depth in nvme_pci_enable() (since q_depth is
now u16), and use unsigned int instead of int when calculating
dev->tagset.queue_depth in nvme_dev_add(), since the target variable
tagset->queue_depth is of type unsigned int.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:16 +02:00
Jens Axboe 482c6b614a Linux 5.8-rc4
Merge tag 'v5.8-rc4' into for-5.9/drivers

Merge in 5.8-rc4 for-5.9/block to set up for-5.9/drivers, to provide
a clean base and make life easier for the NVMe changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v5.8-rc4': (732 commits)
  Linux 5.8-rc4
  x86/ldt: use "pr_info_once()" instead of open-coding it badly
  MIPS: Do not use smp_processor_id() in preemptible code
  MIPS: Add missing EHB in mtc0 -> mfc0 sequence for DSPen
  .gitignore: Do not track `defconfig` from `make savedefconfig`
  io_uring: fix regression with always ignoring signals in io_cqring_wait()
  x86/ldt: Disable 16-bit segments on Xen PV
  x86/entry/32: Fix #MC and #DB wiring on x86_32
  x86/entry/xen: Route #DB correctly on Xen PV
  x86/entry, selftests: Further improve user entry sanity checks
  x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER
  i2c: mlxcpld: check correct size of maximum RECV_LEN packet
  i2c: add Kconfig help text for slave mode
  i2c: slave-eeprom: update documentation
  i2c: eg20t: Load module automatically if ID matches
  i2c: designware: platdrv: Set class based on DMI
  i2c: algo-pca: Add 0x78 as SCL stuck low status for PCA9665
  mm/page_alloc: fix documentation error
  vmalloc: fix the owner argument for the new __vmalloc_node_range callers
  mm/cma.c: use exact_nid true to fix possible per-numa cma leak
  ...
2020-07-08 08:02:13 -06:00
Max Gurtovoy d4ec47f120 nvme-pci: initialize tagset numa value to the value of the ctrl
Both the admin and the drive tagsets should be set according to the NUMA
node of the controller.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-24 18:37:08 +02:00
Max Gurtovoy 635333e400 nvme-pci: override the value of the controller's numa node
Set the node value according to the PCI device numa node.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-24 18:37:08 +02:00
Christoph Hellwig ff02945149 nvme: use blk_mq_complete_request_remote to avoid an indirect function call
Use the new blk_mq_complete_request_remote helper to avoid an indirect
function call in the completion fast path.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:57 -06:00
Christoph Hellwig b97120b15e nvme-pci: use simple suspend when a HMB is enabled
While the NVMe specification allows the device to access the host memory
buffer in host DRAM from all power states, hosts will fail access to
DRAM during S3 and similar power states.

Fixes: d916b1be94 ("nvme-pci: use host managed power state for suspend")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-11 09:10:06 -06:00
Linus Torvalds bce159d734 for-5.8/drivers-2020-06-01
Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "On top of the core changes, here are the block driver changes for this
  merge window:

   - NVMe changes:
        - NVMe over Fibre Channel protocol updates, which also reach
          over to drivers/scsi/lpfc (James Smart)
        - namespace revalidation support on the target (Anthony
          Iliopoulos)
        - gcc zero length array fix (Arnd Bergmann)
        - nvmet cleanups (Chaitanya Kulkarni)
        - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg)
        - use a SRQ per completion vector (Max Gurtovoy)
        - fix handling of runtime changes to the queue count (Weiping
          Zhang)
        - t10 protection information support for nvme-rdma and
          nvmet-rdma (Israel Rukshin and Max Gurtovoy)
        - target side AEN improvements (Chaitanya Kulkarni)
        - various fixes and minor improvements all over, including the
          nvme part of the lpfc driver

   - Floppy code cleanup series (Willy, Denis)

   - Floppy contention fix (Jiri)

   - Loop CONFIGURE support (Martijn)

   - bcache fixes/improvements (Coly, Joe, Colin)

   - q->queuedata cleanups (Christoph)

   - Get rid of ioctl_by_bdev (Christoph, Stefan)

   - md/raid5 allocation fixes (Coly)

   - zero length array fixes (Gustavo)

   - swim3 task state fix (Xu)"

* tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits)
  bcache: configure the asynchronous registertion to be experimental
  bcache: asynchronous devices registration
  bcache: fix refcount underflow in bcache_device_free()
  bcache: Convert pr_<level> uses to a more typical style
  bcache: remove redundant variables i and n
  lpfc: Fix return value in __lpfc_nvme_ls_abort
  lpfc: fix axchg pointer reference after free and double frees
  lpfc: Fix pointer checks and comments in LS receive refactoring
  nvme: set dma alignment to qword
  nvmet: cleanups the loop in nvmet_async_events_process
  nvmet: fix memory leak when removing namespaces and controllers concurrently
  nvmet-rdma: add metadata/T10-PI support
  nvmet: add metadata support for block devices
  nvmet: add metadata/T10-PI support
  nvme: add Metadata Capabilities enumerations
  nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len
  nvmet: rename nvmet_rw_len to nvmet_rw_data_len
  nvmet: add metadata characteristics for a namespace
  nvme-rdma: add metadata/T10-PI support
  nvme-rdma: introduce nvme_rdma_sgl structure
  ...
2020-06-02 15:37:03 -07:00
Dongli Zhang 9210c075ce nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll()
There may be a race between nvme_reap_pending_cqes() and nvme_poll(), e.g.,
when doing live reset while polling the nvme device.

      CPU X                        CPU Y
                               nvme_poll()
nvme_dev_disable()
-> nvme_stop_queues()
-> nvme_suspend_io_queues()
-> nvme_suspend_queue()
                               -> spin_lock(&nvmeq->cq_poll_lock);
-> nvme_reap_pending_cqes()
   -> nvme_process_cq()        -> nvme_process_cq()

In the above scenario, the nvme_process_cq() for the same queue may be
running on both CPU X and CPU Y concurrently.

It is much easier to reproduce the issue when CONFIG_PREEMPT is
enabled in the kernel. When CONFIG_PREEMPT is disabled, it takes longer
for nvme_stop_queues()-->blk_mq_quiesce_queue() to wait for the grace
period.

This patch protects nvme_process_cq() with nvmeq->cq_poll_lock in
nvme_reap_pending_cqes().

Fixes: fa46c6fb5d ("nvme/pci: move cqe check after device shutdown")
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 20:32:56 +02:00
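A userspace sketch of the locking rule that 9210c075ce above establishes: every path that processes the completion queue takes the same per-queue lock, so the poll path and the teardown path can never run the processing step concurrently. The structure, the spinlock built on a C11 atomic_flag, and all function names are hypothetical stand-ins for the driver's own.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the driver's per-queue state. */
struct queue {
	atomic_flag cq_poll_lock; /* serializes every caller of process_cq() */
	unsigned int pending;     /* completions posted but not yet handled */
	unsigned int completed;   /* completions handled so far */
};

static void queue_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		; /* spin until the current holder releases the lock */
}

static void queue_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Must only run with cq_poll_lock held. */
static unsigned int process_cq(struct queue *q)
{
	unsigned int found = q->pending;

	q->completed += found;
	q->pending = 0;
	return found;
}

/* Poll path: took the lock even before the fix. */
static unsigned int poll_queue(struct queue *q)
{
	unsigned int found;

	queue_lock(&q->cq_poll_lock);
	found = process_cq(q);
	queue_unlock(&q->cq_poll_lock);
	return found;
}

/* Teardown path: the fix is to take the same lock here as well. */
static unsigned int reap_pending_cqes(struct queue *q)
{
	unsigned int found;

	queue_lock(&q->cq_poll_lock);
	found = process_cq(q);
	queue_unlock(&q->cq_poll_lock);
	return found;
}
```

Whichever path gets the lock first consumes the pending entries; the other then finds nothing left instead of racing on the same queue state.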
Max Gurtovoy 9509335039 nvme: introduce max_integrity_segments ctrl attribute
This patch doesn't change any logic, and is needed as a preparation
for adding PI support for fabrics drivers that will use an extended
LBA format for metadata and will support more than 1 integrity segment.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 07:12:39 +02:00
Weiping Zhang 9c9e76d579 nvme-pci: make sure write/poll_queues less or equal then cpu count
Check module parameter write/poll_queues before using it to catch
too large values.

Reproducer:

modprobe -r nvme
modprobe nvme write_queues=`nproc`
echo $((`nproc`+1)) > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller

[  657.069000] ------------[ cut here ]------------
[  657.069022] WARNING: CPU: 10 PID: 1163 at kernel/irq/affinity.c:390 irq_create_affinity_masks+0x47c/0x4a0
[  657.069056]  dm_region_hash dm_log dm_mod
[  657.069059] CPU: 10 PID: 1163 Comm: kworker/u193:9 Kdump: loaded Tainted: G        W         5.6.0+ #8
[  657.069060] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[  657.069064] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  657.069066] RIP: 0010:irq_create_affinity_masks+0x47c/0x4a0
[  657.069067] Code: fe ff ff 48 c7 c0 b0 89 14 95 48 89 46 20 e9 e9 fb ff ff 31 c0 e9 90 fc ff ff 0f 0b 48 c7 44 24 08 00 00 00 00 e9 e9 fc ff ff <0f> 0b e9 87 fe ff ff 48 8b 7c 24 28 e8 33 a0 80 00 e9 b6 fc ff ff
[  657.069068] RSP: 0018:ffffb505ce1ffc78 EFLAGS: 00010202
[  657.069069] RAX: 0000000000000060 RBX: ffff9b97921fe5c0 RCX: 0000000000000000
[  657.069069] RDX: ffff9b67bad80000 RSI: 00000000ffffffa0 RDI: 0000000000000000
[  657.069070] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9b97921fe718
[  657.069070] R10: ffff9b97921fe710 R11: 0000000000000001 R12: 0000000000000064
[  657.069070] R13: 0000000000000060 R14: 0000000000000000 R15: 0000000000000001
[  657.069071] FS:  0000000000000000(0000) GS:ffff9b67c0880000(0000) knlGS:0000000000000000
[  657.069072] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  657.069072] CR2: 0000559eac6fc238 CR3: 000000057860a002 CR4: 00000000007606e0
[  657.069073] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  657.069073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  657.069073] PKRU: 55555554
[  657.069074] Call Trace:
[  657.069080]  __pci_enable_msix_range+0x233/0x5a0
[  657.069085]  ? kernfs_put+0xec/0x190
[  657.069086]  pci_alloc_irq_vectors_affinity+0xbb/0x130
[  657.069089]  nvme_reset_work+0x6e6/0xeab [nvme]
[  657.069093]  ? __switch_to_asm+0x34/0x70
[  657.069094]  ? __switch_to_asm+0x40/0x70
[  657.069095]  ? nvme_irq_check+0x30/0x30 [nvme]
[  657.069098]  process_one_work+0x1a7/0x370
[  657.069101]  worker_thread+0x1c9/0x380
[  657.069102]  ? max_active_store+0x80/0x80
[  657.069103]  kthread+0x112/0x130
[  657.069104]  ? __kthread_parkme+0x70/0x70
[  657.069105]  ret_from_fork+0x35/0x40
[  657.069106] ---[ end trace f4f06b7d24513d06 ]---
[  657.077110] nvme nvme0: 95/1/0 default/read/poll queues

Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 07:12:37 +02:00
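A minimal sketch of the kind of check 9c9e76d579 above describes: validate the value when the parameter is written, and reject anything larger than the CPU count. This is a userspace approximation; the function name, its signature, and passing the CPU count as an argument are assumptions made for illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical setter: parse val and reject counts above ncpus. */
static int queue_count_set(const char *val, unsigned int ncpus,
			   unsigned int *param)
{
	char *end;
	unsigned long n;

	errno = 0;
	n = strtoul(val, &end, 10);
	if (*end == '\n')
		end++;            /* tolerate the newline that echo appends */
	if (errno || end == val || *end != '\0')
		return -EINVAL;   /* not a clean decimal number */
	if (n > ncpus)
		return -EINVAL;   /* more queues than CPUs to serve them */

	*param = (unsigned int)n;
	return 0;
}
```

With eight CPUs, writing "8" succeeds while "9" is refused and leaves the stored value unchanged, which is the case the reproducer above triggers.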
Keith Busch b69e2ef24b nvme-pci: dma read memory barrier for completions
Control dependencies do not guarantee load order across the condition,
allowing a CPU to predict and speculate memory reads.

Commit 324b494c28 inlined verifying a new completion with its
handling. At least one architecture was observed to access the contents
out of order, resulting in the driver using stale data for the
completion.

Add a dma read barrier before reading the completion queue entry and
after the condition its contents depend on to ensure the read order is
deterministic.

Reported-by: John Garry <john.garry@huawei.com>
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: John Garry <john.garry@huawei.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-12 18:02:24 +02:00
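A userspace analogue of the ordering problem in b69e2ef24b above, using C11 atomics in place of the kernel's DMA barrier. The check of the phase bit is only a control dependency, so an acquire fence is placed between it and the read of the entry's contents. The structure layout and function names are hypothetical, and C11 atomics model CPU threads rather than device DMA, so this only illustrates where the barrier goes.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical completion entry: the producer fills result, then status. */
struct cqe {
	uint32_t result;
	_Atomic uint16_t status; /* bit 0 is the phase tag */
};

/* True when the entry's phase bit matches the phase we expect. */
static bool cqe_pending(struct cqe *e, uint16_t phase)
{
	uint16_t status = atomic_load_explicit(&e->status,
					       memory_order_relaxed);

	return (status & 1) == phase;
}

/* Copy out the result only after the phase check, in that order. */
static bool consume_cqe(struct cqe *e, uint16_t phase, uint32_t *result)
{
	if (!cqe_pending(e, phase))
		return false;

	/* The branch alone does not order the loads; the fence does. */
	atomic_thread_fence(memory_order_acquire);
	*result = e->result;
	return true;
}
```

Without the fence, the read of result could be satisfied before the status check, which matches the stale-data symptom the commit message describes.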
Weiping Zhang 2a5bcfdd41 nvme-pci: align io queue count with allocted nvme_queue in nvme_probe
Since commit 147b27e4bd ("nvme-pci: allocate device queues storage
space at probe"), nvme_alloc_queue does not alloc the nvme queues
itself anymore.

If the write/poll_queues module parameters are changed at runtime to
values larger than the number of allocated queues in nvme_probe,
nvme_alloc_queue will access unallocated memory.

Add a new nr_allocated_queues member to struct nvme_dev to record how
many queues were allocated in nvme_probe to avoid using more than the
allocated queues after a reset following a change to the
write/poll_queues module parameters.

Also add nr_write_queues and nr_poll_queues members to allow refreshing
the number of write and poll queues based on a change to the module
parameters when resetting the controller.

Fixes: 147b27e4bd ("nvme-pci: allocate device queues storage space at probe")
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
[hch: add nvme_max_io_queues, update the commit message]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:36 -06:00
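A sketch of the bookkeeping that 2a5bcfdd41 above adds: record at probe time how many queues have storage, then clamp whatever a later reset asks for to that number. Only the field names nr_allocated_queues, nr_write_queues and nr_poll_queues come from the commit message; the structure, the functions, and the sizing formula are assumptions made for illustration.

```c
/* Hypothetical device state; only the fields this sketch needs. */
struct dev {
	unsigned int nr_allocated_queues; /* fixed at probe time */
	unsigned int nr_write_queues;     /* refreshed on each reset */
	unsigned int nr_poll_queues;      /* refreshed on each reset */
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Probe: size the storage once, one admin queue plus the I/O queues. */
static void dev_probe(struct dev *d, unsigned int ncpus,
		      unsigned int write_queues, unsigned int poll_queues)
{
	d->nr_write_queues = write_queues;
	d->nr_poll_queues = poll_queues;
	d->nr_allocated_queues = 1 + ncpus + write_queues + poll_queues;
}

/*
 * Reset: pick up the current parameter values, but never use more I/O
 * queues than the storage allocated at probe time can hold.
 */
static unsigned int dev_reset(struct dev *d, unsigned int ncpus,
			      unsigned int write_queues,
			      unsigned int poll_queues)
{
	unsigned int wanted = ncpus + write_queues + poll_queues;

	d->nr_write_queues = write_queues;
	d->nr_poll_queues = poll_queues;
	return min_uint(wanted, d->nr_allocated_queues - 1);
}
```

Raising the parameters after probe therefore changes how the existing queues are split, not how many of them are touched.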
Keith Busch 54b2fcee1d nvme-pci: remove last_sq_tail
The nvme driver does not have enough tags to wrap the queue, and blk-mq
will no longer call commit_rqs() when there are no new submissions to
notify.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:36 -06:00
Keith Busch 74943d45ee nvme-pci: remove volatile cqes
The completion queue entry is not volatile once the phase is confirmed.
Remove the volatile keywords and check the phase using the appropriate
READ_ONCE() accessor, allowing the compiler to optimize the remaining
completion path.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:36 -06:00
Alexey Dobriyan a8de663916 nvme-pci: fix "slimmer CQ head update"
Pre-incrementing ->cq_head can't be done in memory because OOB value
can be observed by another context.

This devalues space savings compared to original code :-\

	$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
	add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-32 (-32)
	Function                                     old     new   delta
	nvme_poll_irqdisable                         464     456      -8
	nvme_poll                                    455     447      -8
	nvme_irq                                     388     380      -8
	nvme_dev_disable                             955     947      -8

But the code is minimal now: one read for head, one read for q_depth,
one increment, one comparison, single instruction phase bit update and
one write for new head.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: John Garry <john.garry@huawei.com>
Tested-by: John Garry <john.garry@huawei.com>
Fixes: e2a366a4b0 ("nvme-pci: slimmer CQ head update")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:07:58 -06:00
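A sketch of the pattern a8de663916 above describes: advance the head in a local variable and store to the shared field only once, so another context never observes the transient value equal to the queue depth. The structure and function name are hypothetical stand-ins.

```c
#include <stdint.h>

/* Hypothetical stand-in for the completion queue state. */
struct queue {
	uint16_t q_depth;  /* number of entries in the ring */
	uint16_t cq_head;  /* next entry to consume, always < q_depth */
	uint8_t  cq_phase; /* expected phase bit, flips on each wrap */
};

static void update_cq_head(struct queue *q)
{
	uint16_t tmp = q->cq_head + 1; /* advance in a local first */

	if (tmp == q->q_depth) {
		q->cq_head = 0;        /* wrap: a single store of 0 */
		q->cq_phase ^= 1;      /* single read-modify-write on phase */
	} else {
		q->cq_head = tmp;      /* single store of the new head */
	}
}
```

By contrast, pre-incrementing the field in memory and resetting it afterwards briefly leaves cq_head equal to q_depth, which is the out-of-bounds value the fix avoids.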
Israel Rukshin 726612b6b8 nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
Put the ctrl reference count at nvme_uninit_ctrl as opposed to
nvme_init_ctrl which takes it. This decreases the reference count at the
core layer instead of decreasing it on each transport separately.
Also move the call of nvme_uninit_ctrl at PCI driver after calling to
nvme_release_prp_pools and nvme_dev_unmap, in order to put the reference
count after using the dev. This is safe because those functions use
nvme_dev which is freed only later at nvme_pci_free_ctrl.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
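A toy model of the symmetry that 726612b6b8 above sets up together with b780d7415a below: the init function takes a reference when the controller becomes visible, and the matching uninit function drops it. The structure, the plain integer counter, and the function names are hypothetical; the real code uses the kernel's reference counting helpers.

```c
#include <stdbool.h>

/* Hypothetical controller with a plain, single-threaded refcount. */
struct ctrl {
	int refs;     /* starts at 1, held by the transport driver */
	bool visible; /* reachable from user space */
	bool freed;   /* set when the last reference is dropped */
};

static void ctrl_get(struct ctrl *c)
{
	c->refs++;
}

static void ctrl_put(struct ctrl *c)
{
	if (--c->refs == 0)
		c->freed = true;
}

/* Core layer: hold a reference for as long as the ctrl is visible. */
static void init_ctrl(struct ctrl *c)
{
	ctrl_get(c);
	c->visible = true;
}

/* Core layer: hide the ctrl, then drop the reference init_ctrl() took. */
static void uninit_ctrl(struct ctrl *c)
{
	c->visible = false;
	ctrl_put(c);
}
```

Because the get and the put both live in the core layer, a transport only has to drop its own initial reference when it is done with the device.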
Israel Rukshin b780d7415a nvme: Fix ctrl use-after-free during sysfs deletion
In case nvme_sysfs_delete() is called by the user before taking the ctrl
reference count, the ctrl may be freed during the creation and cause the
bug. Take the reference as soon as the controller is externally visible,
which is done by cdev_device_add() in nvme_init_ctrl(). Also take the
reference count at the core layer instead of taking it on each transport
separately.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
Israel Rukshin 253fd4ac80 nvme-pci: Re-order nvme_pci_free_ctrl
Destroy the resources in the same order as in the nvme_probe error flow
to improve code readability.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:55 +09:00
Max Gurtovoy 2db24e4a22 nvme-pci: properly print controller address
Align PCI address print with fabrics address that is printed with
newline character.

Before:
[root@server40 linux]# cat /sys/class/nvme/nvme2/address
0000:0b:00.0[root@server40 linux]#

After:
[root@server40 linux]# cat /sys/class/nvme/nvme2/address
0000:0b:00.0
[root@server40 linux]#

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
2020-03-26 04:51:54 +09:00
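The change in 2db24e4a22 above amounts to terminating the formatted address with a newline. A minimal userspace sketch, with a hypothetical function name:

```c
#include <stdio.h>

/* Format a PCI address for a sysfs-style attribute, newline terminated. */
static int format_address(char *buf, size_t size, const char *pci_name)
{
	return snprintf(buf, size, "%s\n", pci_name);
}
```

Reading the attribute with cat then leaves the shell prompt on its own line, as in the "After" output above.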
Keith Busch fa059b856a nvme-pci: Simplify nvme_poll_irqdisable
The timeout handler can use the existing nvme_poll() if it needs to
check a polled queue, allowing nvme_poll_irqdisable() to handle only
irq driven queues for the remaining callers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:48:06 +09:00
Keith Busch 324b494c28 nvme-pci: Remove two-pass completions
Completion handling had been done in two steps: find all new completions
under a lock, then handle those completions outside the lock. This was
done to make the locked section as short as possible so that other
threads using the same lock wait less time.

The driver no longer shares locks during completion, and is in fact
lockless for interrupt driven queues, so the optimization no longer
serves its original purpose. Replace the two-pass completion queue
handler with a single pass that completes entries immediately.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:48:06 +09:00
Keith Busch bf392a5dc0 nvme-pci: Remove tag from process cq
The only user for tagged completion was for timeout handling. That user,
though, really only cares if the timed out command is completed, which
we can safely check within the timeout handler.

Remove the tag check to simplify completion handling.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:48:06 +09:00
Alexey Dobriyan e2a366a4b0 nvme-pci: slimmer CQ head update
Update CQ head with pre-increment operator. This saves subtraction of 1
and a few registers.

Also update phase with "^= 1". This generates only one RMW instruction.

	ffffffff815ba150 <nvme_update_cq_head>:
	ffffffff815ba150:       0f b7 47 70             movzx  eax,WORD PTR [rdi+0x70]
	ffffffff815ba154:       83 c0 01                add    eax,0x1
	ffffffff815ba157:       66 89 47 70             mov    WORD PTR [rdi+0x70],ax
	ffffffff815ba15b:       66 3b 47 68             cmp    ax,WORD PTR [rdi+0x68]
	ffffffff815ba15f:       74 01                   je     ffffffff815ba162 <nvme_update_cq_head+0x12>
	ffffffff815ba161:       c3                      ret
	ffffffff815ba162:       31 c0                   xor    eax,eax
	ffffffff815ba164:       80 77 74 01      ===>   xor    BYTE PTR [rdi+0x74],0x1
	ffffffff815ba168:       66 89 47 70             mov    WORD PTR [rdi+0x70],ax
	ffffffff815ba16c:       c3                      ret

	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-119 (-119)
	Function                                     old     new   delta
	nvme_poll                                    690     678     -12
	nvme_dev_disable                            1230    1177     -53
	nvme_irq                                     613     559     -54

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2020-03-26 04:48:06 +09:00
Bijan Mottahedeh 9515743bfb nvme-pci: Hold cq_poll_lock while completing CQEs
Completions need to be consumed in the same order the controller submitted
them, otherwise future completion entries may overwrite ones we haven't
handled yet. Hold the nvme queue's poll lock while completing new CQEs to
prevent another thread from freeing command tags for reuse out-of-order.

Fixes: dabcefab45 ("nvme: provide optimized poll function for separate poll queues")
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-02-28 01:32:14 +09:00
Andy Shevchenko 98f7b86a0b nvme-pci: Use single IRQ vector for old Apple models
People reported that old Apple machines do not work properly
if a non-first IRQ vector is in use.

Set a quirk for those models to limit IRQ use to the first vector only.

Based on original patch by GitHub user npx001.

Link: https://github.com/Dunedan/mbp-2016-linux/issues/9
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Leif Liddy <leif.liddy@gmail.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-02-20 00:30:58 +09:00
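A sketch of how a quirk like the one in 98f7b86a0b above can be applied: a per-device flag caps the number of interrupt vectors requested. The flag name, its bit value, and the helper are hypothetical.

```c
/* Hypothetical quirk bit: the device only works with the first vector. */
#define QUIRK_SINGLE_VECTOR (1ul << 0)

/* Decide how many IRQ vectors to request for a device. */
static unsigned int irq_vectors_to_request(unsigned long quirks,
					   unsigned int wanted)
{
	if (quirks & QUIRK_SINGLE_VECTOR)
		return 1;   /* quirked models share one vector for all queues */
	return wanted;
}
```

Devices without the quirk are unaffected and still get one vector per queue where available.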
Shyjumon N 1fae37accf nvme/pci: Add sleep quirk for Samsung and Toshiba drives
The Samsung SSD SM981/PM981 and Toshiba SSD KBG40ZNT256G on the Lenovo
C640 platform experience runtime resume issues when the SSDs are kept in
sleep/suspend mode for a long time.

This patch applies the 'Simple Suspend' quirk to these configurations.
With this patch, the issue was not observed in a 1+ day test.

Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shyjumon N <shyjumon.n@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-02-20 00:29:39 +09:00
Keith Busch fa46c6fb5d nvme/pci: move cqe check after device shutdown
Many users have reported nvme triggered irq_startup() warnings during
shutdown. The driver uses the nvme queue's irq to synchronize scanning
for completions, and enabling an interrupt affined to only offline CPUs
triggers the alarming warning.

Move the final CQE check to after disabling the device and all
registered interrupts have been torn down so that we do not have any
IRQ to synchronize.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206509
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-14 10:12:04 -07:00
Christoph Hellwig cfa27356f8 nvme-pci: remove nvmeq->tags
There is no real need to have a pointer to the tagset in
struct nvme_queue, as we only need it in a single place, and that place
can derive the used tagset from the device and qid trivially.  This
fixes a problem with stale pointer exposure when tagsets are reset,
and also shrinks the nvme_queue structure.  It also matches what most
other transports have done since day 1.

Reported-by: Edmund Nadolski <edmund.nadolski@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-02-04 03:00:25 +09:00
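A sketch of the derivation that cfa27356f8 above relies on: instead of caching a tags pointer in each queue, look it up from the device and the queue id when needed. Queue id 0 is taken to be the admin queue and I/O queue ids start at 1; the structures and the function name are hypothetical simplifications.

```c
/* Hypothetical, simplified versions of the structures involved. */
struct tags {
	int id;
};

struct tag_set {
	struct tags **tags; /* one entry per hardware queue */
};

struct dev {
	struct tag_set admin_tagset;
	struct tag_set tagset;      /* I/O queues */
};

struct queue {
	struct dev *dev;
	unsigned short qid;         /* 0 is the admin queue */
};

/* Derive the tags for a queue instead of storing a pointer in it. */
static struct tags *queue_tagset(const struct queue *q)
{
	if (q->qid == 0)
		return q->dev->admin_tagset.tags[0];
	return q->dev->tagset.tags[q->qid - 1];
}
```

Since nothing is cached per queue, there is no pointer left to go stale when the tagsets are reset.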
Keith Busch 7e4c6b9a5d nvme/pci: Fix read queue count
If nvme.write_queues equals the number of CPUs, the driver had decreased
the number of interrupts available such that there could only be one read
queue even if the controller could support more. Remove the interrupt
count reduction in this case. The driver wouldn't request more IRQs than
it wants queues anyway.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-12-07 02:52:47 +09:00
Keith Busch 17c3316734 nvme/pci Limit write queue sizes to possible cpus
The driver can never use more queues of any type than the number of
possible CPUs, so a higher value causes the driver to allocate more
memory for IO queues than it could ever use. Limit the parameter at
module load time to the number of possible cpus.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-12-07 02:52:42 +09:00
Keith Busch 3f68baf706 nvme/pci: Fix write and poll queue types
The number of poll or write queues should never be negative. Use unsigned
types so that it's not possible to have the driver not allocate
any queues.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-12-07 02:52:24 +09:00
Keith Busch f6c4d97b0d nvme/pci: Remove last_cq_head
We had been saving the last_cq_head seen from an interrupt so that a
polled queue wouldn't mistakenly trigger spurious interrupt detection. We
don't poll interrupt driven queues any more, so saving this value is
pointless.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-12-04 00:38:06 +09:00
Edmund Nadolski c80b36cd95 nvme: else following return is not needed
Remove unnecessary keyword in nvme_create_queue().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-11-27 02:48:33 +09:00
Linus Torvalds 323264eefb for-5.5/drivers-post-20191122
Merge tag 'for-5.5/drivers-post-20191122' of git://git.kernel.dk/linux-block

Pull additional block driver updates from Jens Axboe:
 "Here's another block driver update, done to avoid conflicts with the
  zoned changes coming next.

  This contains:

   - Prepare SCSI sd for zone open/close/finish support

   - Small NVMe pull request
        - hwmon support (Akinobu)
        - add new co-maintainer (Christoph)
        - work-around for a discard issue on non-conformant drives
          (Eduard)

   - Small nbd leak fix"

* tag 'for-5.5/drivers-post-20191122' of git://git.kernel.dk/linux-block:
  nbd: prevent memory leak
  nvme: hwmon: add quirk to avoid changing temperature threshold
  nvme: hwmon: provide temperature min and max values for each sensor
  nvmet: add another maintainer
  nvme: Discard workaround for non-conformant devices
  nvme: Add hardware monitoring support
  scsi: sd_zbc: add zone open, close, and finish support
2019-11-25 11:18:03 -08:00
Linus Torvalds 2d53943090 for-5.5/drivers-20191121
Merge tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Here are the main block driver updates for 5.5. Nothing major in here,
  mostly just fixes. This contains:

   - a set of bcache changes via Coly

   - MD changes from Song

   - loop unmap write-zeroes fix (Darrick)

   - spelling fixes (Geert)

   - zoned additions and cleanups to null_blk/dm (Ajay)

   - allow null_blk online submit queue changes (Bart)

   - NVMe changes via Keith, nothing major here either"

* tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block: (56 commits)
  Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()"
  drivers/md/raid5-ppl.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
  drivers/md/raid5.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
  bcache: don't export symbols
  bcache: remove the extra cflags for request.o
  bcache: at least try to shrink 1 node in bch_mca_scan()
  bcache: add idle_max_writeback_rate sysfs interface
  bcache: add code comments in bch_btree_leaf_dirty()
  bcache: fix deadlock in bcache_allocator
  bcache: add code comment bch_keylist_pop() and bch_keylist_pop_front()
  bcache: deleted code comments for dead code in bch_data_insert_keys()
  bcache: add more accurate error messages in read_super()
  bcache: fix static checker warning in bcache_device_free()
  bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
  bcache: fix fifo index swapping condition in journal_pin_cmp()
  md/raid10: prevent access of uninitialized resync_pages offset
  md: avoid invalid memory access for array sb->dev_roles
  md/raid1: avoid soft lockup under high load
  null_blk: add zone open, close, and finish support
  dm: add zone open, close and finish support
  ...
2019-11-25 11:15:41 -08:00
Akinobu Mita 6c6aa2f26c nvme: hwmon: add quirk to avoid changing temperature threshold
This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing
the value of the temperature threshold feature for specific devices that
show undesirable behavior.

Guenter reported:

"On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the
Composite temperature sensor results in a temperature warning, and that
warning is sticky until I reset the controller.

It doesn't seem to matter which temperature I write; writing -273000 has
the same result."

The Intel NVMe has the latest firmware version installed, so this isn't
a problem that was ever fixed.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jean Delvare <jdelvare@suse.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-11-22 02:21:08 +09:00
Geert Uytterhoeven 05d3046ff7 nvme-pci: Spelling s/resdicovered/rediscovered/
Fix misspelling of "rediscovered".

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04 10:56:42 -07:00
Max Gurtovoy 16686f3a6c nvme: move common call to nvme_cleanup_cmd to core layer
nvme_cleanup_cmd should be called for each call to nvme_setup_cmd (the
two functions are symmetrical). Move the nvme_cleanup_cmd call to the
common core layer and invoke it from nvme_complete_rq on the success
path. On the error path, each transport still calls nvme_cleanup_cmd
itself. Also handle the special case of path failure, where
nvme_complete_rq is called without a prior nvme_setup_cmd.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04 10:56:41 -07:00
Israel Rukshin 58a8df67e0 nvme: introduce nvme_is_aen_req function
This function improves code readability and reduces code duplication.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04 10:56:40 -07:00
Kevin Hao a4f40484e7 nvme-pci: Set the prp2 correctly when using more than 4k page
The current code uses a fixed 4k PRP entry size. If the kernel uses a
page size larger than 4k, bv_offset may be larger than
dev->ctrl.page_size, and that case must be handled. Otherwise prp2 may
not be set, and the command cannot be executed correctly.
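
A user-space sketch of the idea, assuming a fixed 4k PRP entry size and a
DMA address that already includes bv_offset. The struct, the has_prp2
flag, and the function name are hypothetical and only illustrate the
offset masking:

```c
#include <stdbool.h>
#include <stdint.h>

#define CTRL_PAGE_SIZE 4096u	/* assumed fixed PRP entry size */

struct prp_pair {
	uint64_t prp1;
	uint64_t prp2;
	bool has_prp2;
};

struct prp_pair setup_prp_simple(uint64_t dma_addr, uint32_t bv_offset,
				 uint32_t bv_len)
{
	/*
	 * Reduce the offset to its position inside one PRP-sized page.
	 * Without the mask, CTRL_PAGE_SIZE - bv_offset wraps around when
	 * bv_offset exceeds 4k, and the length check below never fires.
	 */
	uint32_t offset = bv_offset & (CTRL_PAGE_SIZE - 1);
	uint32_t first_prp_len = CTRL_PAGE_SIZE - offset;
	struct prp_pair p = { .prp1 = dma_addr, .prp2 = 0, .has_prp2 = false };

	/* Data that spills past the first PRP page needs a second entry. */
	if (bv_len > first_prp_len) {
		p.prp2 = dma_addr + first_prp_len;
		p.has_prp2 = true;
	}
	return p;
}
```

For example, with bv_offset = 6144 and bv_len = 4096 the masked offset is
2048, so the first entry covers 2048 bytes and prp2 points at the next
4k boundary.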

Fixes: dff824b2aa ("nvme-pci: optimize mapping of small single segment requests")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-18 23:09:41 +09:00