Commit Graph

1849 Commits

Author SHA1 Message Date
Alexey Bogoslavsky ebd8a93aa4 nvme: extend and modify the APST configuration algorithm
The algorithm that was used until now for building the APST configuration
table has been found to produce entries with excessively long ITPT
(idle time prior to transition) for devices declaring relatively long
entry and exit latencies for non-operational power states. This leads
to unnecessary waste of power and, as a result, failure to pass
mandatory power consumption tests on Chromebook platforms.

The new algorithm is based on two predefined ITPT values and two
predefined latency tolerances. Based on these values, as well as on
exit and entry latencies reported by the device, the algorithm looks
for up to 2 suitable non-operational power states to use as primary
and secondary APST transition targets. The predefined values are
supplied to the nvme driver as module parameters:

 - apst_primary_timeout_ms (default: 100)
 - apst_secondary_timeout_ms (default: 2000)
 - apst_primary_latency_tol_us (default: 15000)
 - apst_secondary_latency_tol_us (default: 100000)

The algorithm echoes the approach used by Intel's and Microsoft's drivers
on Windows. The specific default parameter values are also based on those
drivers. Yet, this patch doesn't introduce the ability to dynamically
regenerate the APST table in the event of switching the power source from
AC to battery and back. Adding this functionality may be considered in the
future. In the meantime, the timeouts and tolerances reflect a compromise
between values used by Microsoft for AC and battery scenarios.

In most NVMe devices the new algorithm causes them to implement a more
aggressive power saving policy. While beneficial in most cases, this
sometimes comes at the price of a higher IO processing latency in certain
scenarios as well as at the price of a potential impact on the drive's
endurance (due to more frequent context saving when entering deep non-
operational states). So in order to provide a fallback for systems where
these regressions cannot be tolerated, the patch allows to revert to
the legacy behavior by setting either apst_primary_timeout_ms or
apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
fine tuning the default values of the module parameters) the legacy behavior
can be removed.

TESTING.

The new algorithm has been extensively tested. Initially, simulations were
used to compare APST tables generated by old and new algorithms for a wide
range of devices. After that, power consumption, performance and latencies
were measured under different workloads on devices from multiple vendors
(WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests
and the findings.

General observations.
The effect the patch has on the APST table varies depending on the entry and
exit latencies advertised by the devices. For some devices, the effect is
negligible (e.g. Kioxia KBG40ZNS), for some significant, making the
transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making
the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g.
SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed:
the initial transition happens after a longer idle time, but takes the device
to a lower power state.

Workflows.
In order to evaluate the patch's effect on the power consumption and latency,
7 workflows were used for each device. The workflows were designed to test
the scenarios where significant differences between the old and new behaviors
are most likely. Each workflow was tested twice: with the new and with the
old APST table generation implementation. Power consumption, performance and
latency were measured in the process. The following workflows were used:
1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
   idle time
3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
   idle time
4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
   idle time
5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
   idle time
6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
   idle time
7) Repeated pattern of a single random read of a 4K packet followed by 150ms
   idle time

Power consumption
Actual power consumption measurements produced predictable results in
accordance with the APST mechanism's theory of operation.
Devices with long entry and exit latencies such as WD SN530 showed huge
improvement on scenarios 4,5 and 6 of up to 62%. Devices such as Kioxia
KBG40ZNS where the resulting APST table looks virtually identical with
both legacy and new algorithms, showed little or no change in the average power
consumption on all workflows. Devices with extra short latencies such as
Samsung PM991 showed moderate increase in power consumption of up to 18% in
worst case scenarios.
In addition, on Intel and Samsung devices a more complex impact was observed
on scenarios 3, 4 and 7. Our understanding is that due to longer stay in deep
non-operational states between the writes the devices start performing background
operations leading to an increase of power consumption. With the old APST tables
part of these operations are delayed until the scenario is over and a longer idle
period begins, but eventually this extra power is consumed anyway.

Performance.
In terms of performance measured on sustained write or read scenarios, the
effect of the patch is minimal as in this case the device doesn't enter low power
states.

Latency
As expected, in devices where the patch causes a more aggressive power saving
policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
certain scenarios. Workflow number 7, specifically designed to simulate the
worst case scenario as far as latency is concerned, indeed shows a sharp
increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6 -> 10ms on
WD SN530). The latency increase on other workloads and other devices is much
milder or non-existent.

Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:24 +03:00
Colin Ian King 13ce7e625a nvme: remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never read,
it is being updated later on. The assignment is redundant and can be
removed.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:24 +03:00
James Smart a7d139145a nvme-fc: clear q_live at beginning of association teardown
The __nvmf_check_ready() routine used to bounce all filesystem io if the
controller state isn't LIVE.  However, a later patch changed the logic so
that it rejection ends up being based on the Q live check.  The FC
transport has a slightly different sequence from rdma and tcp for
shutting down queues/marking them non-live.  FC marks its queue non-live
after aborting all ios and waiting for their termination, leaving a
rather large window for filesystem io to continue to hit the transport.
Unfortunately this resulted in filesystem I/O or applications seeing I/O
errors.

Change the FC transport to mark the queues non-live at the first sign of
teardown for the association (when I/O is initially terminated).

Fixes: 73a5379937 ("nvme-fabrics: allow to queue requests for live queues")
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-19 08:40:24 +02:00
Keith Busch a0fdd14180 nvme-tcp: rerun io_work if req_list is not empty
A possible race condition exists where the request to send data is
enqueued from nvme_tcp_handle_r2t()'s will not be observed by
nvme_tcp_send_all() if it happens to be running. The driver relies on
io_work to send the enqueued request when it is runs again, but the
concurrently running nvme_tcp_send_all() may not have released the
send_mutex at that time. If no future commands are enqueued to re-kick
the io_work, the request will timeout in the SEND_H2C state, resulting
in a timeout error like:

  nvme nvme0: queue 1: timeout request 0x3 type 6

Ensure the io_work continues to run as long as the req_list is not empty.

Fixes: db5ad6b7f8 ("nvme-tcp: try to send request in queue_rq context")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-19 08:33:42 +02:00
Sagi Grimberg 825619b09a nvme-tcp: fix possible use-after-completion
Commit db5ad6b7f8 ("nvme-tcp: try to send request in queue_rq
context") added a second context that may perform a network send.
This means that now RX and TX are not serialized in nvme_tcp_io_work
and can run concurrently.

While there is correct mutual exclusion in the TX path (where
the send_mutex protect the queue socket send activity) RX activity,
and more specifically request completion may run concurrently.

This means we must guarantee that any mutation of the request state
related to its lifetime, bytes sent must not be accessed when a completion
may have possibly arrived back (and processed).

The race may trigger when a request completion arrives, processed
_and_ reused as a fresh new request, exactly in the (relatively short)
window between the last data payload sent and before the request iov_iter
is advanced.

Consider the following race:
1. 16K write request is queued
2. The nvme command and the data is sent to the controller (in-capsule
   or solicited by r2t)
3. After the last payload is sent but before the req.iter is advanced,
   the controller sends back a completion.
4. The completion is processed, the request is completed, and reused
   to transfer a new request (write or read)
5. The new request is queued, and the driver reset the request parameters
   (nvme_tcp_setup_cmd_pdu).
6. Now context in (2) resumes execution and advances the req.iter

==> use-after-completion as this is already a new request.

Fix this by making sure the request is not advanced after the last
data payload send, knowing that a completion may have arrived already.

An alternative solution would have been to delay the request completion
or state change waiting for reference counting on the TX path, but besides
adding atomic operations to the hot-path, it may present challenges in
multi-stage R2T scenarios where a r2t handler needs to be deferred to
an async execution.

Reported-by: Narayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com>
Tested-by: Anil Mishra <anil.mishra@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-19 08:33:42 +02:00
Hou Pu e181811bd0 nvmet: use new ana_log_size instead the old one
The new ana_log_size should be used instead of the old one.
Or kernel NULL pointer dereference will happen like below:

[   38.957849][   T69] BUG: kernel NULL pointer dereference, address: 000000000000003c
[   38.975550][   T69] #PF: supervisor write access in kernel mode
[   38.975955][   T69] #PF: error_code(0x0002) - not-present page
[   38.976905][   T69] PGD 0 P4D 0
[   38.979388][   T69] Oops: 0002 [#1] SMP NOPTI
[   38.980488][   T69] CPU: 0 PID: 69 Comm: kworker/0:2 Not tainted 5.12.0+ #54
[   38.981254][   T69] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   38.982502][   T69] Workqueue: events nvme_loop_execute_work
[   38.985219][   T69] RIP: 0010:memcpy_orig+0x68/0x10f
[   38.986203][   T69] Code: 83 c2 20 eb 44 48 01 d6 48 01 d7 48 83 ea 20 0f 1f 00 48 83 ea 20 4c 8b 46 f8 4c 8b 4e f0 4c 8b 56 e8 4c 8b 5e e0 48 8d 76 e0 <4c> 89 47 f8 4c 89 4f f0 4c 89 57 e8 4c 89 5f e0 48 8d 7f e0 73 d2
[   38.987677][   T69] RSP: 0018:ffffc900001b7d48 EFLAGS: 00000287
[   38.987996][   T69] RAX: 0000000000000020 RBX: 0000000000000024 RCX: 0000000000000010
[   38.988327][   T69] RDX: ffffffffffffffe4 RSI: ffff8881084bc004 RDI: 0000000000000044
[   38.988620][   T69] RBP: 0000000000000024 R08: 0000000100000000 R09: 0000000000000000
[   38.988991][   T69] R10: 0000000100000000 R11: 0000000000000001 R12: 0000000000000024
[   38.989289][   T69] R13: ffff8881084bc000 R14: 0000000000000000 R15: 0000000000000024
[   38.989845][   T69] FS:  0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
[   38.990234][   T69] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   38.990490][   T69] CR2: 000000000000003c CR3: 00000001085b2000 CR4: 00000000000006f0
[   38.991105][   T69] Call Trace:
[   38.994157][   T69]  sg_copy_buffer+0xb8/0xf0
[   38.995357][   T69]  nvmet_copy_to_sgl+0x48/0x6d
[   38.995565][   T69]  nvmet_execute_get_log_page_ana+0xd4/0x1cb
[   38.995792][   T69]  nvmet_execute_get_log_page+0xc9/0x146
[   38.995992][   T69]  nvme_loop_execute_work+0x3e/0x44
[   38.996181][   T69]  process_one_work+0x1c3/0x3c0
[   38.996393][   T69]  worker_thread+0x44/0x3d0
[   38.996600][   T69]  ? cancel_delayed_work+0x90/0x90
[   38.996804][   T69]  kthread+0xf7/0x130
[   38.996961][   T69]  ? kthread_create_worker_on_cpu+0x70/0x70
[   38.997171][   T69]  ret_from_fork+0x22/0x30
[   38.997705][   T69] Modules linked in:
[   38.998741][   T69] CR2: 000000000000003c
[   39.000104][   T69] ---[ end trace e719927b609d0fa0 ]---

Fixes: 5e1f689913 ("nvme-multipath: fix double initialization of ANA state")
Signed-off-by: Hou Pu <houpu.main@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-13 16:33:32 +02:00
Christoph Hellwig 5e1f689913 nvme-multipath: fix double initialization of ANA state
nvme_init_identify and thus nvme_mpath_init can be called multiple
times and thus must not overwrite potentially initialized or in-use
fields.  Split out a helper for the basic initialization when the
controller is initialized and make sure the init_identify path does
not blindly change in-use data structures.

Fixes: 0d0b660f21 ("nvme: add ANA support")
Reported-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2021-05-11 18:30:45 +02:00
Daniel Wagner ce86dad222 nvme-multipath: reset bdev to ns head when failover
When a request finally completes in end_io() after it has failed over,
the bdev pointer can be stale and thus the system can crash. Set the
bdev back to ns head, so the request is map to an active path when
resubmitted.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-04 09:39:23 +02:00
Tao Chiu d4060d2be1 nvme-pci: fix controller reset hang when racing with nvme_timeout
reset_work() in nvme-pci may hang forever in the following scenario:
1) A reset caused by a command timeout occurs due to a controller being
   temporarily irresponsive.
2) nvme_reset_work() restarts admin queue at nvme_alloc_admin_tags(). At
   the same time, a user-submitted admin command is queued and waiting
   for completion. Then, reset_work() changes its state to CONNECTING,
   and submits an identify command.
3) However, the controller does still not respond to any command,
   causing a timeout being fired at the user-submitted command.
   Unfortunately, nvme_timeout() does not see the completion on cq, and
   any timeout that takes place under CONNECTING state causes a
   controller shutdown.
4) Normally, the identify command in reset_work() would be canceled with
   SC_HOST_ABORTED by nvme_dev_disable(), then reset_work can tear down
   the controller accordingly. But the controller happens to return
   online and respond the identify command before nvme_dev_disable()
   should have been reaped it off.
5) reset_work() continues to setup_io_queues() as it observes no error
   in init_identify(). However, the admin queue has already been
   quiesced in dev_disable(). Thus, any following commands would be
   blocked forever in blk_execute_rq().

This can be fixed by restricting usercmd commands when controller is not
in a LIVE state in nvme_queue_rq(), as what has been done previously in
fabrics.

```
nvme_reset_work():                     |
    nvme_alloc_admin_tags()            |
                                       | nvme_submit_user_cmd():
    nvme_init_identify():              |     ...
        __nvme_submit_sync_cmd():      |
            ...                        |     ...
---------------------------------------> nvme_timeout():
(Controller starts reponding commands) |     nvme_dev_disable(, true):
    nvme_setup_io_queues():            |
        __nvme_submit_sync_cmd():      |
            (hung in blk_execute_rq    |
             since run_hw_queue sees   |
             queue quiesced)           |

```

Signed-off-by: Tao Chiu <taochiu@synology.com>
Signed-off-by: Cody Wong <codywong@synology.com>
Reviewed-by: Leon Chien <leonchien@synology.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-04 09:35:50 +02:00
Tao Chiu a97157440e nvme: move the fabrics queue ready check routines to core
queue_rq() in pci only checks if the dispatched queue (nvmeq) is ready,
e.g. not being suspended. Since nvme_alloc_admin_tags() in reset flow
restarts the admin queue, users are able to submit admin commands to a
controller before reset_work() completes. Commands submitted under this
condition may interfere with commands that performs identify, IO queue
setup in reset_work(), and may result in a hang described in the
following patch.

As seen in the fabrics, user commands are prevented from being executed
under inproper controller states. We may reuse this logic to maintain a
clear admin queue during reset_work().

Signed-off-by: Tao Chiu <taochiu@synology.com>
Signed-off-by: Cody Wong <codywong@synology.com>
Reviewed-by: Leon Chien <leonchien@synology.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-04 09:35:49 +02:00
Kanchan Joshi 51ad06cd69 nvme: avoid memset for passthrough requests
nvme_clear_nvme_request() clears the nvme_command, which is unncessary
for passthrough requests as nvme_command is overwritten immediately.
Move clearing part from this helper to the caller, so that double memset
for passthrough requests is avoided.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-04 09:35:49 +02:00
Kanchan Joshi 4c74d1f803 nvme: add nvme_get_ns helper
Add a helper to avoid opencoding ns->kref increment.
Decrement is already done via nvme_put_ns helper.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-04 09:35:48 +02:00
Minwoo Im 48145b6256 nvme: fix controller ioctl through ns_head
In multipath case, we should consider namespace attachment with
controllers in a subsystem when we find out the live controller for the
namespace.  This patch manually reverted the commit 3557a44097
("nvme: don't bother to look up a namespace for controller ioctls") with
few more updates to nvme_ns_head_chr_ioctl which has been newly updated.

Fixes: 3557a44097 ("nvme: don't bother to look up a namespace for
controller ioctls")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-05-04 09:35:47 +02:00
Linus Torvalds fc05860628 for-5.13/drivers-2021-04-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJYcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpieWD/92qbtWl/z+9oCY212xV+YMoMqj/vGROX+U
 9i/FQJ3AIC/AUoNjZeW3NIbiaNqde5mrLlUSCHgn6RLsHK7p0GQJ4ohpbIGFG5+i
 2+Efm+vjlCxLVGrkeZEwMtsht7w/NbOYDr1Rgv9b4lQ6iWI11Mg8E337Whl1me1k
 h6bEXaioK9yqxYtsLgcn9I1qQ2p7gok0HX7zFU/XxEUZylqH6E4vQhj2+NL8UUqE
 7siFHADZE99Z7LXtOkl8YyOlGU52RCUzqDHWydvkipKjgYBi95HLXGT64Z+WCEvz
 HI54oVDRWr+uWdqDFfy+ncHm8pNeP0GV9JPhDz4ELRTSndoxB2il7wRLvp6wxV9d
 8Y4j7vb30i+8GGbM0c79dnlG76D9r5ivbTKixcXFKB128NusQR6JymIv1pKlSKhk
 H871/iOarrepAAUwVR5CtldDDJCy/q1Hks+7UXbaM3F9iNitxsJNZryQq9xdTu/N
 ThFOTz+VECG4RJLxIwmsWGiLgwr52/ybAl2MBcn+s7uC4jM/TFKpdQBfQnOAiINb
 MLlfuYRRSMg1Osb2fYZneR2ifmSNOMRdDJb+tsZGz4xWmZcj0uL4QgqcsOvuiOEQ
 veF/Ky50qw57hWtiEhvqa7/WIxzNF3G3wejqqA8hpT9Qifu0QawYTnXGUttYNBB1
 mO9R3/ccaw==
 =c0x4
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - MD changes via Song:
        - raid5 POWER fix
        - raid1 failure fix
        - UAF fix for md cluster
        - mddev_find_or_alloc() clean up
        - Fix NULL pointer deref with external bitmap
        - Performance improvement for raid10 discard requests
        - Fix missing information of /proc/mdstat

 - rsxx const qualifier removal (Arnd)

 - Expose allocated brd pages (Calvin)

 - rnbd via Gioh Kim:
        - Change maintainer
        - Change domain address of maintainers' email
        - Add polling IO mode and document update
        - Fix memory leak and some bug detected by static code analysis
          tools
        - Code refactoring

 - Series of floppy cleanups/fixes (Denis)

 - s390 dasd fixes (Julian)

 - kerneldoc fixes (Lee)

 - null_blk double free (Lv)

 - null_blk virtual boundary addition (Max)

 - Remove xsysace driver (Michal)

 - umem driver removal (Davidlohr)

 - ataflop fixes (Dan)

 - Revalidate disk removal (Christoph)

 - Bounce buffer cleanups (Christoph)

 - Mark lightnvm as deprecated (Christoph)

 - mtip32xx init cleanups (Shixin)

 - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang)

* tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block: (143 commits)
  async_xor: increase src_offs when dropping destination page
  drivers/block/null_blk/main: Fix a double free in null_init.
  md/raid1: properly indicate failure when ending a failed write request
  md-cluster: fix use-after-free issue when removing rdev
  nvme: introduce generic per-namespace chardev
  nvme: cleanup nvme_configure_apst
  nvme: do not try to reconfigure APST when the controller is not live
  nvme: add 'kato' sysfs attribute
  nvme: sanitize KATO setting
  nvmet: avoid queuing keep-alive timer if it is disabled
  brd: expose number of allocated pages in debugfs
  ataflop: fix off by one in ataflop_probe()
  ataflop: potential out of bounds in do_format()
  drbd: Fix fall-through warnings for Clang
  block/rnbd: Use strscpy instead of strlcpy
  block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name
  block/rnbd-clt: Remove max_segment_size
  block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes
  block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev
  Documentation/ABI/rnbd-clt: Add description for nr_poll_queues
  ...
2021-04-28 14:39:37 -07:00
Linus Torvalds 6c00292113 for-5.13/block-2021-04-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJW0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr8sD/4qP+MsFTB1IFUu8fW7BjBPdduoK8Vq9o3S
 HB8iF/yhJZ73nLecMMdn/jTO8SCW0Iw+okywW3BugGnNPbwXo0UQ4jLhzbTts76P
 JvZaguZFhBsF3ceFOt3CRCQDOeoDfMp3sitLUVivkN+2vwMs9vJpVNaEeUjcCC1Z
 8QjlpqYSMuakTwEn7QhlnKxVWn1V2B6PDjZMcf48ONRZGsCkoOXH1SE4Ge8nxjqa
 KHKO5bvwgRzGhKpvdHEIl8dmFL9WEWElBVoY3vE2EHL0SPE32zHlxtYLS0NAhY2M
 aprkJ0QP0Rgl8HpYiCstwAnJGKDg4a0ArWhf/CJTuLAWmTNFR7v5n7vw2SilJHTG
 0FtiFiOnpvvBmUC0B1PUEQX8AiFcdXueLb6xboExcp2WtxIAe8wPoGFl6T1tobBY
 qsfWggGs/vD1RVrJISPC+20cJemcRyeakMV48w+n3Lt/ES3IEv/LXx6PO/PbXvOo
 B7HJXTofkoaX52A/1+NxraGapwzhYouhi6Sb6Fc++X59/a/oBuOUGuur0eZ+/oWA
 9787mUUDmW/sahfZUgZh5AxqKo2jJULjeggANCICW9/RN6duV8TBQVOLW1/0Wddp
 9lndiA9ZMveWF+J19+sjBoiYMYawLmURaOlDK77ctTCcR/ji3l4GZ+2KvBEMeIT8
 O1OYEnwaIQ==
 =oza6
 -----END PGP SIGNATURE-----

Merge tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Pretty quiet round this time, which is nice. In detail:

   - Series revamping bounce buffer support (Christoph)

   - Dead code removal (Christoph, Bart)

   - Partition iteration revamp, now using xarray (Christoph)

   - Passthrough request scheduler improvements (Lin)

   - Series of BFQ improvements (Paolo)

   - Fix ioprio task iteration (Peter)

   - Various little tweaks and fixes (Tejun, Saravanan, Bhaskar, Max,
     Nikolay)"

* tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block: (41 commits)
  blk-iocost: don't ignore vrate_min on QD contention
  blk-mq: Fix spurious debugfs directory creation during initialization
  bfq/mq-deadline: remove redundant check for passthrough request
  blk-mq: bypass IO scheduler's limit_depth for passthrough request
  block: Remove an obsolete comment from sg_io()
  block: move bio_list_copy_data to pktcdvd
  block: remove zero_fill_bio_iter
  block: add queue_to_disk() to get gendisk from request_queue
  block: remove an incorrect check from blk_rq_append_bio
  block: initialize ret in bdev_disk_changed
  block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration
  block: remove disk_part_iter
  block: simplify diskstats_show
  block: simplify show_partition
  block: simplify printk_all_partitions
  block: simplify partition_overlaps
  block: simplify partition removal
  block: take bd_mutex around delete_partitions in del_gendisk
  block: refactor blk_drop_partitions
  block: move more syncing and invalidation to delete_partition
  ...
2021-04-28 14:27:12 -07:00
Minwoo Im 2637baed78 nvme: introduce generic per-namespace chardev
Userspace has not been allowed to I/O to device that's failed to
be initialized.  This patch introduces generic per-namespace character
device to allow userspace to I/O regardless the block device is there or
not.

The chardev naming convention will similar to the existing blkdev naming,
using a ng prefix instead of nvme, i.e.

	- /dev/ngXnY

It also supports multipath which means it will not expose chardev for the
hidden namespace blkdevs (e.g., nvmeXcYnZ).  If /dev/ngXnY is created for
a ns_head, then I/O request will be routed to a specific controller
selected by the iopolicy of the subsystem.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-22 07:25:17 +02:00
Christoph Hellwig 60df5de9b0 nvme: cleanup nvme_configure_apst
Remove a level of indentation from the main code implementating the table
search by using a goto for the APST not supported case.  Also move the
main comment above the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
2021-04-21 19:13:16 +02:00
Christoph Hellwig 53fe2a30bc nvme: do not try to reconfigure APST when the controller is not live
Do not call nvme_configure_apst when the controller is not live, given
that nvme_configure_apst will fail due the lack of an admin queue when
the controller is being torn down and nvme_set_latency_tolerance is
called from dev_pm_qos_hide_latency_tolerance.

Fixes: 510a405d945b("nvme: fix memory leak for power latency tolerance")
Reported-by: Peng Liu <liupeng17@lenovo.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-04-21 19:13:16 +02:00
Hannes Reinecke 74c22990f0 nvme: add 'kato' sysfs attribute
Add a 'kato' controller sysfs attribute to display the current
keep-alive timeout value (if any). This allows userspace to identify
persistent discovery controllers, as these will have a non-zero
KATO value.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-21 19:13:15 +02:00
Hannes Reinecke a70b81bd4d nvme: sanitize KATO setting
According to the NVMe base spec the KATO commands should be sent
at half of the KATO interval, to properly account for round-trip
times.
As we now will only ever send one KATO command per connection we
can easily use the recommended values.
This also fixes a potential issue where the request timeout for
the KATO command does not match the value in the connect command,
which might be causing spurious connection drops from the target.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-21 19:13:15 +02:00
Gopal Tiwari d6609084b0 nvme: fix NULL derefence in nvme_ctrl_fast_io_fail_tmo_show/store
Adding entry for dev_attr_fast_io_fail_tmo to avoid the kernel crash
while reading and writing the fast_io_fail_tmo.

Fixes: 09fbed6363 (nvme: export fast_io_fail_tmo to sysfs)
Signed-off-by: Gopal Tiwari <gtiwari@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-15 08:12:56 +02:00
Christoph Hellwig a9e0e6bc72 nvme: let namespace probing continue for unsupported features
Instead of failing to scan the namespace entirely when unsupported
features are detected, just mark the gendisk hidden but allow other
access like the upcoming per-namespace character device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
2021-04-15 08:12:56 +02:00
Christoph Hellwig f5b9a51db2 nvme: factor out nvme_ns_open and nvme_ns_release helpers
These will be reused for the per-namespace character devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2021-04-15 08:12:55 +02:00
Christoph Hellwig 1496bd4936 nvme: move nvme_ns_head_ops to multipath.c
Move the multipath block_device_operations to multipath.c, where they
belong.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2021-04-15 08:12:55 +02:00
Christoph Hellwig 871ca3ef13 nvme: factor out a nvme_tryget_ns_head helper
Add a helper to avoid opencoding ns_head->ref manipulations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2021-04-15 08:12:55 +02:00
Christoph Hellwig 2405252a68 nvme: move the ioctl code to a separate file
Split out the ioctl code from core.c into a new file.  Also update
copyrights while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2021-04-15 08:12:55 +02:00
Christoph Hellwig 3557a44097 nvme: don't bother to look up a namespace for controller ioctls
Don't bother to look up a namespace just to drop if after retreiving the
controller for the multipath case.  Just look up a live controller for
the subsystem directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
2021-04-15 08:12:55 +02:00
Christoph Hellwig 2f907f7f96 nvme: simplify block device ioctl handling for the !multipath case
Only use the existing ioctl handler for the multipath case, and add a
simpler one that reverts to the pre-multipath case for not shared
use case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
2021-04-15 08:12:55 +02:00
Christoph Hellwig 89b3d6e605 nvme: simplify the compat ioctl handling
Don't bother defining a separate compat_ioctl handler, and just handle
the NVME_IOCTL_SUBMIT_IO32 case inline.  Also only defined it for those
ABIs (currently just i386 vs x86_64) that are affected.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
2021-04-15 08:12:55 +02:00
Christoph Hellwig a5d737f100 nvme: factor out a nvme_ns_ioctl helper
Factor out a helper for the namespace based ioctls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2021-04-15 08:12:54 +02:00
Christoph Hellwig d7790d3739 nvme: pass a user pointer to nvme_nvm_ioctl
Pass the proper user pointer instead of the not all that useful integer
representation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
2021-04-15 08:12:54 +02:00
Christoph Hellwig 9953ab0c5a nvme: cleanup setting the disk name
Return false from nvme_set_disk_name and let the caller set the
non-multipath name instead of duplicating the naming information in two
places.  Also remove the pointless local variables for the disk name
and flags and the not needed ctrl argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
2021-04-15 08:12:54 +02:00
Minwoo Im 3089738868 nvme: add a nvme_ns_head_multipath helper
Move the multipath gendisk out of #ifdef CONFIG_NVME_MULTIPATH and add
a new nvme_ns_head_multipath that uses it to check if a ns_head has
a multipath device associated with it.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
[hch: added the IS_ENABLED, converted a few existing users]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2021-04-15 08:12:54 +02:00
Niklas Cassel 95d54bd1a4 nvme: remove single trailing whitespace
There is a single trailing whitespace in core.c.
Since this is just a single whitespace, the chances of this affecting
backports to stable should be quite low, so let's just remove it.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-15 08:12:54 +02:00
Niklas Cassel e234f1f8bb nvme-multipath: remove single trailing whitespace
There is a single trailing whitespace in multipath.c.
Since this is just a single whitespace, the chances of this affecting
backports to stable should be quite low, so let's just remove it.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-15 08:12:54 +02:00
Niklas Cassel 53dc180e7c nvme-pci: remove single trailing whitespace
There is a single trailing whitespace in pci.c.
Since this is just a single whitespace, the chances of this affecting
backports to stable should be quite low, so let's just remove it.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-15 08:12:54 +02:00
Niklas Cassel e51183be1f nvme-pci: don't simple map sgl when sgls are disabled
According to the module parameter description for sgl_threshold,
a value of 0 means that SGLs are disabled.

If SGLs are disabled, we should respect that, even for the case
where the request is made up of a single physical segment.

Fixes: 297910571f ("nvme-pci: optimize mapping single segment requests using SGLs")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-15 08:12:53 +02:00
Chaitanya Kulkarni 327e1d2957 lightnvm: use kobj_to_dev()
This fixs coccicheck warning:

drivers/nvme//host/lightnvm.c:1243:60-61: WARNING opportunity for
kobj_to_dev()

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Link: https://lore.kernel.org/r/20210413105257.159260-2-matias.bjorling@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-13 09:16:12 -06:00
Sami Tolvanen 4f0f586bf0 treewide: Change list_sort to use const pointers
list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.

Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.

Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
2021-04-08 16:04:22 -07:00
Christoph Hellwig 393bb12e00 block: stop calling blk_queue_bounce for passthrough requests
Instead of overloading the passthrough fast path with the deprecated
block layer bounce buffering let the users that combine an old
undermaintained driver with a highmem system pay the price by always
falling back to copies in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210331073001.46776-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-06 09:28:18 -06:00
Bart Van Assche 8609c63fce nvme: fix handling of large MDTS values
Instead of triggering an integer overflow and undefined behavior if MDTS is
large, set max_hw_sectors to UINT_MAX.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
[hch: rebased to account for the new nvme_mps_to_sectors helper]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-06 08:34:39 +02:00
Keith Busch 5befc7c26e nvme: implement non-mdts command limits
Commands that access LBA contents without a data transfer between the
host historically have not had a spec defined upper limit. The driver
set the queue constraints for such commands to the max data transfer
size just to be safe, but this artificial constraint frequently limits
devices below their capabilities.

The NVMe Workgroup ratified TP4040 defines how a controller may
advertise their non-MDTS limits. Use these if provided and default to
the current constraints if not. Since the Dataset Management command
limits are defined in logical blocks, but without a namespace to tell us
the logical block size, the code defaults to the safe 512b size.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-06 08:34:39 +02:00
Niklas Cassel c881a23fb6 nvme: disallow passthru cmd from targeting a nsid != nsid of the block dev
When a passthru command targets a specific namespace, the ns parameter to
nvme_user_cmd()/nvme_user_cmd64() is set. However, there is currently no
validation that the nsid specified in the passthru command targets the
namespace/nsid represented by the block device that the ioctl was
performed on.

Add a check that validates that the nsid in the passthru command matches
that of the supplied namespace.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-06 08:34:38 +02:00
Hannes Reinecke dd8f7fa908 nvme: retrigger ANA log update if group descriptor isn't found
If ANA is enabled but no ANA group descriptor is found when creating
a new namespace the ANA log is most likely out of date, so trigger
a re-read. The namespace will be tagged with the NS_ANA_PENDING flag
to exclude it from path selection until the ANA log has been re-read.

Fixes: 32acab3181 ("nvme: implement multipath access to nvme subsystems")
Reported-by: Martin George <marting@netapp.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-06 08:34:38 +02:00
Daniel Wagner 09fbed6363 nvme: export fast_io_fail_tmo to sysfs
Commit 8c4dfea97f ("nvme-fabrics: reject I/O to offline device")
introduced fast_io_fail_tmo but didn't export the value to sysfs. The
value can be set during the 'nvme connect'. Export the timeout value
to user space via sysfs to allow runtime configuration.

Cc: Victor Gladkov <Victor.Gladkov@kioxia.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Himanshu Madhani <himanshu.madhaani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-02 18:48:29 +02:00
Daniel Wagner 25a64e4e7e nvme: remove superfluous else in nvme_ctrl_loss_tmo_store
If there is an error we will leave the function early. So there
is no need for an else. Remove it.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-02 18:48:29 +02:00
Daniel Wagner bff4bcf3cf nvme: use sysfs_emit instead of sprintf
sysfs_emit is the recommended API to use for formatting strings to be
returned to user space. It is equivalent to scnprintf and aware of the
PAGE_SIZE buffer size.

Suggested-by: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-02 18:48:29 +02:00
Max Gurtovoy 8df1bff57c nvme-fc: check sgl supported by target
SGLs support is mandatory for NVMe/FC, make sure that the target is
aligned to the specification.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-02 18:48:29 +02:00
Max Gurtovoy 73ffcefcfc nvme-tcp: check sgl supported by target
SGLs support is mandatory for NVMe/tcp, make sure that the target is
aligned to the specification.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-02 18:48:28 +02:00
Sagi Grimberg 8b73b45d54 nvme-tcp: block BH in sk state_change sk callback
The TCP stack can run from process context for a long time
so we should disable BH here.

Fixes: 3f2304f8c6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-04-02 18:48:28 +02:00