OpenCloudOS-Kernel/drivers
Alexey Bogoslavsky ebd8a93aa4 nvme: extend and modify the APST configuration algorithm
The algorithm that was used until now for building the APST configuration
table has been found to produce entries with excessively long ITPT
(idle time prior to transition) for devices declaring relatively long
entry and exit latencies for non-operational power states. This leads
to unnecessary waste of power and, as a result, failure to pass
mandatory power consumption tests on Chromebook platforms.

The new algorithm is based on two predefined ITPT values and two
predefined latency tolerances. Based on these values, as well as on
exit and entry latencies reported by the device, the algorithm looks
for up to 2 suitable non-operational power states to use as primary
and secondary APST transition targets. The predefined values are
supplied to the nvme driver as module parameters:

 - apst_primary_timeout_ms (default: 100)
 - apst_secondary_timeout_ms (default: 2000)
 - apst_primary_latency_tol_us (default: 15000)
 - apst_secondary_latency_tol_us (default: 100000)

The algorithm echoes the approach used by Intel's and Microsoft's drivers
on Windows. The specific default parameter values are also based on those
drivers. Yet, this patch doesn't introduce the ability to dynamically
regenerate the APST table in the event of switching the power source from
AC to battery and back. Adding this functionality may be considered in the
future. In the meantime, the timeouts and tolerances reflect a compromise
between values used by Microsoft for AC and battery scenarios.

In most NVMe devices the new algorithm causes them to implement a more
aggressive power saving policy. While beneficial in most cases, this
sometimes comes at the price of a higher IO processing latency in certain
scenarios as well as at the price of a potential impact on the drive's
endurance (due to more frequent context saving when entering deep non-
operational states). So in order to provide a fallback for systems where
these regressions cannot be tolerated, the patch allows to revert to
the legacy behavior by setting either apst_primary_timeout_ms or
apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
fine tuning the default values of the module parameters) the legacy behavior
can be removed.

TESTING.

The new algorithm has been extensively tested. Initially, simulations were
used to compare APST tables generated by old and new algorithms for a wide
range of devices. After that, power consumption, performance and latencies
were measured under different workloads on devices from multiple vendors
(WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests
and the findings.

General observations.
The effect the patch has on the APST table varies depending on the entry and
exit latencies advertised by the devices. For some devices, the effect is
negligible (e.g. Kioxia KBG40ZNS), for some significant, making the
transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making
the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g.
SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed:
the initial transition happens after a longer idle time, but takes the device
to a lower power state.

Workflows.
In order to evaluate the patch's effect on the power consumption and latency,
7 workflows were used for each device. The workflows were designed to test
the scenarios where significant differences between the old and new behaviors
are most likely. Each workflow was tested twice: with the new and with the
old APST table generation implementation. Power consumption, performance and
latency were measured in the process. The following workflows were used:
1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
   idle time
3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
   idle time
4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
   idle time
5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
   idle time
6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
   idle time
7) Repeated pattern of a single random read of a 4K packet followed by 150ms
   idle time

Power consumption
Actual power consumption measurements produced predictable results in
accordance with the APST mechanism's theory of operation.
Devices with long entry and exit latencies such as WD SN530 showed huge
improvement on scenarios 4,5 and 6 of up to 62%. Devices such as Kioxia
KBG40ZNS where the resulting APST table looks virtually identical with
both legacy and new algorithms, showed little or no change in the average power
consumption on all workflows. Devices with extra short latencies such as
Samsung PM991 showed moderate increase in power consumption of up to 18% in
worst case scenarios.
In addition, on Intel and Samsung devices a more complex impact was observed
on scenarios 3, 4 and 7. Our understanding is that due to longer stay in deep
non-operational states between the writes the devices start performing background
operations leading to an increase of power consumption. With the old APST tables
part of these operations are delayed until the scenario is over and a longer idle
period begins, but eventually this extra power is consumed anyway.

Performance.
In terms of performance measured on sustained write or read scenarios, the
effect of the patch is minimal as in this case the device doesn't enter low power
states.

Latency
As expected, in devices where the patch causes a more aggressive power saving
policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
certain scenarios. Workflow number 7, specifically designed to simulate the
worst case scenario as far as latency is concerned, indeed shows a sharp
increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6 -> 10ms on
WD SN530). The latency increase on other workloads and other devices is much
milder or non-existent.

Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:24 +03:00
..
accessibility TTY/Serial driver updates for 5.13-rc1 2021-04-26 11:20:10 -07:00
acpi libnvdimm fixes for 5.13-rc2 2021-05-15 08:32:51 -07:00
amba
android binder: Return EFAULT if we fail BINDER_ENABLE_ONEWAY_SPAM_DETECTION 2021-05-13 20:35:26 +02:00
ata pci-v5.13-changes 2021-05-05 13:24:11 -07:00
atm atm: firestream: Use fallthrough pseudo-keyword 2021-05-07 16:01:08 -07:00
auxdisplay treewide: remove editor modelines and cruft 2021-05-07 00:26:34 -07:00
base Driver core fixes for 5.13-rc2 2021-05-16 10:13:14 -07:00
bcma bcma: remove unused function 2021-04-18 09:36:56 +03:00
block rsxx: Use struct_size() in vmalloc() 2021-05-24 06:47:30 -06:00
bluetooth Networking changes for 5.13. 2021-04-29 11:57:23 -07:00
bus ARM: SoC drivers for v5.13 2021-04-26 12:11:52 -07:00
cdrom cdrom: gdrom: initialize global variable at init time 2021-05-13 18:58:44 +02:00
char Char/misc driver fixes for 5.13-rc3 2021-05-20 06:31:52 -10:00
clk clk: Skip clk provider registration when np is NULL 2021-05-11 08:47:25 +02:00
clocksource clocksource/drivers/hyper-v: Re-enable VDSO_CLOCKMODE_HVCLOCK on X86 2021-05-14 14:55:13 +02:00
comedi staging: comedi: move out of staging directory 2021-04-15 09:26:25 +02:00
connector
counter
cpufreq Fix an idle CPU selection bug, and an AMD Ryzen maximum frequency enumeration bug. 2021-05-15 10:24:48 -07:00
cpuidle
crypto Revert "crypto: cavium/nitrox - add an error message to explain the failure of pci_request_mem_regions" 2021-05-13 17:23:05 +02:00
cxl cxl/mem: Fix memory device capacity probing 2021-04-16 18:21:56 -07:00
dax
dca
devfreq
dio
dma dmaengine: qcom_hidma: comment platform_driver_register call 2021-05-13 18:32:29 +02:00
dma-buf dma-buf: fix unintended pin/unpin warnings 2021-05-20 14:02:27 +02:00
edac x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG 2021-05-10 07:51:38 +02:00
eisa
extcon - Core Frameworks 2021-04-28 15:59:13 -07:00
firewire The usual updates from the irq departement: 2021-04-26 09:43:16 -07:00
firmware ARM: SoC fixes for 5.13 2021-05-20 14:46:26 -10:00
fpga ARM: SoC drivers for v5.13 2021-04-26 12:11:52 -07:00
fsi
gnss
gpio gpio: tegra186: Don't set parent IRQ affinity 2021-05-12 13:56:43 +02:00
gpu drm fixes for 5.13-rc3 2021-05-20 20:15:43 -10:00
greybus greybus: es2: fix kernel-doc warnings 2021-04-16 07:26:50 +02:00
hid Merge branch 'for-5.13/warnings' into for-linus 2021-04-29 21:47:22 +02:00
hsi HSI: core: fix resource leaks in hsi_add_client_from_dt() 2021-04-16 00:14:49 +02:00
hv printk changes for 5.13 2021-04-27 18:09:44 -07:00
hwmon Char/misc driver fixes for 5.13-rc3 2021-05-20 06:31:52 -10:00
hwspinlock
hwtracing ARM: 2021-05-01 10:14:08 -07:00
i2c Merge branch 'i2c/for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2021-04-30 13:01:02 -07:00
i3c Revert "i3c master: fix missing destroy_workqueue() on error in i3c_master_register" 2021-04-24 22:21:01 +02:00
ide
idle
iio iio: tsl2583: Fix division by a zero lux_val 2021-05-10 14:01:48 +01:00
infiniband RDMA/uverbs: Fix a NULL vs IS_ERR() bug 2021-05-19 15:32:07 -03:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2021-05-06 23:37:55 -07:00
interconnect CFI on arm64 series for v5.13-rc1 2021-04-27 10:16:46 -07:00
iommu pci-v5.13-changes 2021-05-05 13:24:11 -07:00
ipack
irqchip irqchip: Remove redundant error printing 2021-05-16 13:07:18 +01:00
isdn isdn: mISDN: correctly handle ph_info allocation failure in hfcsusb_ph_info 2021-05-13 18:32:23 +02:00
leds leds: lp5523: check return value of lp5xx_read and jump to cleanup code 2021-05-13 17:30:15 +02:00
lightnvm
macintosh macintosh/via-pmu: Fix build warning 2021-04-16 23:57:51 +10:00
mailbox - qcom: enable support for SM8350 and SC7280 2021-04-28 16:10:33 -07:00
mcb
md dm integrity: fix sparse warnings 2021-05-13 14:53:49 -04:00
media media: gspca: properly check for errors in po1030_probe() 2021-05-13 18:58:32 +02:00
memory .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
memstick memstick: r592: ignore kfifo_out() return code again 2021-04-26 11:08:23 +02:00
message
mfd - Core Frameworks 2021-04-28 15:59:13 -07:00
misc platform-drivers-x86 for v5.13-2 2021-05-20 06:40:20 -10:00
mmc mmc: sdhci-pci-gli: increase 1.8V regulator wait 2021-05-10 14:39:06 +02:00
most Staging/IIO driver updates for 5.13-rc1 2021-04-26 11:14:21 -07:00
mtd This pull request contains changes for JFFS2, UBI and UBIFS 2021-05-04 18:08:40 -07:00
mux
net brcmfmac: properly check for bus register errors 2021-05-13 18:58:42 +02:00
nfc nfc: st-nci: remove unnecessary label 2021-04-13 14:50:57 -07:00
ntb
nubus
nvdimm include: remove pagemap.h from blkdev.h 2021-05-06 19:24:11 -07:00
nvme nvme: extend and modify the APST configuration algorithm 2021-06-03 10:29:24 +03:00
nvmem
of of: overlay: Remove redundant assignment to ret 2021-05-03 13:57:56 -05:00
opp
parisc
parport treewide: remove editor modelines and cruft 2021-05-07 00:26:34 -07:00
pci more s390 updates for 5.13 merge window 2021-05-06 14:39:50 -07:00
pcmcia
perf ARM: 2021-05-01 10:14:08 -07:00
phy Networking changes for 5.13. 2021-04-29 11:57:23 -07:00
pinctrl linux/kconfig.h: replace IF_ENABLED() with PTR_IF() in <linux/kernel.h> 2021-05-09 00:29:45 +09:00
platform platform/x86: touchscreen_dmi: Add info for the Chuwi Hi10 Pro (CWI529) tablet 2021-05-20 14:11:03 +02:00
pnp
power power supply and reset changes for the v5.13 series 2021-04-28 15:43:58 -07:00
powercap
pps TTY/Serial driver updates for 5.13-rc1 2021-04-26 11:20:10 -07:00
ps3
ptp ARM: 2021-05-01 10:14:08 -07:00
pwm pwm: Changes for v5.13-rc1 2021-05-05 12:53:16 -07:00
rapidio rapidio: handle create_workqueue() failure 2021-05-13 18:32:19 +02:00
ras
regulator - Core Frameworks 2021-04-28 15:59:13 -07:00
remoteproc remoteproc updates for v5.13 2021-05-04 11:13:33 -07:00
reset pci-v5.13-changes 2021-05-05 13:24:11 -07:00
rpmsg
rtc RTC for 5.13 2021-05-03 12:15:21 -07:00
s390 block-5.13-2021-05-07 2021-05-07 11:35:12 -07:00
sbus
scsi SCSI fixes on 20210520 2021-05-20 14:41:35 -10:00
sh The usual updates from the irq departement: 2021-04-26 09:43:16 -07:00
siox
slimbus
soc IOMMU Updates for Linux v5.13 2021-05-01 09:33:00 -07:00
soundwire
spi sound updates for 5.13 2021-04-30 12:48:14 -07:00
spmi
ssb
staging staging: rtl8723bs: avoid bogus gcc warning 2021-05-10 14:26:24 +02:00
target SCSI misc on 20210508 2021-05-08 10:44:36 -07:00
tc
tee AMD-TEE reference count loaded TAs 2021-05-17 16:06:02 +02:00
thermal - Remove duplicate error message for the amlogic driver (Tang Bin) 2021-05-05 12:46:48 -07:00
thunderbolt
tty Char/misc driver fixes for 5.13-rc3 2021-05-20 06:31:52 -10:00
uio uio_hv_generic: Fix another memory leak in error handling paths 2021-05-14 13:26:04 +02:00
usb Driver core fixes for 5.13-rc2 2021-05-16 10:13:14 -07:00
vdpa vDPA/ifcvf: get_config_size should return dev specific config size 2021-05-03 04:55:54 -04:00
vfio IOMMU Updates for Linux v5.13 2021-05-01 09:33:00 -07:00
vhost virtio,vhost,vdpa: features, fixes 2021-05-05 13:31:39 -07:00
video Char/misc driver fixes for 5.13-rc3 2021-05-20 06:31:52 -10:00
virt nitro_enclaves: Fix stale file descriptors on failed usercopy 2021-04-29 19:06:49 +02:00
virtio virtio_pci_modern: correct sparse tags for notify 2021-05-04 04:19:59 -04:00
visorbus
vlynq
vme
w1
watchdog - Core Frameworks 2021-04-28 15:59:13 -07:00
xen xen-pciback: reconfigure also from backend watch handler 2021-05-21 09:55:16 +02:00
zorro
Kconfig staging: comedi: move out of staging directory 2021-04-15 09:26:25 +02:00
Makefile virtio,vhost,vdpa: features, fixes 2021-05-05 13:31:39 -07:00