OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Peter Zijlstra	73b083a196	stop_machine: Add stop_core_cpuslocked() for per-core operations commit `2760f5a415` upstream. Hardware core level testing features require near simultaneous execution of WRMSR instructions on all threads of a core to initiate a test. Provide a customized cut down version of stop_machine_cpuslocked() that just operates on the threads of a single core. Intel-SIG: commit `2760f5a415` stop_machine: Add stop_core_cpuslocked() for per-core operations Backport Intel In Field Scan(IFS) single-blob image support. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220506225410.1652287-4-tony.luck@intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com> [ Aichun Shi: amend commit log ] Signed-off-by: Aichun Shi <aichun.shi@intel.com>	2024-06-11 21:24:32 +08:00
Jianping Liu	22599f9f53	irqlatency: add irq latency monitor support Long irq latency will affect other latency, such as answer a net packet. Add a ko to debug long irq latency, account the delay and show the stack. Signed-off-by: Jianping Liu <frankjpliu@tencent.com> (cherry picked from commit `9056a7a86e`) Signed-off-by: Alex Shi <alexsshi@tencent.com>	2024-06-11 21:18:47 +08:00
Kan Liang	bc2637e898	perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT commit `2a6c6b7d7a` upstream. Current PERF_SAMPLE_WEIGHT sample type is very useful to expresses the cost of an action represented by the sample. This allows the profiler to scale the samples to be more informative to the programmer. It could also help to locate a hotspot, e.g., when profiling by memory latencies, the expensive load appear higher up in the histograms. But current PERF_SAMPLE_WEIGHT sample type is solely determined by one factor. This could be a problem, if users want two or more factors to contribute to the weight. For example, Golden Cove core PMU can provide both the instruction latency and the cache Latency information as factors for the memory profiling. For current X86 platforms, although meminfo::latency is defined as a u64, only the lower 32 bits include the valid data in practice (No memory access could last than 4G cycles). The higher 32 bits can be used to store new factors. Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new sample weight structure. It shares the same space as the PERF_SAMPLE_WEIGHT sample type. Users can apply either the PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but they cannot apply both sample types simultaneously. Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type. - For PowerPC, there is nothing changed for the PERF_SAMPLE_WEIGHT sample type. There is no effect for the new PERF_SAMPLE_WEIGHT_STRUCT sample type. PowerPC can re-struct the weight field similarly later. - For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now. The following patches will apply the new factors for the PERF_SAMPLE_WEIGHT_STRUCT sample type. The field in the union perf_sample_weight should be shared among different architectures. A generic name is required, but it's hard to abstract a name that applies to all architectures. For example, on X86, the fields are to store all kinds of latency. While on PowerPC, it stores MMCRA[TECX/TECM], which should not be latency. So a general name prefix 'var$NUM' is used here. Intel-SIG: commit `2a6c6b7d7a` perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT Backport for Sapphire Rapids core PMU support. Note: This backported patch has some deviations from upstream version. To avoid enum hole in perf_event_sample_format struct, we added PERF_SAMPLE_{AUX,CGROUP,DATA_PAGE_SIZE,CODE_PAGE_SIZE} to file include/uapi/linux/perf_event.h, but didn't backport the full patchsets that introducing these enumeration values. To avoid mishandling of these sampling formats, we added check to perf_copy_attr() in kernel/events/core.c, to make sure -EINVAL always being returned for these lack-of-kernel-support sampling formats. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1611873611-156687-2-git-send-email-kan.liang@linux.intel.com [ Yunying Sun: amend commit log ] Signed-off-by: Yunying Sun <yunying.sun@intel.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:18:41 +08:00
Petr Mladek	9a4ad8a725	printk: Prepare for nested printk_nmi_enter() commit `2ade0d6093` upstream. There is plenty of space in the printk_context variable. Reserve one byte there for the NMI context to be on the safe side. It should never overflow. The BUG_ON(in_nmi() == NMI_MASK) in nmi_enter() will trigger much earlier. Intel-SIG: commit `2ade0d6093` printk: Prepare for nested printk_nmi_enter(). Backport to kernel 5.4 to enhance MCA-R. Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134100.681374113@linutronix.de [ Youquan Song: amend commit log ] Signed-off-by: Youquan Song <youquan.song@intel.com>	2024-06-11 21:18:17 +08:00
Will Deacon	6effe01d41	locking/refcount: Remove unused 'refcount_error_report()' function commit `7221762c48` upstream. 'refcount_error_report()' has no callers. Remove it. Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Tested-by: Hanjun Guo <guohanjun@huawei.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191121115902.2551-10-will@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:18:08 +08:00
Yang Weijiang	c9ad1a605c	drivers/idxd: Fixup errors reported during DSA device migration commit opencloudos. When there's no MSI generated during LM, device dev_msi_list is actually empty. Add empty check before process the list to prevent returning an invalid entry pointer, therefore avoiding parsing invalid memory and eventually eliminating induced error messages. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:18:02 +08:00
Chen Zhuo	cebecd3add	clocksource: Reduce the default clocksource_watchdog() retries to 2 commit `1a5620671a` upstream. With the previous patch, there is an extra watchdog read in each retry. Now the total number of clocksource reads is increased to 4 per iteration. In order to avoid increasing the clock skew check overhead, the default maximum number of retries is reduced from 3 to 2 to maintain the same 12 clocksource reads in the worst case. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:17:48 +08:00
Waiman Long	365291c655	clocksource: Avoid accidental unstable marking of clocksources commit `c86ff8c55b` upstream. Since commit `db3a34e174` ("clocksource: Retry clock read if long delays detected") and commit `2e27e793e2` ("clocksource: Reduce clocksource-skew threshold"), it is found that tsc clocksource fallback to hpet can sometimes happen on both Intel and AMD systems especially when they are running stressful benchmarking workloads. Of the 23 systems tested with a v5.14 kernel, 10 of them have switched to hpet clock source during the test run. The result of falling back to hpet is a drastic reduction of performance when running benchmarks. For example, the fio performance tests can drop up to 70% whereas the iperf3 performance can drop up to 80%. 4 hpet fallbacks happened during bootup. They were: [ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable Other fallbacks happen when the systems were running stressful benchmarks. For example: [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable Commit `2e27e793e2` ("clocksource: Reduce clocksource-skew threshold"), changed the skew margin from 100us to 50us. I think this is too small and can easily be exceeded when running some stressful workloads on a thermally stressed system. So it is switched back to 100us. Even a maximum skew margin of 100us may be too small in for some systems when booting up especially if those systems are under thermal stress. To eliminate the case that the large skew is due to the system being too busy slowing down the reading of both the watchdog and the clocksource, an extra consecutive read of watchdog clock is being done to check this. The consecutive watchdog read delay is compared against WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that the system is just too busy. A warning will be printed to the console and the clock skew check is skipped for this round. Fixes: `db3a34e174` ("clocksource: Retry clock read if long delays detected") Fixes: `2e27e793e2` ("clocksource: Reduce clocksource-skew threshold") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:17:48 +08:00
Paul E. McKenney	4dfde4e428	clocksource: Retry clock read if long delays detected commit `db3a34e174` upstream. When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. Therefore, re-read the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_cswd_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If more than one retry was required, a message is printed on the console (the occasional single retry is expected behavior, especially in guest OSes). If the maximum number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:17:48 +08:00
Jacob Pan	77859b04f8	ioasid: Reject null notifier call registration commit 888f44bd1b53913f7d809369489fd7daae5ae23c Intel-BKC. Booting multiple VMs causes Invalid notifier called! call trace. Error message below observed during boot: [ 638.774647] Invalid notifier called! [ 638.774669] WARNING: CPU: 319 PID: 46882 at kernel/notifier.c:78 notifier_call_chain+0x79/0xb0 [ 638.870074] RIP: 0010:notifier_call_chain+0x79/0xb0 [ 638.972742] atomic_notifier_call_chain+0x17/0x30 [ 638.989967] ioasid_notify+0x77/0xe0 [ 638.989987] ioasid_alloc+0x19e/0x230 [ 639.003162] ioasid_fops_unl_ioctl+0xb4/0x1b0 This is due to early registration of NULL callback function pointer. [ 613.023815] ioasid_add_pending_nb: nh ff37c3c3913de380 b call ffffffffc115a770, nr_ioasids 0 [ 613.023883] ioasid_add_pending_nb: nh ff37c3c3913de380 b call 0, It is unclear who registered the invalid callback, this patch fixes the issue by rejecting the registration. Dump stack should show the culprit. It is also possible that nb got cleared somehow outside ioasid core. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:17:21 +08:00
Zhuo Chen	4daace26fb	msi: dynamic dev_msi commit 86d8cd35c5641a6807506674f72030c4d3ceaa8e Intel-BKC. Enable a single device to hold both device msi and msi/x interrupts. Signed-off-by: Megha Dey <megha.dey@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:19 +08:00
Zenghui Yu	f2a3fa6707	genirq/msi: Initialize msi_alloc_info before calling msi_domain_prepare_irqs() commit `06fde695ee` upstream. Since commit `5fe71d271d` ("irqchip/gic-v3-its: Tag ITS device as shared if allocating for a proxy device"), some of the devices are wrongly marked as "shared" by the ITS driver on systems equipped with the ITS(es). The problem is that the @info->flags may not be initialized anywhere and we end up looking at random bits on the stack. That's obviously not good. We can perform the initialization in the IRQ core layer before calling msi_domain_prepare_irqs(), which is neat enough. Fixes: `5fe71d271d` ("irqchip/gic-v3-its: Tag ITS device as shared if allocating for a proxy device") Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20201218060039.1770-1-yuzenghui@huawei.com Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:18 +08:00
Dave Jiang	723fbc97d0	MSI: export msi_domain_alloc_irqs() symbol commit 0963865b1fb1c2685c42c6ea5957d1fe8096932b Intel-BKC. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:18 +08:00
Megha Dey	aaad75d44c	PCI/MSI: Free MSI-X interrupt with a given IRQ number commit f35723a81e0880e3923bc0b3d198bba8d814a3fd Intel-BKC. Currently, the pci_msix_disable() frees all the allocated resources associated with a PCIe device when the device is being shut down. With the introduction of dynamic allocation of MSI-X vectors, there may be cases where drivers want to free only a particular interrupt, even when the device is not being shut down. A new API, pci_free_msix_irq_vector() provides this type of interface. Signed-off-by: Megha Dey <megha.dey@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:17 +08:00
Megha Dey	67f3acd4dd	PCI/MSI: Enable dynamic allocation of MSI-X vectors commit ef818654dbb197aa1e2cd3baaeb0ac1b98b47e26 Intel-BKC. Introduce a new API pci_add_msix_irq_vector(), which can be called multiple times by a driver to add a new MSI-X vector to the device after some number of interrupts have already been allocated using the existing pci_alloc_irq_vectors API. If successful, the API returns the device-relative interrupt vector index (0-based) which can be passed to pci_irq_vector() to retrieve the Linux IRQ number of that device vector. It should be called only after the pci_alloc_irq_vectors(). Add a new member msix_alloc_count to keep track of the number of the MSI-X vectors currently allocated to the device. Signed-off-by: Megha Dey <megha.dey@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:16 +08:00
Zhuo Chen	13a7cba2c0	genirq/msi: Iterate msi_list starting from a given desc commit 235988b35dba007134f4ec4d973408b8a5c0c8f5 Intel-BKC. This is a preparatory patch to enable dynamic allocation of MSI-X interrupts. With the addition of dynamic msix, the msi_list of the device is no longer immutable. To set up new vectors, we need to iterate only through the newly added entries of this list. To help with this: 1. A msi_last_list pointer is added to struct device which points to the last msi_desc's list before a new allocation. 2. New macros are introduced which iterate the msi_list from the 1st newly added msi_desc of every allocation using (1). No functional change. Signed-off-by: Megha Dey <megha.dey@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:15 +08:00
Dave Jiang	6d52dbbfd2	genirq/msi: Provide helpers to return Linux IRQ/dev_msi hw IRQ number commit 84ede51031d8df62a94b4014294e81ff25b502fb Intel-BKC. Add new helpers to get the Linux IRQ number and device specific index for given device-relative vector so that the drivers don't need to allocate their own arrays to keep track of the vectors and hwirq for the multi vector device MSI case. Reviewed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Megha Dey <megha.dey@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:10 +08:00
Megha Dey	93a5b0469a	genirq: Set auxiliary data for an interrupt commit 459da077f38ceffaa524bd504d66739e37314fd1 Intel-BKC. Introduce a new function pointer in the irq_chip structure(irq_set_auxdata) which is responsible for updating data which is stored in a shared register or data storage. For example, the idxd driver uses the auxiliary data API to enable/set and disable PASID field that is in the IMS entry (introduced in a later patch) and that data are not typically present in MSI entry. Reviewed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Megha Dey <megha.dey@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:08 +08:00
Thomas Gleixner	960bed491d	irqdomain/msi: Provide msi_alloc/free_store() callbacks commit edd679d4da0966b436c15ba1867b995ce582e39b Intel-BKC. For devices which don't have a standard storage for MSI messages like the upcoming IMS (Interrupt Message Store) it's required to allocate storage space before allocating interrupts and after freeing them. This could be achieved with the existing callbacks, but that would be awkward because they operate on msi_alloc_info_t which is not uniform across architectures. Also these callbacks are invoked per interrupt but the allocation might have bulk requirements depending on the device. As such devices can operate on different architectures it is simpler to have separate callbacks which operate on struct device. The resulting storage information has to be stored in struct msi_desc so the underlying irq chip implementation can retrieve it for the relevant operations. Reviewed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Megha Dey <megha.dey@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:07 +08:00
Bixuan Cui	8cc9437ccd	genirq/msi: Ensure deactivation on teardown commit `dbbc93576e` upstream. msi_domain_alloc_irqs() invokes irq_domain_activate_irq(), but msi_domain_free_irqs() does not enforce deactivation before tearing down the interrupts. This happens when PCI/MSI interrupts are set up and never used before being torn down again, e.g. in error handling pathes. The only place which cleans that up is the error handling path in msi_domain_alloc_irqs(). Move the cleanup from msi_domain_alloc_irqs() into msi_domain_free_irqs() to cure that. Fixes: `f3b0946d62` ("genirq/msi: Make sure PCI MSIs are activated early") Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210518033117.78104-1-cuibixuan@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:07 +08:00
Thomas Gleixner	e16c62e5e3	irqdomain/msi: Allow to override msi_domain_alloc/free_irqs() commit `43e9e705dd` upstream. To support MSI irq domains which do not fit at all into the regular MSI irqdomain scheme, like the XEN MSI interrupt management for PV/HVM/DOM0, it's necessary to allow to override the alloc/free implementation. This is a preperatory step to switch X86 away from arch_*_msi_irqs() and store the irq domain pointer right in struct device. No functional change for existing MSI irq domain users. Aside of the evil XEN wrapper this is also useful for special MSI domains which need to do extra alloc/free work before/after calling the generic core function. Work like allocating/freeing MSI descriptors, MSI storage space etc. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200826112333.526797548@linutronix.de Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:07 +08:00
Thomas Gleixner	e1e1af9d09	platform-msi: Add device MSI infrastructure commit 2f4841d50818106b263e60c4fbb6ead6bd770634 Intel-BKC. Add device specific MSI domain infrastructure for devices which have their own resource management and interrupt chip. These devices are not related to PCI and contrary to platform MSI they do not share a common resource and interrupt chip. They provide their own domain specific resource management and interrupt chip. This utilizes the new alloc/free override in a non evil way which avoids having yet another set of specialized alloc/free functions. Just using msi_domain_alloc/free_irqs() is sufficient While initially it was suggested and tried to piggyback device MSI on platform MSI, the better variant is to reimplement platform MSI on top of device MSI. Reviewed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Megha Dey <megha.dey@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:06 +08:00
Thomas Gleixner	d3a99b4737	genirq/msi: Provide and use msi_domain_set_default_info_flags() commit d4a4db1e8108a2a78183490484bc9adda867cc0f Intel-BKC. MSI interrupts have some common flags which should be set not only for PCI/MSI interrupts. Move the PCI/MSI flag setting into a common function so it can be reused. Reviewed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Megha Dey <megha.dey@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:06 +08:00
Barry Song	ae6052f2f9	genirq/msi: Move MSI sysfs handling from PCI to MSI core commit `2f170814bd` upstream. Move PCI's MSI sysfs code to the irq core so that other busses such as platform can reuse it. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210813035628.6844-2-21cnbao@gmail.com Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:05 +08:00
Thomas Gleixner	9141b1e1ca	genirq/proc: Take buslock on affinity write commit a99ffe1a97b9a6151c2f8913fad6da2face81ef8 Intel-BKC. Until now interrupt chips which support setting affinity are not locking the associated bus lock for two reasons: - All chips which support affinity setting do not use buslock because they just can operated directly on the hardware. - All chips which use buslock do not support affinity setting because their interrupt chips are not capable. These chips are usually connected over a bus like I2C, SPI etc. and have an interrupt output which is conneted to CPU interrupt of some sort. So there is no way to set the affinity on the chip itself. Upcoming hardware which is PCIE based sports a non standard MSI(X) variant which stores the MSI message in RAM which is associated to e.g. a device queue. The device manages this RAM and writes have to be issued via command queues or similar mechanisms which is obviously not possible from interrupt disabled, raw spinlock held context. The buslock mechanism of irq chips can be utilized to support that. The affinity write to the chip writes to shadow state, marks it pending and the irq chip's irq_bus_sync_unlock() callback handles the command queue and wait for completion similar to the other chip operations on I2C or SPI buses. Change the locking in irq_set_affinity() to bus_lock/unlock to help with that. There are a few other callers than the proc interface, but none of them is affected by this change as none of them affects an irq chip with bus lock support. Reviewed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Megha Dey <megha.dey@intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:01 +08:00
Thomas Gleixner	9d3ba14962	genirq: Export affinity setter for modules commit `4d80d6ca5d` upstream. Perf modules abuse irq_set_affinity_hint() to set the affinity of system PMU interrupts just because irq_set_affinity() was not exported. The fact that irq_set_affinity_hint() actually sets the affinity is a non-documented side effect and the name is clearly saying it's a hint. To clean this up, export the real affinity setter. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20210518093117.968251441@linutronix.de Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:16:01 +08:00
Zhuo Chen	70d2064b05	cgroup: Introduce ioasids controller commit 72d6878a1c013c048823bf66cce9c00f20d10956 Intel-BKC. IOASIDs are used to associate DMA requests with virtual address spaces. They are a system-wide limited resource made available to the userspace applications. Let it be VMs or user-space device drivers. This RFC patch introduces a cgroup controller to address the following problems: 1. Some user applications exhaust all the available IOASIDs thus depriving others of the same host. 2. System admins need to provision VMs based on their needs for IOASIDs, e.g. the number of VMs with assigned devices that perform DMA requests with PASID. This patch is nowhere near its completion, it merely provides the basic functionality for resource distribution and cgroup hierarchy organizational changes. Since this is part of a greater effort to enable Shared Virtual Address (SVA) virtualization. We would like to have a direction check and collect feedback early. For details, please refer to the documentation: Documentation/admin-guide/cgroup-v1/ioasids.rst Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:17 +08:00
Zhuo Chen	bec777b4ce	Revert "iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit" commit opencloudos. This reverts commit 0ed3e1e3c618b8069f5b57aac5340e3efd1ea2bc. Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:10 +08:00
Claire Chang	b56b655104	swiotlb: Fix the type of index commit `95b079d821` upstream. Fix the type of index from unsigned int to int since find_slots() might return -1. Fixes: `26a7e09478` ("swiotlb: refactor swiotlb_tbl_map_single") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Claire Chang <tientzu@chromium.org> Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:08 +08:00
Christoph Hellwig	87e6775225	swiotlb: split swiotlb_tbl_sync_single commit `80808d273a` upstream. Split swiotlb_tbl_sync_single into two separate funtions for the to device and to cpu synchronization. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:08 +08:00
Christoph Hellwig	dea6871daa	swiotlb: move orig addr and size validation into swiotlb_bounce commit `2bdba622c3` upstream. Move the code to find and validate the original buffer address and size from the callers into swiotlb_bounce. This means a tiny bit of extra work in the swiotlb_map path, but avoids code duplication and a leads to a better code structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:08 +08:00
Christoph Hellwig	c7e40421dd	swiotlb: remove the alloc_size parameter to swiotlb_tbl_unmap_single commit `2973073a80` upstream. Now that swiotlb remembers the allocation size there is no need to pass it back to swiotlb_tbl_unmap_single. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:07 +08:00
Martin Radev	1d10dfc9fa	swiotlb: Validate bounce size in the sync/unmap path commit `daf9514fd5` upstream. The size of the buffer being bounced is not checked if it happens to be larger than the size of the mapped buffer. Because the size can be controlled by a device, as it's the case with virtio devices, this can lead to memory corruption. This patch saves the remaining buffer memory for each slab and uses that information for validation in the sync/unmap paths before swiotlb_bounce is called. Validating this argument is important under the threat models of AMD SEV-SNP and Intel TDX, where the HV is considered untrusted. Signed-off-by: Martin Radev <martin.b.radev@gmail.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:07 +08:00
Christoph Hellwig	75e44d5f56	swiotlb: respect min_align_mask commit `1f221a0d0d` upstream. Respect the min_align_mask in struct device_dma_parameters in swiotlb. There are two parts to it: 1) for the lower bits of the alignment inside the io tlb slot, just extent the size of the allocation and leave the start of the slot empty 2) for the high bits ensure we find a slot that matches the high bits of the alignment to avoid wasting too much memory Based on an earlier patch from Jianxiong Gao <jxgao@google.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:06 +08:00
Christoph Hellwig	bbec749a45	swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single commit `16fc3cef33` upstream. swiotlb_tbl_map_single currently nevers sets a tlb_addr that is not aligned to the tlb bucket size. But we're going to add such a case soon, for which this adjustment would be bogus. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:05 +08:00
Christoph Hellwig	de94afeebc	swiotlb: refactor swiotlb_tbl_map_single commit `26a7e09478` upstream. Split out a bunch of a self-contained helpers to make the function easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:05 +08:00
Linus Torvalds	28d2a1e194	Reinstate some of "swiotlb: rework "fix info leak with DMA_FROM_DEVICE"" commit `901c7280ca` upstream. Halil Pasic points out [1] that the full revert of that commit (revert in `bddac7c1e0`), and that a partial revert that only reverts the problematic case, but still keeps some of the cleanups is probably better. And that partial revert [2] had already been verified by Oleksandr Natalenko to also fix the issue, I had just missed that in the long discussion. So let's reinstate the cleanups from commit `aa6f8dcbab` ("swiotlb: rework "fix info leak with DMA_FROM_DEVICE""), and effectively only revert the part that caused problems. Link: https://lore.kernel.org/all/20220328013731.017ae3e3.pasic@linux.ibm.com/ [1] Link: https://lore.kernel.org/all/20220324055732.GB12078@lst.de/ [2] Link: https://lore.kernel.org/all/4386660.LvFx2qVVIh@natalenko.name/ [3] Suggested-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Christoph Hellwig" <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [OP: backport to 5.4: adjusted context] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:05 +08:00
Halil Pasic	e14c7aa066	swiotlb: fix info leak with DMA_FROM_DEVICE commit `ddbd89deb7` upstream. The problem I'm addressing was discovered by the LTP test covering cve-2018-1000204. A short description of what happens follows: 1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO interface with: dxfer_len == 524288, dxdfer_dir == SG_DXFER_FROM_DEV and a corresponding dxferp. The peculiar thing about this is that TUR is not reading from the device. 2) In sg_start_req() the invocation of blk_rq_map_user() effectively bounces the user-space buffer. As if the device was to transfer into it. Since commit `a45b599ad8` ("scsi: sg: allocate with __GFP_ZERO in sg_build_indirect()") we make sure this first bounce buffer is allocated with GFP_ZERO. 3) For the rest of the story we keep ignoring that we have a TUR, so the device won't touch the buffer we prepare as if the we had a DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device and the buffer allocated by SG is mapped by the function virtqueue_add_split() which uses DMA_FROM_DEVICE for the "in" sgs (here scatter-gather and not scsi generics). This mapping involves bouncing via the swiotlb (we need swiotlb to do virtio in protected guest like s390 Secure Execution, or AMD SEV). 4) When the SCSI TUR is done, we first copy back the content of the second (that is swiotlb) bounce buffer (which most likely contains some previous IO data), to the first bounce buffer, which contains all zeros. Then we copy back the content of the first bounce buffer to the user-space buffer. 5) The test case detects that the buffer, which it zero-initialized, ain't all zeros and fails. One can argue that this is an swiotlb problem, because without swiotlb we leak all zeros, and the swiotlb should be transparent in a sense that it does not affect the outcome (if all other participants are well behaved). Copying the content of the original buffer into the swiotlb buffer is the only way I can think of to make swiotlb transparent in such scenarios. So let's do just that if in doubt, but allow the driver to tell us that the whole mapped buffer is going to be overwritten, in which case we can preserve the old behavior and avoid the performance impact of the extra bounce. Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:04 +08:00
Christoph Hellwig	669c6ffb56	swiotlb: clean up swiotlb_tbl_unmap_single commit `ca10d0f8e5` upstream. Remove a layer of pointless indentation, replace a hard to follow ternary expression with a plain if/else. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:04 +08:00
Christoph Hellwig	4e7732a1dd	swiotlb: factor out a nr_slots helper commit `c32a77fd18` upstream. Factor out a helper to find the number of slots for a given size. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:04 +08:00
Christoph Hellwig	482356b07e	swiotlb: factor out an io_tlb_offset helper commit `c7fbeca757` upstream. Replace the very genericly named OFFSET macro with a little inline helper that hardcodes the alignment to the only value ever passed. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:03 +08:00
Christoph Hellwig	dc1e9d308a	swiotlb: add a IO_TLB_SIZE define commit `b5d7ccb7aa` upstream. Add a new IO_TLB_SIZE define instead open coding it using IO_TLB_SHIFT all over. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jianxiong Gao <jxgao@google.com> Tested-by: Jianxiong Gao <jxgao@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:03 +08:00
Ashish Kalra	0987625ace	x86,swiotlb: Adjust SWIOTLB bounce buffer size for SEV guests commit `e998879d4f` upstream. For SEV, all DMA to and from guest has to use shared (un-encrypted) pages. SEV uses SWIOTLB to make this happen without requiring changes to device drivers. However, depending on the workload being run, the default 64MB of it might not be enough and it may run out of buffers to use for DMA, resulting in I/O errors and/or performance degradation for high I/O workloads. Adjust the default size of SWIOTLB for SEV guests using a percentage of the total memory available to guest for the SWIOTLB buffers. Adds a new sev_setup_arch() function which is invoked from setup_arch() and it calls into a new swiotlb generic code function swiotlb_adjust_size() to do the SWIOTLB buffer adjustment. v5 fixed build errors and warnings as Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Co-developed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:14:03 +08:00
Christoph Hellwig	5e282a42ea	dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling commit `849facea92` upstream. Use and entirely separate code path for the DMA_ATTR_NO_KERNEL_MAPPING path. This avoids any confusion about the ret type, and avoids lots of attr checks and helpers that can be significantly simplified now. It also ensures that common handling is applied to architetures still using the arch alloc/free hooks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:13:40 +08:00
Christoph Hellwig	328b17f568	dma-direct: factor out a dma_direct_alloc_from_pool helper commit `5b138c534f` upstream. This ensures dma_direct_alloc_pages will use the right gfp mask, as well as keeping the code for that common between the two allocators. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:13:39 +08:00
Christoph Hellwig	fc6a568014	dma-direct check for highmem pages in dma_direct_alloc_pages commit `08a89c2830` upstream. Check for highmem pages from CMA, just like in the dma_direct_alloc path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:13:39 +08:00
Thomas Tai	7b2ba19af4	dma-direct: Fix potential NULL pointer dereference commit `f959dcd6dd` upstream. When booting the kernel v5.9-rc4 on a VM, the kernel would panic when printing a warning message in swiotlb_map(). The dev->dma_mask must not be a NULL pointer when calling the dma mapping layer. A NULL pointer check can potentially avoid the panic. Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:13:39 +08:00
Christoph Hellwig	c0471bdaec	dma-mapping: return an unsigned int from dma_map_sg{,_attrs} commit `2a047e0662` upstream. These can only return 0 for failure or the number of entries, so turn the return value into an unsigned int. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:13:38 +08:00
Logan Gunthorpe	226f754b23	dma-mapping: disallow .map_sg operations from returning zero on error commit `d03c544192` upstream. Now that all the .map_sg operations have been converted to returning proper error codes, drop the code to handle a zero return value, add a warning if a zero is returned. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:13:38 +08:00
Martin Oliveira	6fabdf3779	dma-mapping: return error code from dma_dummy_map_sg() commit `6506932b32` upstream. The .map_sg() op now expects an error code instead of zero on failure. The only errno to return is -EINVAL in the case when DMA is not supported. Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chen Zhuo <sagazchen@tencent.com> Signed-off-by: Xinghui Li <korantli@tencent.com>	2024-06-11 21:13:38 +08:00

1 2 3 4 5 ...

31639 Commits