Put all the debugging output behind "mtrr=debug" and get rid of
"mtrr_cleanup_debug" which wasn't even documented anywhere.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20230531174857.GDZHeIib57h5lT5Vh1@fat_crate.local
mtrr_centaur_report_mcr() isn't used by anyone, so it can be removed.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-17-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Today pud_set_huge() and pmd_set_huge() test for the MTRR type to be
WB or INVALID after calling mtrr_type_lookup().
Those tests can be dropped as the only reason not to use a large mapping
would be uniform being 0.
Any MTRR type can be accepted as long as it applies to the whole memory
range covered by the mapping, as the alternative would only be to map
the same region with smaller pages instead, using the same PAT type as
for the large mapping.
[ bp: Massage commit message. ]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-16-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
mtrr_type_lookup() should always return a valid memory type. In case
there is no information available, it should return the default UC.
This will remove the last case where mtrr_type_lookup() can return
MTRR_TYPE_INVALID, so adjust the comment in include/uapi/asm/mtrr.h.
Note that removing the MTRR_TYPE_INVALID #define from that header
could break user code, so it has to stay.
At the same time the mtrr_type_lookup() stub for the !CONFIG_MTRR
case should set uniform to 1, as if the memory range would be
covered by no MTRR at all.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-15-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Instead of crawling through the MTRR register state, use the new
cache_map for looking up the cache type(s) of a memory region.
This allows now to set the uniform parameter according to the
uniformity of the cache mode of the region, instead of setting it
only if the complete region is mapped by a single MTRR. This now
includes even the region covered by the fixed MTRR registers.
Make sure uniform is always set.
[ bp: Massage. ]
[ jgross: Explain mtrr_type_lookup() logic. ]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-14-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Add a new command line option "mtrr=debug" for getting debug output
after building the new cache mode map. The output will include MTRR
register values and the resulting map.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-13-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
After MTRR initialization construct a memory map with cache modes from
MTRR values. This will speed up lookups via mtrr_lookup_type()
especially in case of overlapping MTRRs.
This will be needed when switching the semantics of the "uniform"
parameter of mtrr_lookup_type() from "only covered by one MTRR" to
"memory range has a uniform cache mode", which is the data the callers
really want to know. Today this information is not easily available,
in case MTRRs are not well sorted regarding base address.
The map will be built in __initdata. When memory management is up, the
map will be moved to dynamically allocated memory, in order to avoid
the need of an overly large array. The size of this array is calculated
using the number of variable MTRR registers and the needed size for
fixed entries.
Only add the map creation and expansion for now. The lookup will be
added later.
When writing new MTRR entries in the running system rebuild the map
inside the call from mtrr_rendezvous_handler() in order to avoid nasty
race conditions with concurrent lookups.
[ bp: Move out rebuild_map() call and rename it. ]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-12-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Add a service function for obtaining the effective cache mode of
overlapping MTRR registers.
Make use of that function in check_type_overlap().
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-11-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
The mtrr_value[] array is a static variable which is used only in a few
configurations. Consuming 6kB is ridiculous for this case, especially as
the array doesn't need to be that large and it can easily be allocated
dynamically.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-10-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
There is some code in mtrr.c which is relevant for old 32-bit CPUs
only. Move it to a new source legacy.c.
While modifying mtrr_init_finalize() fix spelling of its name.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-9-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Today there are two variants of set_mtrr(): one calling stop_machine()
and one calling stop_machine_cpuslocked().
The first one (set_mtrr()) has only one caller, and this caller is
running only when resuming from suspend when the interrupts are still
off and only one CPU is active. Additionally this code is used only on
rather old 32-bit CPUs not supporting SMP.
For these reasons the first variant can be replaced by a simple call of
mtrr_if->set().
Rename the second variant set_mtrr_cpuslocked() to set_mtrr() now that
there is only one variant left, in order to have a shorter function
name.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-8-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Modern CPUs all share the same MTRR interface implemented via
generic_mtrr_ops.
At several places in MTRR code this generic interface is deduced via
is_cpu(INTEL) tests, which is only working due to X86_VENDOR_INTEL
being 0 (the is_cpu() macro is testing mtrr_if->vendor, which isn't
explicitly set in generic_mtrr_ops).
Test the generic CPU feature X86_FEATURE_MTRR instead.
The only other place where the .vendor member of struct mtrr_ops is
being used is in set_num_var_ranges(), where depending on the vendor
the number of MTRR registers is determined. This can easily be changed
by replacing .vendor with the static number of MTRR registers.
It should be noted that the test "is_cpu(HYGON)" wasn't ever returning
true, as there is no struct mtrr_ops with that vendor information.
[ bp: Use mtrr_enabled() before doing mtrr_if-> accesses, esp. in
mtrr_trim_uncached_memory() which gets called independently from
whether mtrr_if is set or not. ]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230502120931.20719-7-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
When running as Xen PV initial domain (aka dom0), MTRRs are disabled
by the hypervisor, but the system should nevertheless use correct
cache memory types. This has always kind of worked, as disabled MTRRs
resulted in disabled PAT, too, so that the kernel avoided code paths
resulting in inconsistencies. This bypassed all of the sanity checks
the kernel is doing with enabled MTRRs in order to avoid memory
mappings with conflicting memory types.
This has been changed recently, leading to PAT being accepted to be
enabled, while MTRRs stayed disabled. The result is that
mtrr_type_lookup() no longer is accepting all memory type requests,
but started to return WB even if UC- was requested. This led to
driver failures during initialization of some devices.
In reality MTRRs are still in effect, but they are under complete
control of the Xen hypervisor. It is possible, however, to retrieve
the MTRR settings from the hypervisor.
In order to fix those problems, overwrite the MTRR state via
mtrr_overwrite_state() with the MTRR data from the hypervisor, if the
system is running as a Xen dom0.
Fixes: 72cbc8f04f ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-6-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
In order to avoid mappings using the UC- cache attribute, set the
MTRR state to use WB caching as the default.
This is needed in order to cope with the fact that PAT is enabled,
while MTRRs are not supported by the hypervisor.
Fixes: 90b926e68f ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-5-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
When running virtualized, MTRR access can be reduced (e.g. in Xen PV
guests or when running as a SEV-SNP guest under Hyper-V). Typically, the
hypervisor will not advertize the MTRR feature in CPUID data, resulting
in no MTRR memory type information being available for the kernel.
This has turned out to result in problems (Link tags below):
- Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
- Xen PV dom0 mapping memory as WB which should be UC- instead
Solve those problems by allowing an MTRR static state override,
overwriting the empty state used today. In case such a state has been
set, don't call get_mtrr_state() in mtrr_bp_init().
The set state will only be used by mtrr_type_lookup(), as in all other
cases mtrr_enabled() is being checked, which will return false. Accept
the overwrite call only for selected cases when running as a guest.
Disable X86_FEATURE_MTRR in order to avoid any MTRR modifications by
just refusing them.
[ bp: Massage. ]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/all/4fe9541e-4d4c-2b2a-f8c8-2d34a7284930@nerdbynature.de/
Link: https://lore.kernel.org/lkml/BYAPR21MB16883ABC186566BD4D2A1451D7FE9@BYAPR21MB1688.namprd21.prod.outlook.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Replace size_or_mask and size_and_mask with the much easier concept of
high reserved bits.
While at it, instead of using constants in the MTRR code, use some new
[ bp:
- Drop mtrr_set_mask()
- Unbreak long lines
- Move struct mtrr_state_type out of the uapi header as it doesn't
belong there. It also fixes a HDRTEST breakage "unknown type name ‘bool’"
as Reported-by: kernel test robot <lkp@intel.com>
- Massage.
]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230502120931.20719-3-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Previous Sub-NUMA Clustering changes need not just a count of blades
present, but a count that includes any missing ids for blades not
present; in other words, the range from lowest to highest blade id.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230519190752.3297140-9-steve.wahl%40hpe.com
Sub-NUMA clustering (SNC) invalidates previous assumptions of a 1:1
relationship between blades, sockets, and nodes. Fix these
assumptions and build tables correctly when SNC is enabled.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230519190752.3297140-7-steve.wahl%40hpe.com
Add alloc_conv_table() and FREE_1_TO_1_TABLE() to reduce duplicated
code among the conversion tables we use.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230519190752.3297140-6-steve.wahl%40hpe.com
Using a starting value of INT_MAX rather than 999999 or 99999 means
this algorithm won't fail should the numbers being compared ever
exceed this value.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230519190752.3297140-5-steve.wahl%40hpe.com
Fix incorrect mask names and values in calc_mmioh_map() that caused it
to print wrong NASID information. And an unused blade position is not
an error condition, but will yield an invalid NASID value, so change
the invalid NASID message from an error to a debug message.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230519190752.3297140-4-steve.wahl%40hpe.com
Add and use uv_pnode_to_socket() function, which parallels other
helper functions in here, and will enable avoiding duplicate code
in an upcoming patch.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230519190752.3297140-3-steve.wahl%40hpe.com
Upcoming changes will require use of new #defines
UVH_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK and
UVH_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK, which provide the
appropriate values on different uv platforms.
Also, fix typo that defined a couple of "*_CONFIG0_*" values twice
where "*_CONFIG1_*" was intended.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230519190752.3297140-2-steve.wahl%40hpe.com
The decision to allow parallel bringup of secondary CPUs checks
CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use
parallel bootup because accessing the local APIC is intercepted and raises
a #VC or #VE, which cannot be handled at that point.
The check works correctly, but only for AMD encrypted guests. TDX does not
set that flag.
As there is no real connection between CC attributes and the inability to
support parallel bringup, replace this with a generic control flag in
x86_cpuinit and let SEV-ES and TDX init code disable it.
Fixes: 0c7ffa32db ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it")
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/87ilc9gd2d.ffs@tglx
RESET_CALL_DEPTH is a pretty fat monster and blows up UNTRAIN_RET to
20 bytes:
19: 48 c7 c0 80 00 00 00 mov $0x80,%rax
20: 48 c1 e0 38 shl $0x38,%rax
24: 65 48 89 04 25 00 00 00 00 mov %rax,%gs:0x0 29: R_X86_64_32S pcpu_hot+0x10
Shrink it by 4 bytes:
0: 31 c0 xor %eax,%eax
2: 48 0f ba e8 3f bts $0x3f,%rax
7: 65 48 89 04 25 00 00 00 00 mov %rax,%gs:0x0
Shrink RESET_CALL_DEPTH_FROM_CALL by 5 bytes by only setting %al, the
other bits are shifted out (the same could be done for RESET_CALL_DEPTH,
but the XOR+BTS sequence has less dependencies due to the zeroing).
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230515093020.729622326@infradead.org
By adding support for longer NOPs there are a few more alternatives
that can turn into a single instruction.
Add up to NOP11, the same limit where GNU as .nops also stops
generating longer nops. This is because a number of uarchs have severe
decode penalties for more than 3 prefixes.
[ bp: Sync up with the version in tools/ while at it. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230515093020.661756940@infradead.org
When writing a task id to the "tasks" file in an rdtgroup,
rdtgroup_tasks_write() treats the pid as a number in the current pid
namespace. But when reading the "tasks" file, rdtgroup_tasks_show() shows
the list of global pids from the init namespace, which is confusing and
incorrect.
To be more robust, let the "tasks" file only show pids in the current pid
namespace.
Fixes: e02737d5b8 ("x86/intel_rdt: Add tasks files")
Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Fenghua Yu <fenghua.yu@intel.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/all/20230116071246.97717-1-shawnwang@linux.alibaba.com/
The stack locking and stack assignment macro LOAD_REALMODE_ESP fails to
work when invoked from the 64bit trampoline entry point:
trampoline_start64
trampoline_compat
LOAD_REALMODE_ESP <- lock
Accessing tr_lock is only possible from 16bit mode. For the compat entry
point this needs to be pa_tr_lock so that the required relocation entry is
generated. Otherwise it locks the non-relocated address which is
aside of being wrong never cleared in secondary_startup_64() causing all
but the first CPU to get stuck on the lock.
Make the macro take an argument lock_pa which defaults to 0 and rename it
to LOCK_AND_LOAD_REALMODE_ESP to make it clear what this is about.
Fixes: f6f1ae9128 ("x86/smpboot: Implement a bit spinlock to protect the realmode stack")
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/87h6rujdvl.ffs@tglx
Marking primary threads in the cpumask during early boot is only correct in
certain configurations, but broken e.g. for the legacy hyperthreading
detection.
This is due to the complete mess in the CPUID evaluation code which
initializes smp_num_siblings only half during early init and fixes it up
later when identify_boot_cpu() is invoked.
So using smp_num_siblings before identify_boot_cpu() leads to incorrect
results.
Fixing the early CPU init code to provide the proper data is a larger scale
surgery as the code has dependencies on data structures which are not
initialized during early boot.
Move the initialization of cpu_primary_thread_mask wich depends on
smp_num_siblings being correct to an early initcall so that it is set up
correctly before SMP bringup.
Fixes: f54d4434c2 ("x86/apic: Provide cpu_primary_thread mask")
Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/87sfbhlwp9.ffs@tglx
Clang warns:
arch/x86/lib/csum-partial_64.c:74:20: error: variable 'result' is uninitialized when used here [-Werror,-Wuninitialized]
return csum_tail(result, temp64, odd);
^~~~~~
arch/x86/lib/csum-partial_64.c:48:22: note: initialize the variable 'result' to silence this warning
unsigned odd, result;
^
= 0
1 error generated.
The only initialization and uses of result in csum_partial() were moved
into csum_tail() but result is still being passed by value to
csum_tail() (clang's -Wuninitialized does not do interprocedural
analysis to realize that result is always assigned in csum_tail()
however). Sink the declaration of result into csum_tail() to clear up
the warning.
Closes: https://lore.kernel.org/202305262039.3HUYjWJk-lkp@intel.com/
Fixes: 688eb8191b ("x86/csum: Improve performance of `csum_partial`")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230526-csum_partial-wuninitialized-v1-1-ebc0108dcec1%40kernel.org
- Prevent a bogus setting for the number of HT siblings, which is caused
by the CPUID evaluation trainwreck of X86. That recomputes the value
for each CPU, so the last CPU "wins". That can cause completely bogus
sibling values.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRzBxMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodfbEACp+oRBD2zIcmf8kpakMxIbH2CkDAxl
AgqGHYVKAvsqKPQZhYmOFEsk+0isMzssuVaub9gMeLRmpMQxUqv7K20pluvWmuQm
h7MTYEAvCZpBU2BGB1mixp57tQ/gpLSB4uaJU2Q5hh9po6uImvalIdqThf7ArxgA
mm3BhtbFObLGwRaV2981zdsEs8Cjs9RusnAOrX5LbBBb6ibcnrqWy7DAhySMA0zr
7RfaGyG/T9z6/BgX54FBQ7aDNU/RkZ8YYQoMj0kAFXdM3V+sa3oPfMRQpACPHKm4
kwskJfWVdMq06kgOD7N9+WrOfBAgIsmzasT6VhTuGAGNo4CiEOthLqXDHwf4NLhN
hj0XReHVyBiQ1KuI2lGNJ71c30KRkh7SCdGNiEFtnlpteXnS4bRnzWtfMf/+2SCC
WOziRRYRzuAvvoTwoQS1BYJrDAB9f5jOX8FTYbC12rEk/RPOB6t4iFNLUzid7u1H
yPYphoftXlQlli7EajVc5pSZCftMHRCzWernptiPFU5tMvyagYppcg25JY54rquC
CTJQqoa/HPAEZhBjgZElwO9QBgu+VyJUq3R/Z+LbkQuz5z+PVDfqeDuDWhaqbcz1
WGtdGdNo4bqVhwqgwyHjgKCFe05mF6tPuotnYZE/VSkXUCvLkjGEVg8HQ7WiPhJd
IYDqxkIhX5h1Mg==
=l43p
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu fix from Thomas Gleixner:
"A single fix for x86:
- Prevent a bogus setting for the number of HT siblings, which is
caused by the CPUID evaluation trainwreck of X86. That recomputes
the value for each CPU, so the last CPU "wins". That can cause
completely bogus sibling values"
* tag 'x86-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/topology: Fix erroneous smp_num_siblings on Intel Hybrid platforms
- Make the CHA discovery based on MSR readout to work around broken
discovery tables in some SPR firmwares.
- Prevent saving PEBS configuration which has software bits set that
cause a crash when restored into the relevant MSR.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRzBlETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYofUsEAC6VMmscn6MDmCpO71+o80IhFUeN/Pa
/Qu4kKCQmeT/gonMG+bKCYSNobgdi4a7KNDRZypTjzMORTDmS3lT4aTbzy3ry+eI
7rUTolAo3HiY0ZwPhU/evrGDf0O2d/19nd9pXiLZFinH9WsMf7mKsXhAJvlq6SKu
SliGCQlB7+0bicdYgBkJqWPUbKY7QvcCcXiPx1Ujxg7CbMTcuj7MLzu/TVYmwZYH
Je/k+vBnhl54PF9nGiqNYZGBPh4cBgkpHEFGFrvwplBRBdwHrjapNxhIRcA2N66u
mr+f7Cl6StOP16Bn4dzG9nHLVTDz5NVgsbtPStVTcOoIzfOhch7mOL+tu/pafXdt
zRL7U6usfbQqt1rcBBtPoeFlEifQwHoz3ZGFSyIZn5IRfXtXlDJHyPxlRYMB5/uz
WjR10GzE0tZFLetvW4PdlMHcwwrOZNU+crvADmCk4KzzVDip9QLbomiGfxHVHI2J
Fjstd6ZaNkuDHRy/r76YfwRDDzQWPvtsaWRLzD4aFQQxYdHHZOabBittK9UvENvu
MjLRO+1ymN52Ap6TdcK9ZMNpb8ol/Xn5aYivNRlHsUjJ8RxJj94sElIlNHLI2yoa
hICdMcGSCd7H+FIkXdgT2O8OBxEsf139xSrG2oSTOZFCjkR1BSxtyqNY47MFVSiA
KBtbnJMkde48pg==
=Z3GU
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A small set of perf fixes:
- Make the MSR-readout based CHA discovery work around broken
discovery tables in some SPR firmwares.
- Prevent saving PEBS configuration which has software bits set that
cause a crash when restored into the relevant MSR"
* tag 'perf-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/uncore: Correct the number of CHAs on SPR
perf/x86/intel: Save/restore cpuc->active_pebs_data_cfg when using guest PEBS
- Ensure that the stack pointer on x86 is aligned again so that the
unwinder does not read past the end of the stack
- Discard .note.gnu.property section which has a pointlessly different
alignment than the other note sections. That confuses tooling of all
sorts including readelf, libbpf and pahole.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRzBW0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoczHD/4vCopMPBFsT5w5HhSTQSX5Kglx3UQL
g+qa0Z/uJKLpMgGkBzWBQGXHH/UqyY4sk5T3DDOkVQjdSfD+zmJu4R44wAonEU+y
SDW2MeBciyp1WtjwPNWNY9QgDUQj7Cr2dSLTVqBR9akladTcj7EKHApq0pgaqsbi
MnmfXzPyH17PRfyW8FbsBjdQvhpkYZSZntQ0cU+W5VtsUtZvg2w4I8IU76ri9L0E
OYBbk2zV8RabGVJBv79fnm4fXb78+Zkp/xvs5sTBRKRS+3T5FNNJjo8tNp1AtxdF
eJMlOeY7PjSFyAL6+/+/uRqYNCnH4Rw2MlMqJJIhzKYCw7BFo0FAhi+mro63Z85b
byGuriv1g7suOgOCK8IzTl7e1nTDP4YCZ0cMDDvcRceIiqR6Rrk3C/B5cBVz0Bng
iFYFUJSlLT4G+tGZwupV2UUGsCwS5+vVCG/FVWes8Mz4g4zr2Dmrh6YMLzrcDshE
eAY2yKMypD/JD9cJa4KOVJuFARaSDJDiBpsO1BjRC6MxLFaw1biB0mE2hkwn3LJe
fLVvU/sWnIQ+dQoy7+GBzm7Pi5SmKBuLDsgOt/ocbCwfUwUQMGf+I7bNg+08UOOS
dNzM6/agdd8rLZTS/Ma7EvnA367ZwVOb5COVMqwxAoKZbPkvgz1xvRGkFQCHYapx
BwMILF9FIM0kvg==
=TPwY
-----END PGP SIGNATURE-----
Merge tag 'objtool-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull unwinder fixes from Thomas Gleixner:
"A set of unwinder and tooling fixes:
- Ensure that the stack pointer on x86 is aligned again so that the
unwinder does not read past the end of the stack
- Discard .note.gnu.property section which has a pointlessly
different alignment than the other note sections. That confuses
tooling of all sorts including readelf, libbpf and pahole"
* tag 'objtool-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/show_trace_log_lvl: Ensure stack pointer is aligned, again
vmlinux.lds.h: Discard .note.gnu.property section
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZHGVRAAKCRCAXGG7T9hj
vqtJAQDizKasLE7tSnfs/FrZ/4xPaDLe3bQifMx2C1dtYCjRcAD/ciZSa1L0WzZP
dpEZnlYRzsR3bwLktQEMQFOvlbh1SwE=
=K860
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- a double free fix in the Xen pvcalls backend driver
- a fix for a regression causing the MSI related sysfs entries to not
being created in Xen PV guests
- a fix in the Xen blkfront driver for handling insane input data
better
* tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/pci/xen: populate MSI sysfs entries
xen/pvcalls-back: fix double frees with pvcalls_new_active_socket()
xen/blkfront: Only check REQ_FUA for writes
Use kvm_is_cr4_bit_set() to read guest CR4.UMIP when sanity checking that
a descriptor table VM-Exit occurs if and only if guest.CR4.UMIP=1. UMIP
can't be guest-owned, i.e. using kvm_read_cr4_bits() to decache guest-
owned bits isn't strictly necessary, but eliminating raw reads of
vcpu->arch.cr4 is desirable as it makes it easy to visually audit KVM for
correctness.
Opportunistically add a compile-time assertion that UMIP isn't guest-owned
as letting the guest own UMIP isn't compatible with emulation (or any CR4
bit that is emulated by KVM).
Opportunistically change the WARN_ON() to a ONCE variant. When the WARN
fires, it fires _a lot_, and spamming the kernel logs ends up doing more
harm than whatever led to KVM's unnecessary emulation.
Reported-by: Robert Hoo <robert.hu@intel.com>
Link: https://lore.kernel.org/all/20230310125718.1442088-4-robert.hu@intel.com
Link: https://lore.kernel.org/r/20230413231914.1482782-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Advertise UMIP as emulated if and only if the host doesn't natively
support UMIP, otherwise vmx_umip_emulated() is misleading when the host
_does_ support UMIP. Of the four users of vmx_umip_emulated(), two
already check for native support, and the logic in vmx_set_cpu_caps() is
relevant if and only if UMIP isn't natively supported as UMIP is set in
KVM's caps by kvm_set_cpu_caps() when UMIP is present in hardware.
That leaves KVM's stuffing of X86_CR4_UMIP into the default cr4_fixed1
value enumerated for nested VMX. In that case, checking for (lack of)
host support is actually a bug fix of sorts, as enumerating UMIP support
based solely on descriptor table exiting works only because KVM doesn't
sanity check MSR_IA32_VMX_CR4_FIXED1. E.g. if a (very theoretical) host
supported UMIP in hardware but didn't allow UMIP+VMX, KVM would advertise
UMIP but not actually emulate UMIP. Of course, KVM would explode long
before it could run a nested VM on said theoretical CPU, as KVM doesn't
modify host CR4 when enabling VMX, i.e. would load an "illegal" value into
vmcs.HOST_CR4.
Reported-by: Robert Hoo <robert.hu@intel.com>
Link: https://lore.kernel.org/all/20230310125718.1442088-2-robert.hu@intel.com
Link: https://lore.kernel.org/r/20230413231914.1482782-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the comment about keeping the hosts CR4.MCE loaded in hardware above
the code that actually modifies the hardware CR4 value.
No functional change indented.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230410125017.1305238-3-xiaoyao.li@intel.com
[sean: elaborate in changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Directly use vcpu->arch.cr4 is not recommended since it gets stale value
if the cr4 is not available.
Use kvm_read_cr4() instead to ensure correct value.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230410125017.1305238-2-xiaoyao.li@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
I tried to streamline our user memory copy code fairly aggressively in
commit adfcf4231b ("x86: don't use REP_GOOD or ERMS for user memory
copies"), in order to then be able to clean up the code and inline the
modern FSRM case in commit 577e6a7fd5 ("x86: inline the 'rep movs' in
user copies for the FSRM case").
We had reports [1] of that causing regressions earlier with blogbench,
but that turned out to be a horrible benchmark for that case, and not a
sufficient reason for re-instating "rep movsb" on older machines.
However, now Eric Dumazet reported [2] a regression in performance that
seems to be a rather more real benchmark, where due to the removal of
"rep movs" a TCP stream over a 100Gbps network no longer reaches line
speed.
And it turns out that with the simplified the calling convention for the
non-FSRM case in commit 427fda2c8a ("x86: improve on the non-rep
'copy_user' function"), re-introducing the ERMS case is actually fairly
simple.
Of course, that "fairly simple" is glossing over several missteps due to
having to fight our assembler alternative code. This code really wanted
to rewrite a conditional branch to have two different targets, but that
made objtool sufficiently unhappy that this instead just ended up doing
a choice between "jump to the unrolled loop, or use 'rep movsb'
directly".
Let's see if somebody finds a case where the kernel memory copies also
care (see commit 68674f94ffc9: "x86: don't use REP_GOOD or ERMS for
small memory copies"). But Eric does argue that the user copies are
special because networking tries to copy up to 32KB at a time, if
order-3 pages allocations are possible.
In-kernel memory copies are typically small, unless they are the special
"copy pages at a time" kind that still use "rep movs".
Link: https://lore.kernel.org/lkml/202305041446.71d46724-yujie.liu@intel.com/ [1]
Link: https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/ [2]
Reported-and-tested-by: Eric Dumazet <edumazet@google.com>
Fixes: adfcf4231b ("x86: don't use REP_GOOD or ERMS for user memory copies")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add assertion to track that "mmu == vcpu->arch.mmu" is always true in the
context of __kvm_mmu_invalidate_addr(). for_each_shadow_entry_using_root()
and kvm_sync_spte() operate on vcpu->arch.mmu, but the only reason that
doesn't cause explosions is because handle_invept() frees roots instead of
doing a manual invalidation. As of now, there are no major roadblocks
to switching INVEPT emulation over to use kvm_mmu_invalidate_addr().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230523032947.60041-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Commit aee98a6838 ("KVM: x86/mmu: Use try_cmpxchg64 in
tdp_mmu_set_spte_atomic") removed the comment that iter->old_spte is
updated when different logical CPU modifies the page table entry.
Although this is what try_cmpxchg does implicitly, it won't hurt
if this fact is explicitly mentioned in a restored comment.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20230425113932.3148-1-ubizjak@gmail.com
[sean: extend comment above try_cmpxchg64()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
1) Add special case for len == 40 as that is the hottest value. The
nets a ~8-9% latency improvement and a ~30% throughput improvement
in the len == 40 case.
2) Use multiple accumulators in the 64-byte loop. This dramatically
improves ILP and results in up to a 40% latency/throughput
improvement (better for more iterations).
Results from benchmarking on Icelake. Times measured with rdtsc()
len lat_new lat_old r tput_new tput_old r
8 3.58 3.47 1.032 3.58 3.51 1.021
16 4.14 4.02 1.028 3.96 3.78 1.046
24 4.99 5.03 0.992 4.23 4.03 1.050
32 5.09 5.08 1.001 4.68 4.47 1.048
40 5.57 6.08 0.916 3.05 4.43 0.690
48 6.65 6.63 1.003 4.97 4.69 1.059
56 7.74 7.72 1.003 5.22 4.95 1.055
64 6.65 7.22 0.921 6.38 6.42 0.994
96 9.43 9.96 0.946 7.46 7.54 0.990
128 9.39 12.15 0.773 8.90 8.79 1.012
200 12.65 18.08 0.699 11.63 11.60 1.002
272 15.82 23.37 0.677 14.43 14.35 1.005
440 24.12 36.43 0.662 21.57 22.69 0.951
952 46.20 74.01 0.624 42.98 53.12 0.809
1024 47.12 78.24 0.602 46.36 58.83 0.788
1552 72.01 117.30 0.614 71.92 96.78 0.743
2048 93.07 153.25 0.607 93.28 137.20 0.680
2600 114.73 194.30 0.590 114.28 179.32 0.637
3608 156.34 268.41 0.582 154.97 254.02 0.610
4096 175.01 304.03 0.576 175.89 292.08 0.602
There is no such thing as a free lunch, however, and the special case
for len == 40 does add overhead to the len != 40 cases. This seems to
amount to be ~5% throughput and slightly less in terms of latency.
Testing:
Part of this change is a new kunit test. The tests check all
alignment X length pairs in [0, 64) X [0, 512).
There are three cases.
1) Precomputed random inputs/seed. The expected results where
generated use the generic implementation (which is assumed to be
non-buggy).
2) An input of all 1s. The goal of this test is to catch any case
a carry is missing.
3) An input that never carries. The goal of this test si to catch
any case of incorrectly carrying.
More exhaustive tests that test all alignment X length pairs in
[0, 8192) X [0, 8192] on random data are also available here:
https://github.com/goldsteinn/csum-reproduction
The reposity also has the code for reproducing the above benchmark
numbers.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.com
Traditionally, all CPUs in a system have identical numbers of SMT
siblings. That changes with hybrid processors where some logical CPUs
have a sibling and others have none.
Today, the CPU boot code sets the global variable smp_num_siblings when
every CPU thread is brought up. The last thread to boot will overwrite
it with the number of siblings of *that* thread. That last thread to
boot will "win". If the thread is a Pcore, smp_num_siblings == 2. If it
is an Ecore, smp_num_siblings == 1.
smp_num_siblings describes if the *system* supports SMT. It should
specify the maximum number of SMT threads among all cores.
Ensure that smp_num_siblings represents the system-wide maximum number
of siblings by always increasing its value. Never allow it to decrease.
On MeteorLake-P platform, this fixes a problem that the Ecore CPUs are
not updated in any cpu sibling map because the system is treated as an
UP system when probing Ecore CPUs.
Below shows part of the CPU topology information before and after the
fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore).
...
-/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
-/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
+/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
...
-/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
-/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
+/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21
Notice that the "before" 'package_cpus_list' has only one CPU. This
means that userspace tools like lscpu will see a little laptop like
an 11-socket system:
-Core(s) per socket: 1
-Socket(s): 11
+Core(s) per socket: 16
+Socket(s): 1
This is also expected to make the scheduler do rather wonky things
too.
[ dhansen: remove CPUID detail from changelog, add end user effects ]
CC: stable@kernel.org
Fixes: bbb65d2d36 ("x86: use cpuid vector 0xb when available for detecting cpu topology")
Fixes: 95f3d39ccf ("x86/cpu/topology: Provide detect_extended_topology_early()")
Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20230323015640.27906-1-rui.zhang%40intel.com
The number of CHAs from the discovery table on some SPR variants is
incorrect, because of a firmware issue. An accurate number can be read
from the MSR UNC_CBO_CONFIG.
Fixes: 949b11381f ("perf/x86/intel/uncore: Add Sapphire Rapids server CHA support")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Stephane Eranian <eranian@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230508140206.283708-1-kan.liang@linux.intel.com
Commit bf5e758f02 ("genirq/msi: Simplify sysfs handling") reworked the
creation of sysfs entries for MSI IRQs. The creation used to be in
msi_domain_alloc_irqs_descs_locked after calling ops->domain_alloc_irqs.
Then it moved into __msi_domain_alloc_irqs which is an implementation of
domain_alloc_irqs. However, Xen comes with the only other implementation
of domain_alloc_irqs and hence doesn't run the sysfs population code
anymore.
Commit 6c796996ee ("x86/pci/xen: Fixup fallout from the PCI/MSI
overhaul") set the flag MSI_FLAG_DEV_SYSFS for the xen msi_domain_info
but that doesn't actually have an effect because Xen uses it's own
domain_alloc_irqs implementation.
Fix this by making use of the fallback functions for sysfs population.
Fixes: bf5e758f02 ("genirq/msi: Simplify sysfs handling")
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20230503131656.15928-1-mheyne@amazon.de
Signed-off-by: Juergen Gross <jgross@suse.com>
The GFNI routines in the AVX version of the ARIA implementation now use
explicit VMOVDQA instructions to load the constant input vectors, which
means they must be 16 byte aligned. So ensure that this is the case, by
dropping the section split and the incorrect .align 8 directive, and
emitting the constants into the 16-byte aligned section instead.
Note that the AVX2 version of this code deviates from this pattern, and
does not require a similar fix, given that it loads these contants as
8-byte memory operands, for which AVX2 permits any alignment.
Cc: Taehee Yoo <ap420073@gmail.com>
Fixes: 8b84475318 ("crypto: x86/aria-avx - Do not use avx2 instructions")
Reported-by: syzbot+a6abcf08bad8b18fd198@syzkaller.appspotmail.com
Tested-by: syzbot+a6abcf08bad8b18fd198@syzkaller.appspotmail.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
TDX reuses VMEXIT "reasons" in its guest->host hypercall ABI. This is
confusing because there might not be a VMEXIT involved at *all*.
These instances are supposed to document situation and reduce confusion
by wrapping VMEXIT reasons with hcall_func().
The decompression code does not follow this convention.
Unify the TDX decompression code with the other TDX use of VMEXIT reasons.
No functional changes.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20230505120332.1429957-1-nik.borisov%40suse.com
After commit b752ea0c28 ("perf/x86/intel/ds: Flush PEBS DS when changing
PEBS_DATA_CFG"), the cpuc->pebs_data_cfg may save some bits that are not
supported by real hardware, such as PEBS_UPDATE_DS_SW. This would cause
the VMX hardware MSR switching mechanism to save/restore invalid values
for PEBS_DATA_CFG MSR, thus crashing the host when PEBS is used for guest.
Fix it by using the active host value from cpuc->active_pebs_data_cfg.
Fixes: b752ea0c28 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
Signed-off-by: Like Xu <likexu@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20230517133808.67885-1-likexu@tencent.com
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRr6MMACgkQaDWVMHDJ
krAaIA/+IIV2rijvxzlnN41RksfiuI8lmDXuOeAqFCFVAukm6/faiDHmXQC/4ipw
JTdgq4JbB6wTZmYsijSMdwgWDt3BGXZAAXK7sSOXLNRkaf0EuPk6h59P7QXBAkRD
A6ibjWPipV6/k/Mczm8o+WnTSHUln6CVz8yB13gF3eB6+QMO7yYznL4EJ4iUBdZC
OWCukkI8zT07p1u7nqpOmTixXZEqSauC3qB9tg856w8UJoS1evOocgExZr2d8GT3
Ex9Q+Hj8ULkM6tDeke9mfoDDEdHtHDkQIMH5L63siTc0LzvRIs/EZEnZGdZWNy9F
zw6AaIShzJQ9Z9KTP/SRxQ3nvUh/UuI2z7KnZLLxHYqLZZVMNQtinK+ZDGlzvvqg
xr3ZI2PXUXoQT8Mg0BzzpeLwuuSXS8DFyj63SErValRuyIQQs07MhIRFiU3abQtq
A4nfXMzfCwfRn5ggm9HIPLKOtiw1ZlsQtDxmHULpXsUqGgBxEfmCF5f210TmkOhN
yrNlDmNBr39j1xlUeFlJRXWm4badlo39G8s/70oqFTushvatfet7Gyt7fZseGhJL
KNQCfxyCu8bYYmtYfuW6WTc8nq5sJmnTdWEjcT0ftbgDeYNtBjfKfhJ+8+q04FWl
wBxu5nqfKMT1bAKIOLse9eVbaMXNuHJJkStQZAZP5/wUMoVvfHI=
=TZgh
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Dave Hansen:
"This works around and issue where the INVLPG instruction may miss
invalidating kernel TLB entries in recent hybrid CPUs.
I do expect an eventual microcode fix for this. When the microcode
version numbers are known, we can circle back around and add them the
model table to disable this workaround"
* tag 'x86_urgent_for_6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Avoid incomplete Global INVLPG flushes
"x86/smpboot: Support parallel startup of secondary CPUs" adds the first use
of X2APIC_ENABLE in assembly, but older binutils don't tolerate the UL suffix.
Switch to using BIT() instead.
Fixes: 7e75178a09 ("x86/smpboot: Support parallel startup of secondary CPUs")
Reported-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Link: https://lore.kernel.org/r/20230522105738.2378364-1-andrew.cooper3@citrix.com
* Plug a race in the stage-2 mapping code where the IPA and the PA
would end up being out of sync
* Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...)
* FP/SVE/SME documentation update, in the hope that this field
becomes clearer...
* Add workaround for Apple SEIS brokenness to a new SoC
* Random comment fixes
x86:
* add MSR_IA32_TSX_CTRL into msrs_to_save
* fixes for XCR0 handling in SGX enclaves
Generic:
* Fix vcpu_array[0] races
* Fix race between starting a VM and "reboot -f"
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRp0WIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPqVwf+OFayNPpURAFqfrOuISYW7hoCL24+
sCtXyVv4Ei0np1vGekit2h/m8GmxO12xEBibcFeYj+YQItIqu9HvC08fRxAKaMeE
N3p9iLuS1zcM3cEuZpg0r6QN+pKybttdadl70yho43CtagEM4FmB7dgyAo9AhyXk
pZUaVfoO6beBQ/J6A6V/Q5xlue1LvHk1+K4rmNcYVTYn6ZOd+yYgvqng1nv5/h9b
0HgW0aUWkEHAB67/sSnUUro707loMNTowsZlMCtgDk4Fzf8RwQ7qc8lClLk1UPjJ
DHB6Hif9F0Q5mkrwn+c7xyVlKARaY6/FOshS2Q620q19+4fq5fUD9HgjrQ==
=ARzH
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Plug a race in the stage-2 mapping code where the IPA and the PA
would end up being out of sync
- Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...)
- FP/SVE/SME documentation update, in the hope that this field
becomes clearer...
- Add workaround for Apple SEIS brokenness to a new SoC
- Random comment fixes
x86:
- add MSR_IA32_TSX_CTRL into msrs_to_save
- fixes for XCR0 handling in SGX enclaves
Generic:
- Fix vcpu_array[0] races
- Fix race between starting a VM and 'reboot -f'"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: add MSR_IA32_TSX_CTRL into msrs_to_save
KVM: x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM)
KVM: VMX: Don't rely _only_ on CPUID to enforce XCR0 restrictions for ECREATE
KVM: Fix vcpu_array[0] races
KVM: VMX: Fix header file dependency of asm/vmx.h
KVM: Don't enable hardware after a restart/shutdown is initiated
KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown
KVM: arm64: vgic: Add Apple M2 PRO/MAX cpus to the list of broken SEIS implementations
KVM: arm64: Clarify host SME state management
KVM: arm64: Restructure check for SVE support in FP trap handler
KVM: arm64: Document check for TIF_FOREIGN_FPSTATE
KVM: arm64: Fix repeated words in comments
KVM: arm64: Constify start/end/phys fields of the pgtable walker data
KVM: arm64: Infer PA offset from VA in hyp map walker
KVM: arm64: Infer the PA offset from IPA in stage-2 map walker
KVM: arm64: Use the bitmap API to allocate bitmaps
KVM: arm64: Slightly optimize flush_context()
Add MSR_IA32_TSX_CTRL into msrs_to_save[] to explicitly tell userspace to
save/restore the register value during migration. Missing this may cause
userspace that relies on KVM ioctl(KVM_GET_MSR_INDEX_LIST) fail to port the
value to the target VM.
In addition, there is no need to add MSR_IA32_TSX_CTRL when
ARCH_CAP_TSX_CTRL_MSR is not supported in kvm_get_arch_capabilities(). So
add the checking in kvm_probe_msr_to_save().
Fixes: c11f83e062 ("KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality")
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20230509032348.1153070-1-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop KVM's manipulation of guest's CPUID.0x12.1 ECX and EDX, i.e. the
allowed XFRM of SGX enclaves, now that KVM explicitly checks the guest's
allowed XCR0 when emulating ECREATE.
Note, this could theoretically break a setup where userspace advertises
a "bad" XFRM and relies on KVM to provide a sane CPUID model, but QEMU
is the only known user of KVM SGX, and QEMU explicitly sets the SGX CPUID
XFRM subleaf based on the guest's XCR0.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230503160838.3412617-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly check the vCPU's supported XCR0 when determining whether or not
the XFRM for ECREATE is valid. Checking CPUID works because KVM updates
guest CPUID.0x12.1 to restrict the leaf to a subset of the guest's allowed
XCR0, but that is rather subtle and KVM should not modify guest CPUID
except for modeling true runtime behavior (allowed XFRM is most definitely
not "runtime" behavior).
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230503160838.3412617-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Include a definition of WARN_ON_ONCE() before using it.
Fixes: bb1fcc70d9 ("KVM: nVMX: Allow L1 to use 5-level page walks for nested EPT")
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jacob Xu <jacobhxu@google.com>
[reworded commit message; changed <asm/bug.h> to <linux/bug.h>]
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220225012959.1554168-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two functions in the olpc platform that have no prototype:
arch/x86/platform/olpc/olpc_dt.c:237:13: error: no previous prototype for 'olpc_dt_fixup' [-Werror=missing-prototypes]
arch/x86/platform/olpc/olpc-xo1-pm.c:73:26: error: no previous prototype for 'xo1_do_sleep' [-Werror=missing-prototypes]
The first one should just be marked 'static' as there are no other
callers, while the second one is called from assembler and is
just a false-positive warning that can be silenced by adding a
prototype.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-21-arnd%40kernel.org
arch_wb_cache_pmem() is declared in a global header but defined by
the architecture. On x86, the implementation needs to include the header
to avoid this warning:
arch/x86/lib/usercopy_64.c:39:6: error: no previous prototype for 'arch_wb_cache_pmem' [-Werror=missing-prototypes]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-18-arnd%40kernel.org
__vdso_getcpu is declared in a header but this is not included
before the definition, causing a W=1 warning:
arch/x86/entry/vdso/vgetcpu.c:13:1: error: no previous prototype for '__vdso_getcpu' [-Werror=missing-prototypes]
arch/x86/entry/vdso/vdso32/../vgetcpu.c:13:1: error: no previous prototype for '__vdso_getcpu' [-Werror=missing-prototypes]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-17-arnd%40kernel.org
copy_mc_fragile_handle_tail() is only called from assembler,
but 'make W=1' complains about a missing prototype:
arch/x86/lib/copy_mc.c:26:1: warning: no previous prototype for ‘copy_mc_fragile_handle_tail’ [-Wmissing-prototypes]
26 | copy_mc_fragile_handle_tail(char *to, char *from, unsigned len)
Add the prototype to avoid the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-16-arnd%40kernel.org
fb_is_primary_device() is defined as a global function on x86, unlike
the others that have an inline version. The file that defines is
however needs to include the declaration to avoid a warning:
arch/x86/video/fbdev.c:14:5: error: no previous prototype for 'fb_is_primary_device' [-Werror=missing-prototypes]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-15-arnd%40kernel.org
The 32-bit system call entry points can be called on both 32-bit
and 64-bit kernels, but on the former the declarations are hidden:
arch/x86/entry/common.c:238:24: error: no previous prototype for 'do_SYSENTER_32' [-Werror=missing-prototypes]
Move them all out of the #ifdef block to avoid the warnings.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-12-arnd%40kernel.org
arch_pnpbios_disabled() is defined in architecture code on x86, but this
does not include the appropriate header, causing a warning:
arch/x86/kernel/platform-quirks.c:42:13: error: no previous prototype for 'arch_pnpbios_disabled' [-Werror=missing-prototypes]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-10-arnd%40kernel.org
The set_highmem_pages_init() function is declared in asm/numa.h, which
must be included in the file that defines it to avoid a W=1 warning:
arch/x86/mm/highmem_32.c:7:13: error: no previous prototype for 'set_highmem_pages_init' [-Werror=missing-prototypes]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-9-arnd%40kernel.org
Two functions in the 32-bit doublefault code are lacking a prototype:
arch/x86/kernel/doublefault_32.c:23:36: error: no previous prototype for 'doublefault_shim' [-Werror=missing-prototypes]
23 | asmlinkage noinstr void __noreturn doublefault_shim(void)
| ^~~~~~~~~~~~~~~~
arch/x86/kernel/doublefault_32.c:114:6: error: no previous prototype for 'doublefault_init_cpu_tss' [-Werror=missing-prototypes]
114 | void doublefault_init_cpu_tss(void)
The first one is only called from assembler, while the second one is
declared in doublefault.h, but this file is not included.
Include the header file and add the other declaration there as well.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-8-arnd%40kernel.org
The fpregs_soft_set/fpregs_soft_get functions are declared in a
header that is not included in the file that defines them, causing
a W=1 warning:
/home/arnd/arm-soc/arch/x86/math-emu/fpu_entry.c:638:5: error: no previous prototype for 'fpregs_soft_set' [-Werror=missing-prototypes]
638 | int fpregs_soft_set(struct task_struct *target,
| ^~~~~~~~~~~~~~~
/home/arnd/arm-soc/arch/x86/math-emu/fpu_entry.c:690:5: error: no previous prototype for 'fpregs_soft_get' [-Werror=missing-prototypes]
690 | int fpregs_soft_get(struct task_struct *target,
Include the file here to avoid the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-7-arnd%40kernel.org
'make W=1' warns about a function without a prototype in the x86-32 head code:
arch/x86/kernel/head32.c:72:13: error: no previous prototype for 'mk_early_pgtbl_32' [-Werror=missing-prototypes]
This is called from assembler code, so it does not actually need a prototype.
I could not find an appropriate header for it, so just declare it in front
of the definition to shut up the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-6-arnd%40kernel.org
Two functions in this file are global but have no prototype in
a header and are not called from elsewhere, so they should
be static:
arch/x86/pci/ce4100.c:86:6: error: no previous prototype for 'sata_revid_init' [-Werror=missing-prototypes]
arch/x86/pci/ce4100.c:175:5: error: no previous prototype for 'bridge_read' [-Werror=missing-prototypes]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-3-arnd%40kernel.org
On 32-bit builds, the prepare_ftrace_return() function only has a global
definition, but no prototype before it, which causes a warning:
arch/x86/kernel/ftrace.c:625:6: warning: no previous prototype for ‘prepare_ftrace_return’ [-Wmissing-prototypes]
625 | void prepare_ftrace_return(unsigned long ip, unsigned long *parent,
Move the prototype that is already needed for some configurations into
a header file where it can be seen unconditionally.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/all/20230516193549.544673-2-arnd%40kernel.org
- Initialize 'ret' local variables on fprobe_handler() to fix the smatch
warning. With this, fprobe function exit handler is not working
randomly.
- Fix to use preempt_enable/disable_notrace for rethook handler to
prevent recursive call of fprobe exit handler (which is based on
rethook)
- Fix recursive call issue on fprobe_kprobe_handler().
- Fix to detect recursive call on fprobe_exit_handler().
- Fix to make all arch-dependent rethook code notrace.
(the arch-independent code is already notrace)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmRmKgQACgkQ2/sHvwUr
PxvlCgf+OJk5O9IJlTgqDV6JNPsTzFS7qqyAyQmZW9Bj8STfWAIRxa0zeGbZE58K
5LwgzAj+SqzYRwIvzzZ3xsA5j7f1Wj7wG0TQgmpnIW+hprwDrLsUhoZ5s1D/Ojel
A4rAnqCrgnh5m5SenU2QCUngGKn004j4RASaZvRELDyvyIkBSqNhswCH8ZWGPror
KuCu5AmEnFagYl0lmNL3H2aCITAg3QEK+fE6iR+lYsqfR3xbs4YAcqiylHBdY0wX
ssK7LVdRmv7O6TxSj4P2ohDvLJP3eL9bVirsJpg0OVbqWJCs65T2rJJjXiKojYXf
vSVWFJFK5oV98ZHfXTG9R7x0DEwc+g==
=jO68
-----END PGP SIGNATURE-----
Merge tag 'probes-fixes-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- Initialize 'ret' local variables on fprobe_handler() to fix the
smatch warning. With this, fprobe function exit handler is not
working randomly.
- Fix to use preempt_enable/disable_notrace for rethook handler to
prevent recursive call of fprobe exit handler (which is based on
rethook)
- Fix recursive call issue on fprobe_kprobe_handler()
- Fix to detect recursive call on fprobe_exit_handler()
- Fix to make all arch-dependent rethook code notrace (the
arch-independent code is already notrace)"
* tag 'probes-fixes-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rethook, fprobe: do not trace rethook related functions
fprobe: add recursion detection in fprobe_exit_handler
fprobe: make fprobe_kprobe_handler recursion free
rethook: use preempt_{disable, enable}_notrace in rethook_trampoline_handler
tracing: fprobe: Initialize ret valiable to fix smatch error
Replace include statements for <asm/fb.h> with <linux/fb.h>. Fixes
the coding style: if a header is available in asm/ and linux/, it
is preferable to include the header from linux/. This only affects
a few source files, most of which already include <linux/fb.h>.
Suggested-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Sui Jingfeng <suijingfeng@loongson.cn>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20230512102444.5438-6-tzimmermann@suse.de
These functions are already marked as NOKPROBE to prevent recursion and
we have the same reason to blacklist them if rethook is used with fprobe,
since they are beyond the recursion-free region ftrace can guard.
Link: https://lore.kernel.org/all/20230517034510.15639-5-zegao@tencent.com/
Fixes: f3a112c0c4 ("x86,rethook,kprobes: Replace kretprobe with rethook on x86")
Signed-off-by: Ze Gao <zegao@tencent.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When writing EFI variables, one might get errors with no other message
on why it fails. Being able to see how much is used by EFI variables
helps analyzing such issues.
Since this is not a conventional filesystem, block size is intentionally
set to 1 instead of PAGE_SIZE.
x86 quirks of reserved size are taken into account; so that available
and free size can be different, further helping debugging space issues.
With this patch, one can see the remaining space in EFI variable storage
via efivarfs, like this:
$ df -h /sys/firmware/efi/efivars/
Filesystem Size Used Avail Use% Mounted on
efivarfs 176K 106K 66K 62% /sys/firmware/efi/efivars
Signed-off-by: Anisse Astier <an.astier@criteo.com>
[ardb: - rename efi_reserved_space() to efivar_reserved_space()
- whitespace/coding style tweaks]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
The INVLPG instruction is used to invalidate TLB entries for a
specified virtual address. When PCIDs are enabled, INVLPG is supposed
to invalidate TLB entries for the specified address for both the
current PCID *and* Global entries. (Note: Only kernel mappings set
Global=1.)
Unfortunately, some INVLPG implementations can leave Global
translations unflushed when PCIDs are enabled.
As a workaround, never enable PCIDs on affected processors.
I expect there to eventually be microcode mitigations to replace this
software workaround. However, the exact version numbers where that
will happen are not known today. Once the version numbers are set in
stone, the processor list can be tweaked to only disable PCIDs on
affected processors with affected microcode.
Note: if anyone wants a quick fix that doesn't require patching, just
stick 'nopcid' on your kernel command-line.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Add a linker assertion and compute the 0xcc padding dynamically so that
__x86_return_thunk is always cacheline-aligned. Leave the SYM_START()
macro in as the untraining doesn't need ENDBR annotations anyway.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Link: https://lore.kernel.org/r/20230515140726.28689-1-bp@alien8.de
The arch_report_meminfo() function is provided by four architectures,
with a __weak fallback in procfs itself. On architectures that don't
have a custom version, the __weak version causes a warning because
of the missing prototype.
Remove the architecture specific prototypes and instead add one
in linux/proc_fs.h.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for arch/x86
Acked-by: Helge Deller <deller@gmx.de> # parisc
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Message-Id: <20230516195834.551901-1-arnd@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Sometimes the one-line ORC unwinder warnings aren't very helpful. Add a
new 'unwind_debug' cmdline option which will dump the full stack
contents of the current task when an error condition is encountered.
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lore.kernel.org/r/6afb9e48a05fd2046bfad47e69b061b43dfd0e0e.1681331449.git.jpoimboe@kernel.org
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
The commit e335bb51cc ("x86/unwind: Ensure stack pointer is aligned")
tried to align the stack pointer in show_trace_log_lvl(), otherwise the
"stack < stack_info.end" check can't guarantee that the last read does
not go past the end of the stack.
However, we have the same problem with the initial value of the stack
pointer, it can also be unaligned. So without this patch this trivial
kernel module
#include <linux/module.h>
static int init(void)
{
asm volatile("sub $0x4,%rsp");
dump_stack();
asm volatile("add $0x4,%rsp");
return -EAGAIN;
}
module_init(init);
MODULE_LICENSE("GPL");
crashes the kernel.
Fixes: e335bb51cc ("x86/unwind: Ensure stack pointer is aligned")
Signed-off-by: Vernon Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20230512104232.GA10227@redhat.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Make sure that machine check errors with a usable address are properly
marked as poison.
This is needed for errors that occur on memory which have
MCG_STATUS[RIPV] clear - i.e., the interrupted process cannot be
restarted reliably. One example is data poison consumption through the
instruction fetch units on AMD Zen-based systems.
The MF_MUST_KILL flag is passed to memory_failure() when
MCG_STATUS[RIPV] is not set. So the associated process will still be
killed. What this does, practically, is get rid of one more check to
kill_current_task with the eventual goal to remove it completely.
Also, make the handling identical to what is done on the notifier path
(uc_decode_notifier() does that address usability check too).
The scenario described above occurs when hardware can precisely identify
the address of poisoned memory, but execution cannot reliably continue
for the interrupted hardware thread.
[ bp: Massage commit message. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20230322005131.174499-1-tony.luck@intel.com
While discussing to change the visibility of X86_FEATURE_NAMES (see Link)
in order to remove CONFIG_EMBEDDED, Boris suggested to simply make the
X86_FEATURE_NAMES functionality unconditional.
As the need for really tiny kernel images has gone away and kernel images
with !X86_FEATURE_NAMES are hardly tested, remove this config and the whole
ifdeffery in the source code.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20230509084007.24373-1-lukas.bulwahn@gmail.com/
Link: https://lore.kernel.org/r/20230510065713.10996-3-lukas.bulwahn@gmail.com
While discussing to change the visibility of X86_FEATURE_NAMES (see Link)
in order to remove CONFIG_EMBEDDED, Boris suggested to simply make the
X86_FEATURE_NAMES functionality unconditional.
As a first step, make X86_FEATURE_NAMES disappear from Kconfig. So, as
X86_FEATURE_NAMES defaults to yes, to disable it, one now needs to
modify the .config file before compiling the kernel.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20230509084007.24373-1-lukas.bulwahn@gmail.com/
Implement the validation function which tells the core code whether
parallel bringup is possible.
The only condition for now is that the kernel does not run in an encrypted
guest as these will trap the RDMSR via #VC, which cannot be handled at that
point in early startup.
There was an earlier variant for AMD-SEV which used the GHBC protocol for
retrieving the APIC ID via CPUID, but there is no guarantee that the
initial APIC ID in CPUID is the same as the real APIC ID. There is no
enforcement from the secure firmware and the hypervisor can assign APIC IDs
as it sees fit as long as the ACPI/MADT table is consistent with that
assignment.
Unfortunately there is no RDMSR GHCB protocol at the moment, so enabling
AMD-SEV guests for parallel startup needs some more thought.
Intel-TDX provides a secure RDMSR hypercall, but supporting that is outside
the scope of this change.
Fixup announce_cpu() as e.g. on Hyper-V CPU1 is the secondary sibling of
CPU0, which makes the @cpu == 1 logic in announce_cpu() fall apart.
[ mikelley: Reported the announce_cpu() fallout
Originally-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.467571745@linutronix.de
In parallel startup mode the APs are kicked alive by the control CPU
quickly after each other and run through the early startup code in
parallel. The real-mode startup code is already serialized with a
bit-spinlock to protect the real-mode stack.
In parallel startup mode the smpboot_control variable obviously cannot
contain the Linux CPU number so the APs have to determine their Linux CPU
number on their own. This is required to find the CPUs per CPU offset in
order to find the idle task stack and other per CPU data.
To achieve this, export the cpuid_to_apicid[] array so that each AP can
find its own CPU number by searching therein based on its APIC ID.
Introduce a flag in the top bits of smpboot_control which indicates that
the AP should find its CPU number by reading the APIC ID from the APIC.
This is required because CPUID based APIC ID retrieval can only provide the
initial APIC ID, which might have been overruled by the firmware. Some AMD
APUs come up with APIC ID = initial APIC ID + 0x10, so the APIC ID to CPU
number lookup would fail miserably if based on CPUID. Also virtualization
can make its own APIC ID assignements. The only requirement is that the
APIC IDs are consistent with the APCI/MADT table.
For the boot CPU or in case parallel bringup is disabled the control bits
are empty and the CPU number is directly available in bit 0-23 of
smpboot_control.
[ tglx: Initial proof of concept patch with bitlock and APIC ID lookup ]
[ dwmw2: Rework and testing, commit message, CPUID 0x1 and CPU0 support ]
[ seanc: Fix stray override of initial_gs in common_cpu_up() ]
[ Oleksandr Natalenko: reported suspend/resume issue fixed in
x86_acpi_suspend_lowlevel ]
[ tglx: Make it read the APIC ID from the APIC instead of using CPUID,
split the bitlock part out ]
Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.411554373@linutronix.de
Parallel AP bringup requires that the APs can run fully parallel through
the early startup code including the real mode trampoline.
To prepare for this implement a bit-spinlock to serialize access to the
real mode stack so that parallel upcoming APs are not going to corrupt each
others stack while going through the real mode startup code.
Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.355425551@linutronix.de
For parallel CPU brinugp it's required to read the APIC ID in the low level
startup code. The virtual APIC base address is a constant because its a
fix-mapped address. Exposing that constant which is composed via macros to
assembly code is non-trivial due to header inclusion hell.
Aside of that it's constant only because of the vsyscall ABI
requirement. Once vsyscall is out of the picture the fixmap can be placed
at runtime.
Avoid header hell, stay flexible and store the address in a variable which
can be exposed to the low level startup code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.299231005@linutronix.de
Make the primary thread tracking CPU mask based in preparation for simpler
handling of parallel bootup.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.186599880@linutronix.de
The x86 CPU bringup state currently does AP wake-up, wait for AP to
respond and then release it for full bringup.
It is safe to be split into a wake-up and and a separate wait+release
state.
Provide the required functions and enable the split CPU bringup, which
prepares for parallel bringup, where the bringup of the non-boot CPUs takes
two iterations: One to prepare and wake all APs and the second to wait and
release them. Depending on timing this can eliminate the wait time
completely.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205257.133453992@linutronix.de
The new AP state tracking and synchronization mechanism in the CPU hotplug
core code allows to remove quite some x86 specific code:
1) The AP alive synchronization based on cpumasks
2) The decision whether an AP can be brought up again
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.529657366@linutronix.de
No point in this conditional voodoo. Un-initializing the lock mechanism is
safe to be called unconditionally even if it was already invoked when the
CPU died.
Remove the invocation of xen_smp_intr_free() as that has been already
cleaned up in xen_cpu_dead_hvm().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.423407127@linutronix.de
Now that the core code drops sparse_irq_lock after the idle thread
synchronized, it's pointless to wait for the AP to mark itself online.
Whether the control CPU runs in a wait loop or sleeps in the core code
waiting for the online operation to complete makes no difference.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.369512093@linutronix.de
Now that the core code drops sparse_irq_lock after the idle thread
synchronized, it's pointless to wait for the AP to mark itself online.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.316417181@linutronix.de
Now that TSC synchronization is SMP function call based there is no reason
to wait for the AP to be set in smp_callin_mask. The control CPU waits for
the AP to set itself in the online mask anyway.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.206394064@linutronix.de
Spin-waiting on the control CPU until the AP reaches the TSC
synchronization is just a waste especially in the case that there is no
synchronization required.
As the synchronization has to run with interrupts disabled the control CPU
part can just be done from a SMP function call. The upcoming AP issues that
call async only in the case that synchronization is required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.148255496@linutronix.de
The usage is in smpboot.c and not in the CPU initialization code.
The XEN_PV usage of cpu_callout_mask is obsolete as cpu_init() not longer
waits and cacheinfo has its own CPU mask now, so cpu_callout_mask can be
made static too.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.091511483@linutronix.de
cpu_callout_mask is used for the stop machine based MTRR/PAT init.
In preparation of moving the BP/AP synchronization to the core hotplug
code, use a private CPU mask for cacheinfo and manage it in the
starting/dying hotplug state.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.035041005@linutronix.de
The synchronization of the AP with the control CPU is a SMP boot problem
and has nothing to do with cpu_init().
Open code cpu_init_secondary() in start_secondary() and move
wait_for_master_cpu() into the SMP boot code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.981999763@linutronix.de
There are four logical parts to what native_cpu_up() does on the BSP (or
on the controlling CPU for a later hotplug):
1) Wake the AP by sending the INIT/SIPI/SIPI sequence.
2) Wait for the AP to make it as far as wait_for_master_cpu() which
sets that CPU's bit in cpu_initialized_mask, then sets the bit in
cpu_callout_mask to let the AP proceed through cpu_init().
3) Wait for the AP to finish cpu_init() and get as far as the
smp_callin() call, which sets that CPU's bit in cpu_callin_mask.
4) Perform the TSC synchronization and wait for the AP to actually
mark itself online in cpu_online_mask.
In preparation to allow these phases to operate in parallel on multiple
APs, split them out into separate functions and document the interactions
a little more clearly in both the BP and AP code paths.
No functional change intended.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.928917242@linutronix.de
Peter stumbled over the barrier() after the invocation of smp_callin() in
start_secondary():
"...this barrier() and it's comment seem weird vs smp_callin(). That
function ends with an atomic bitop (it has to, at the very least it must
not be weaker than store-release) but also has an explicit wmb() to order
setup vs CPU_STARTING.
There is no way the smp_processor_id() referred to in this comment can land
before cpu_init() even without the barrier()."
The barrier() along with the comment was added in 2003 with commit
d8f19f2cac70 ("[PATCH] x86-64 merge") in the history tree. One of those
well documented combo patches of that time which changes world and some
more. The context back then was:
/*
* Dont put anything before smp_callin(), SMP
* booting is too fragile that we want to limit the
* things done here to the most necessary things.
*/
cpu_init();
smp_callin();
+ /* otherwise gcc will move up smp_processor_id before the cpu_init */
+ barrier();
Dprintk("cpu %d: waiting for commence\n", smp_processor_id());
Even back in 2003 the compiler was not allowed to reorder that
smp_processor_id() invocation before the cpu_init() function call.
Especially not as smp_processor_id() resolved to:
asm volatile("movl %%gs:%c1,%0":"=r" (ret__):"i"(pda_offset(field)):"memory");
There is no trace of this change in any mailing list archive including the
back then official x86_64 list discuss@x86-64.org, which would explain the
problem this change solved.
The debug prints are gone by now and the the only smp_processor_id()
invocation today is farther down in start_secondary() after locking
vector_lock which itself prevents reordering.
Even if the compiler would be allowed to reorder this, the code would still
be correct as GSBASE is set up early in the assembly code and is valid when
the CPU reaches start_secondary(), while the code at the time when this
barrier was added did the GSBASE setup in cpu_init().
As the barrier has zero value, remove it.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.875713771@linutronix.de
Now that the CPU0 hotplug cruft is gone, the only user is AMD SEV.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.822234014@linutronix.de
This was introduced with commit e1c467e690 ("x86, hotplug: Wake up CPU0
via NMI instead of INIT, SIPI, SIPI") to eventually support physical
hotplug of CPU0:
"We'll change this code in the future to wake up hard offlined CPU0 if
real platform and request are available."
11 years later this has not happened and physical hotplug is not officially
supported. Remove the cruft.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.768845190@linutronix.de
This was introduced together with commit e1c467e690 ("x86, hotplug: Wake
up CPU0 via NMI instead of INIT, SIPI, SIPI") to eventually support
physical hotplug of CPU0:
"We'll change this code in the future to wake up hard offlined CPU0 if
real platform and request are available."
11 years later this has not happened and physical hotplug is not officially
supported. Remove the cruft.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.715707999@linutronix.de
This is used in the SEV play_dead() implementation to re-online CPUs. But
that has nothing to do with CPU0.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.662319599@linutronix.de
When TSC is synchronized across sockets then there is no reason to
calibrate the delay for the first CPU which comes up on a socket.
Just reuse the existing calibration value.
This removes 100ms pointlessly wasted time from CPU hotplug per socket.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.608773568@linutronix.de
No point in keeping them around.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.551974164@linutronix.de
Make topology_phys_to_logical_pkg_die() static as it's only used in
smpboot.c and fixup the kernel-doc warnings for both functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.493750666@linutronix.de
so that the correct record sizes are used
- Update the sample size for AMD BRS events
- Fix a confusion with using the same on-stack struct with different
events in the event processing path
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmRgzWIACgkQEsHwGGHe
VUpdgw//a1toWyjwrIV1YMu8lEpsrPKpOqIFuDQcLSl1vsYrmTRJ47PI1j/ZTQeo
HgNkEE6lxAa9h/lKAjlE/lACE6Hr59xnQmu0BdG/SS+hlhWkT+oKLEUWz5qD4MuE
bWdpxwHOhMIFR1ASAMThy/mE9V4TKsI/tsd7lMXUo6/skDGCmCGIgRq//3NUB5fV
0ivp5lv6NXFnUwS34Ot3fbWj/be7rr2vkYgN8WbwMAaEbpCIyseh6Tz+5ZRbENfP
dMdh6ryuJ2BJ9BcDe9XlcEvPcaTvz7LVnzOVFz/AnBgtBTIOw/26xt17pgXBH7NK
kpTKQTPp0mnt6ysnX5zYkeumKaxxqvVWaf18AQHkupj1HwggjiEFPnKK9KfslSy4
1tcED/D3i5QLOx+A8lCtA4ACwGl0Cvwgvw98Gp9imLst/zmMKa4MK96BYCodirKJ
iDKN5aFA6c3pKJ4KTE7N6KKFzwhslTrehTHAJIL7BiVw3aMGin6514OnMELZBzam
/zud81OWAKywWWRSwg7wy+K8RGH0R6K5dhwFrrm2BMqAluMq+rX1pRY9pEsL6jDj
bCl45L52IsXZBSz2JTwWHGTssPyeDIe157ICFDOBnIx08u4KzJ+Knxsbaq2Jjs3R
9wm5H9yp/+q7//3XcEkdFjQwDVh2LJkY0QinH+6rPiAseBC9ukU=
=OCba
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Make sure the PEBS buffer is flushed before reprogramming the
hardware so that the correct record sizes are used
- Update the sample size for AMD BRS events
- Fix a confusion with using the same on-stack struct with different
events in the event processing path
* tag 'perf_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG
perf/x86: Fix missing sample size update on AMD BRS
perf/core: Fix perf_sample_data not properly initialized for different swevents in perf_tp_event()
amd_nb.c work for drivers which switch to them. Add a PCI device ID
to k10temp's table so that latter is loaded on such systems too
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmRgx6IACgkQEsHwGGHe
VUpzJQ//WL//vtzAyQyEQQ0HP5cUNwtts79rEWpNyHJD0nFymbOrR7ho97HTi7pu
a+wc3p9gwTC626GJD9dFVZs9YPqXI/Q2BKndQ6vrct9eaVnxWiJ45B5yJYLCUppE
AHAChrDZ8ErXQjV1jeHtkIZY1mSa5Q7slfn8KFLzY5gIWA/3ds2pmIDK4r2wqQUR
xIn+kDaaprsu8vhffWtHKIj5n1zRkLwPNZZ/2rZbQ7NJHt0yqu4Ld/L6myC6cQVj
cX1Z1Wg++T3rZHxChx69w/jYuCY2EeM5JHXEbQqUG1tnfkpInZqvyjz1FUs9iYOY
NOv9hcod9jIRjGvMooKn7MDhVy3yCtyrR590tKVP4qj3Wb+wKKJtPECjMfCo6I0H
nVIiQPMZAbbo9UVLVA9UickSxKUgNCvhANUcnYWEQp+DM8wYxbV35PT8lomNz6Cj
3mKo/uqQPfGn3yUR1GIQagYSxHZJ1NTv50lPQOV0tZtfP2Gk/H4uT2w6ZFJKiJIV
KTiE5vFW9VhbVye/VcHWyJ3+mB5SgClaA+t87zGpgJiPjfVWo2bnB6JkwZodEWi9
V8c53Hq+NG78QnFYZR5WNDYQ96LYWfurhcuqol8Sr836Y9LaOCGEAIsPMvG7esB2
y/GHRWdhkjOf5uNHduGoOVzlH95qW9mBtdFacVpaM/eocb1Zghk=
=dn2p
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
- Add the required PCI IDs so that the generic SMN accesses provided by
amd_nb.c work for drivers which switch to them. Add a PCI device ID
to k10temp's table so that latter is loaded on such systems too
* tag 'x86_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hwmon: (k10temp) Add PCI ID for family 19, model 78h
x86/amd_nb: Add PCI ID for family 19h model 78h
SYM_FUNC_START_LOCAL_NOALIGN() adds an endbr leading to this layout
(leaving only the last 2 bytes of the address):
3bff <zen_untrain_ret>:
3bff: f3 0f 1e fa endbr64
3c03: f6 test $0xcc,%bl
3c04 <__x86_return_thunk>:
3c04: c3 ret
3c05: cc int3
3c06: 0f ae e8 lfence
However, "the RET at __x86_return_thunk must be on a 64 byte boundary,
for alignment within the BTB."
Use SYM_START instead.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of decoding each instruction in the return sites range only to
realize that that return site is a jump to the default return thunk
which is needed - X86_FEATURE_RETHUNK is enabled - lift that check
before the loop and get rid of that loop overhead.
Add comments about what gets patched, while at it.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230512120952.7924-1-bp@alien8.de
Because:
SMP alternatives: ffffffff810026dc: [2:44) optimized NOPs: eb 2a eb 28 cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc
is quite daft, make things more complicated and have the NOP runlength
detection eat the preceding JMP if they both end at the same target.
SMP alternatives: ffffffff810026dc: [0:44) optimized NOPs: eb 2a cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230208171431.433132442@infradead.org
Address two issues:
- it no longer hard requires single byte NOP runs - now it accepts any
NOP and NOPL encoded instruction (but not the more complicated 32bit
NOPs).
- it writes a single 'instruction' replacement.
Specifically, ORC unwinder relies on the tail NOP of an alternative to
be a single instruction. In particular, it relies on the inner bytes not
being executed.
Once the max supported NOP length has been reached (currently 8, could easily
be extended to 11 on x86_64), switch to JMP.d8 and INT3 padding to
achieve the same result.
Objtool uses this guarantee in the analysis of alternative/overlapping
CFI state for the ORC unwinder data. Every instruction edge gets a CFI
state and the more instructions the larger the chance of conflicts.
[ bp:
- Add a comment over add_nop() to explain why it does it this way
- Make add_nops() PARAVIRT only as it is used solely there now ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230208171431.373412974@infradead.org
Since commit ee6d3dd4ed ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definition to prevent
modification at runtime.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Up until now it was perceived that FSRM is an improvement to ERMS and
thus it was made dependent on latter.
However, there are AMD BIOSes out there which allow for disabling of
either features and thus preventing kernels from booting due to the CMP
disappearing and thus breaking the logic in the memmove() function.
Similar observation happens on some VM migration scenarios.
Patch the proper sequences depending on which feature is enabled.
Reported-by: Daniel Verkamp <dverkamp@chromium.org>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/Y/yK0dyzI0MMdTie@zn.tnic
A little while ago someone (Kirill) ran into the whole 'alternatives don't
do relocations nonsense' again and I got annoyed enough to actually look
at the code.
Since the whole alternative machinery already fully decodes the
instructions it is simple enough to adjust immediates and displacement
when needed. Specifically, the immediates for IP modifying instructions
(JMP, CALL, Jcc) and the displacement for RIP-relative instructions.
[ bp: Massage comment some more and get rid of third loop in
apply_relocation(). ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230208171431.313857925@infradead.org
Using debug-alternative generates a *LOT* of output, extend it a bit
to select which of the many rewrites it reports on.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230208171431.253636689@infradead.org
The physical address width calculation in mtrr_bp_init() can easily be
replaced with using the already available value x86_phys_bits from
struct cpuinfo_x86.
The same information source can be used in mtrr/cleanup.c, removing the
need to pass that value on to mtrr_cleanup().
In print_mtrr_state() use x86_phys_bits instead of recalculating it
from size_or_mask.
Move setting of size_or_mask and size_and_mask into a dedicated new
function in mtrr/generic.c, enabling to make those 2 variables static,
as they are used in generic.c only now.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-2-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Normally __swp_entry_to_pte() is never called with a value translating
to a valid PTE. The only known exception is pte_swap_tests(), resulting
in a WARN splat in Xen PV guests, as __pte_to_swp_entry() did
translate the PFN of the valid PTE to a guest local PFN, while
__swp_entry_to_pte() doesn't do the opposite translation.
Fix that by using __pte() in __swp_entry_to_pte() instead of open
coding the native variant of it.
For correctness do the similar conversion for __swp_entry_to_pmd().
Fixes: 05289402d7 ("mm/debug_vm_pgtable: add tests validating arch helpers for core MM features")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230306123259.12461-1-jgross@suse.com
A SEV-ES guest is active on AMD when CC_ATTR_GUEST_STATE_ENCRYPT is set.
I.e., MSR_AMD64_SEV, bit 1, SEV_ES_Enabled. So no need for a special
static key.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230328201712.25852-3-bp@alien8.de
Those will be used in code regions where instrumentation is not allowed
so mark them as such.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230328201712.25852-2-bp@alien8.de
Commit
310e782a99 ("platform/x86/amd: pmc: Utilize SMN index 0 for driver probe")
switched to using amd_smn_read() which relies upon the misc PCI ID used
by DF function 3 being included in a table. The ID for model 78h is
missing in that table, so amd_smn_read() doesn't work.
Add the missing ID into amd_nb, restoring s2idle on this system.
[ bp: Simplify commit message. ]
Fixes: 310e782a99 ("platform/x86/amd: pmc: Utilize SMN index 0 for driver probe")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci_ids.h
Acked-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20230427053338.16653-2-mario.limonciello@amd.com
Intel Meteor Lake hybrid processors have cores in two separate dies. The
cores in one of the dies have higher maximum frequency. Use the SD_ASYM_
PACKING flag to give higher priority to the die with CPUs of higher maximum
frequency.
Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230406203148.19182-13-ricardo.neri-calderon@linux.intel.com
X86 does not have the SD_ASYM_PACKING flag in the SMT domain. The scheduler
knows how to handle SMT and non-SMT cores of different priority. There is
no reason for SMT siblings of a core to have different priorities.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-12-ricardo.neri-calderon@linux.intel.com
There is no difference between any of the SMT siblings of a physical core.
Do not do asym_packing load balancing at this level.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-11-ricardo.neri-calderon@linux.intel.com
Define bit macros for FixCntrCtl MSR and replace the bit hardcoding
with these bit macros. This would make code be more human-readable.
Perf commands 'perf stat -e "instructions,cycles,ref-cycles"' and
'perf record -e "instructions,cycles,ref-cycles"' pass.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230504072128.3653470-1-dapeng1.mi@linux.intel.com
Although, IBS pmus can be invoked via their own interface, indirect
IBS invocation via core pmu events is also supported with fixed set
of events: cpu-cycles:p, r076:p (same as cpu-cycles:p) and r0C1:p
(micro-ops) for user convenience.
This indirect IBS invocation is broken since commit 66d258c5b0
("perf/core: Optimize perf_init_event()"), which added RAW pmu under
'pmu_idr' list and thus if event_init() fails with RAW pmu, it started
returning error instead of trying other pmus.
Forward precise events from core pmu to IBS by overwriting 'type' and
'config' in the kernel copy of perf_event_attr. Overwriting will cause
perf_init_event() to retry with updated 'type' and 'config', which will
automatically forward event to IBS pmu.
Without patch:
$ sudo ./perf record -C 0 -e r076:p -- sleep 1
Error:
The r076:p event is not supported.
With patch:
$ sudo ./perf record -C 0 -e r076:p -- sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.341 MB perf.data (37 samples) ]
Fixes: 66d258c5b0 ("perf/core: Optimize perf_init_event()")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230504110003.2548-3-ravi.bangoria@amd.com
Several similar kernel warnings can be triggered,
[56605.607840] CPU0 PEBS record size 0, expected 32, config 0 cpuc->record_size=208
when the below commands are running in parallel for a while on SPR.
while true;
do
perf record --no-buildid -a --intr-regs=AX \
-e cpu/event=0xd0,umask=0x81/pp \
-c 10003 -o /dev/null ./triad;
done &
while true;
do
perf record -o /tmp/out -W -d \
-e '{ld_blocks.store_forward:period=1000000, \
MEM_TRANS_RETIRED.LOAD_LATENCY:u:precise=2:ldlat=4}' \
-c 1037 ./triad;
done
The triad program is just the generation of loads/stores.
The warnings are triggered when an unexpected PEBS record (with a
different config and size) is found.
A system-wide PEBS event with the large PEBS config may be enabled
during a context switch. Some PEBS records for the system-wide PEBS
may be generated while the old task is sched out but the new one
hasn't been sched in yet. When the new task is sched in, the
cpuc->pebs_record_size may be updated for the per-task PEBS events. So
the existing system-wide PEBS records have a different size from the
later PEBS records.
The PEBS buffer should be flushed right before the hardware is
reprogrammed. The new size and threshold should be updated after the
old buffer has been flushed.
Reported-by: Stephane Eranian <eranian@google.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230421184529.3320912-1-kan.liang@linux.intel.com
It missed to convert a PERF_SAMPLE_BRANCH_STACK user to call the new
perf_sample_save_brstack() helper in order to update the dyn_size.
This affects AMD Zen3 machines with the branch-brs event.
Fixes: eb55b455ef ("perf/core: Add perf_sample_save_brstack() helper")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20230427030527.580841-1-namhyung@kernel.org
- Introduce local{,64}_try_cmpxchg() - a slightly more optimal
primitive, which will be used in perf events ring-buffer code.
- Simplify/modify rwsems on PREEMPT_RT, to address writer starvation.
- Misc cleanups/fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRUvUoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hlIhAArP33rTKi+HAndQ3UHW3XtmHRxEEQTfiE
wvIoN89h58QW4DGMeAV4ltafbIPQAkI233Aogwz903L0qbDV0Ro4OU3XJembRuWl
LeOADKwYyypXdOa8XICuY9aIP7e1/h0DF3ySs7inLcwK9JCyAIxnsVHYej+hsRXA
kZoXN98T3TR1C0V9UQy4SU3HI1lC3tsG3R9Ti9TnYUg3ygVXhRE9lOQ4kv9lFPVz
BNuj2Blj7KNiVaY9kehrhO54THI7NmsCVZO44Rcl48I0KAcFulAmFcNlE7GnR8Nj
thj38pU6XAFVHXG8MYjgE+Al+PnK48NtJxexCtHyGvGG4D2aLzRMnkolxAUCcVuK
G+UBsQm3ybjYgHgt1zuN6ehcpT+5tULkDH8JA7vrgZYaVgxHzsUaHgYfCCWKnmUY
mPR6aImEmYZwZVNLskhe0HT4mq244bp+VnWlnJ6LZK7t/itenvDhqnj7KTi4Bfej
lTHplOTitV/8uCEW8V4pX+YTEenVsIQmTc/G3iIabXP/6HzLffA3q4vyW6vKIErE
pqrpuFA0Z4GB+pU0mJXt7+I7zscDVthwI055jDyQBjA7IcdVGm2MjQ6xcNRW5FYN
UynvaEMocue4ZO4WdFsd1ZBUd9VfoNzGQspBw46DhCL1MEQBYv36SKQNjej/9aRr
ilVwqnOWI2s=
=mM0A
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2023-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Introduce local{,64}_try_cmpxchg() - a slightly more optimal
primitive, which will be used in perf events ring-buffer code
- Simplify/modify rwsems on PREEMPT_RT, to address writer starvation
- Misc cleanups/fixes
* tag 'locking-core-2023-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/atomic: Correct (cmp)xchg() instrumentation
locking/x86: Define arch_try_cmpxchg_local()
locking/arch: Wire up local_try_cmpxchg()
locking/generic: Wire up local{,64}_try_cmpxchg()
locking/atomic: Add generic try_cmpxchg{,64}_local() support
locking/rwbase: Mitigate indefinite writer starvation
locking/arch: Rename all internal __xchg() names to __arch_xchg()
Merge my x86 uaccess updates branch.
The LAM ("Linear Address Masking") updates in this release made me
unhappy about how "access_ok()" was done, and it actually turned out to
have a couple of small bugs in it too. This is my cleanup of the code:
- use the sign bit of the __user pointer rather than masking the
address and checking it against the TASK_SIZE range.
We already did this part for the get/put_user() side, but
'access_ok()' did the naïve "mask and range check" thing, which not
only generates nasty code, but also ended up meaning that __access_ok
itself didn't do a good job, and so copy_from_user_nmi() didn't get
the check right.
- move all the code that is 64-bit only into the 64-bit version of the
header file, so that we don't unnecessarily pollute the shared x86
code and make it look like LAM might work in 32-bit too.
- fix a bug in the address masking (that doesn't end up mattering: in
this case the fix was to just remove the buggy code entirely).
- a couple of trivial cleanups and added commentary about the
access_ok() rules.
* x86-uaccess-cleanup:
x86-64: mm: clarify the 'positive addresses' user address rules
x86: mm: remove 'sign' games from LAM untagged_addr*() macros
x86: uaccess: move 32-bit and 64-bit parts into proper <asm/uaccess_N.h> header
x86: mm: remove architecture-specific 'access_ok()' define
x86-64: make access_ok() independent of LAM
result in the root being freed even though the root is completely valid and
can be reused as-is (with a TLB flush).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRP/3ESHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5J7kQAIg6v9UzM/qp7/d6C4laZLTWC2YlGhiI
1ZrfLU3/gQPYnnxv8GzLZ1CaXDhku2IIdyl2AQe8sUEmold45EapAW32rw2127j1
z4jW8x8dKYXUd1HGe823O0Rm+Ls6bGcXmHj8LaBCBIV6loBINeNfLXNllsO/yIcR
PmagzEqkNsMW3mvutdqb9mFP8p+mBzQu5qHlMEUb4WOXBmL06teHjR3qo7hi9Kl0
nM0ZvuvCLGvufoI0RESiq7mXPKBz3yvhFbkjrUgBKQ/rij2PMO8iyULsLfGY1iAI
m60aBfQPLJIH0NgvNHXkQOW59COYaY+I8udZqZZNr2uVb5A8J+/rQFSG/BP1Ccsw
mtJgZRD5WdplcAjYlZCcEgBmwjznjSOFGYaOrAp02dJlbPw2/Tjaj1GHMvMjEIME
KLvWTsN6xB9K0OhiXFvo1N4FCJbfi+PJPK0qVG7UttPnziCwYqAeIhGk4Kj6SHsX
P23HnDO8U/rCwRG2tuyZmbllpUXsX0q08wyGlp1UcKAbtD8PPGPyz8+I7YakKI97
RddIAh2qh5hwHON1xe35VSQ8X0OPOK1UnkiGTuBDdfldzxXK7OCfKKVQ6hsnpV6e
0a6nQc2Ni7/f5jThPo2cTaKz389ZpVE2j1DaTT8QXq5JuBcTzrNI6HImcJwPFTWP
+kUxewuRaaog
=pzJT
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.4-2' of https://github.com/kvm-x86/linux into HEAD
Fix a long-standing flaw in x86's TDP MMU where unloading roots on a vCPU can
result in the root being freed even though the root is completely valid and
can be reused as-is (with a TLB flush).
- Make stub data pages configurable
- Make it harder to mix user and kernel code by accident
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmRSslcWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wVbGEADWL/4rcVNqaD/b9V3bUXDkNhpG
81Jc9fXcx3OXV87GsxfadiQI05UaOhTAMWJlsdKUF0/IW5tSQSZc0QcDNxTs6aS1
NDkWvjk8s8OpbqdmV0cNKFmaeginrH4AuASajvNleU0BxX4ziqw6BPadih+wGVGz
iVEd/M5YH2yet33pvD7Wzh3KZOPlo2OOKBqXzE6Snc3yoOiIfN95U8Cl23nxzMI4
F235LS5m+4+GWIMkYs6+hlWEa8SxSnsH/beQJxAybtetSHRxsDXOHlEpDI9olxYI
+OFNqsFLDyhyjib2Snwt8HMES9VmMfUUwT7ZK1cChXN8Quix3lXAyz6rpocmdnL/
X1tzqWb1bz+yqrVRjtN3AGbcav+sQp4UWXsoExVsiz3lzaXfrbPvRZFucH2+SStK
Wadyy2v2rHZKNq9sDLUBf5GJ0C4LrBAyT15ADim2L5/K7pZyI7AKSPcdU0Xc8m5y
kMswml7FLaI+gsuKOBl7jySaoNDiVZFvQjsxd1+hn5cBNOcyBSt2rSErDw3geJgn
pFzVkLtklfxMpHMm+bwXONBZMrXOBPNgo0gSkVrCVzO68x6j/MaBGLFazSqOKxbc
UcVULhiomjIae4AcTomzOQx7oM1Xs3XuPo0x2RgJ3gMlCCZKoDNuqDgB3ZmPP/gq
cbbHHveXEx7aMOPz0A==
=Qb+I
-----END PGP SIGNATURE-----
Merge tag 'uml-for-linus-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux
Pull uml updates from Richard Weinberger:
- Make stub data pages configurable
- Make it harder to mix user and kernel code by accident
* tag 'uml-for-linus-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
um: make stub data pages size tweakable
um: prevent user code in modules
um: further clean up user_syms
um: don't export printf()
um: hostfs: define our own API boundary
um: add __weak for exported functions
Dave Hansen found the "(long) addr >= 0" code in the x86-64 access_ok
checks somewhat confusing, and suggested using a helper to clarify what
the code is doing.
So this does exactly that: clarifying what the sign bit check is all
about, by adding a helper macro that makes it clear what it is testing.
This also adds some explicit comments talking about how even with LAM
enabled, any addresses with the sign bit will still GP-fault in the
non-canonical region just above the sign bit.
This is all what allows us to do the user address checks with just the
sign bit, and furthermore be a bit cavalier about accesses that might be
done with an additional offset even past that point.
(And yes, this talks about 'positive' even though zero is also a valid
user address and so technically we should call them 'non-negative'. But
I don't think using 'non-negative' ends up being more understandable).
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The intent of the sign games was to not modify kernel addresses when
untagging them. However, that had two issues:
(a) it didn't actually work as intended, since the mask was calculated
as 'addr >> 63' on an _unsigned_ address. So instead of getting a
mask of all ones for kernel addresses, you just got '1'.
(b) untagging a kernel address isn't actually a valid operation anyway.
Now, (a) had originally been true for both 'untagged_addr()' and the
remote version of it, but had accidentally been fixed for the regular
version of untagged_addr() by commit e0bddc19ba ("x86/mm: Reduce
untagged_addr() overhead for systems without LAM"). That one rewrote
the shift to be part of the alternative asm code, and in the process
changed the unsigned shift into a signed 'sar' instruction.
And while it is true that we don't want to turn what looks like a kernel
address into a user address by masking off the high bit, that doesn't
need these sign masking games - all it needs is that the mm context
'untag_mask' value has the high bit set.
Which it always does.
So simplify the code by just removing the superfluous (and in the case
of untagged_addr_remote(), still buggy) sign bit games in the address
masking.
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The x86 <asm/uaccess.h> file has grown features that are specific to
x86-64 like LAM support and the related access_ok() changes. They
really should be in the <asm/uaccess_64.h> file and not pollute the
generic x86 header.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's already a generic definition of 'access_ok()' in the
asm-generic/access_ok.h header file, and the only difference bwteen that
and the x86-specific one is the added check for WARN_ON_IN_IRQ().
And it turns out that the reason for that check is long gone: it used to
use a "user_addr_max()" inline function that depended on the current
thread, and caused problems in non-thread contexts.
For details, see commits 7c4788950b ("x86/uaccess, sched/preempt:
Verify access_ok() context") and in particular commit ae31fe51a3
("perf/x86: Restore TASK_SIZE check on frame pointer") about how and why
this came to be.
But that "current task" issue was removed in the big set_fs() removal by
Christoph Hellwig in commit 47058bb54b ("x86: remove address space
overrides using set_fs()").
So the reason for the test and the architecture-specific access_ok()
define no longer exists, and is actually harmful these days. For
example, it led various 'copy_from_user_nmi()' games (eg using
__range_not_ok() instead, and then later converted to __access_ok() when
that became ok).
And that in turn meant that LAM was broken for the frame following
before this series, because __access_ok() used to not do the address
untagging.
Accessing user state still needs care in many contexts, but access_ok()
is not the place for this test.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org>
The linear address masking (LAM) code made access_ok() more complicated,
in that it now needs to untag the address in order to verify the access
range. See commit 74c228d20a ("x86/uaccess: Provide untagged_addr()
and remove tags before address check").
We were able to avoid that overhead in the get_user/put_user code paths
by simply using the sign bit for the address check, and depending on the
GP fault if the address was non-canonical, which made it all independent
of LAM.
And we can do the same thing for access_ok(): simply check that the user
pointer range has the high bit clear. No need to bother with any
address bit masking.
In fact, we can go a bit further, and just check the starting address
for known small accesses ranges: any accesses that overflow will still
be in the non-canonical area and will still GP fault.
To still make syzkaller catch any potentially unchecked user addresses,
we'll continue to warn about GP faults that are caused by accesses in
the non-canonical range. But we'll limit that to purely "high bit set
and past the one-page 'slop' area".
We could probably just do that "check only starting address" for any
arbitrary range size: realistically all kernel accesses to user space
will be done starting at the low address. But let's leave that kind of
optimization for later. As it is, this already allows us to generate
simpler code and not worry about any tag bits in the address.
The one thing to look out for is the GUP address check: instead of
actually copying data in the virtual address range (and thus bad
addresses being caught by the GP fault), GUP will look up the page
tables manually. As a result, the page table limits need to be checked,
and that was previously implicitly done by the access_ok().
With the relaxed access_ok() check, we need to just do an explicit check
for TASK_SIZE_MAX in the GUP code instead. The GUP code already needs
to do the tag bit unmasking anyway, so there this is all very
straightforward, and there are no LAM issues.
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* More phys_to_virt conversions
* Improvement of AP management for VSIE (nested virtualization)
ARM64:
* Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
* New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features
being moved to VMMs rather than be implemented in the kernel.
* Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one.
This last part allows the NV timer code to be implemented on
top.
* A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
* The usual selftest fixes and improvements.
KVM x86 changes for 6.4:
* Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
and by giving the guest control of CR0.WP when EPT is enabled on VMX
(VMX-only because SVM doesn't support per-bit controls)
* Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
as a bool
* Move AMD_PSFD to cpufeatures.h and purge KVM's definition
* Avoid unnecessary writes+flushes when the guest is only adding new PTEs
* Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations
when emulating invalidations
* Clean up the range-based flushing APIs
* Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
changed SPTE" overhead associated with writing the entire entry
* Track the number of "tail" entries in a pte_list_desc to avoid having
to walk (potentially) all descriptors during insertion and deletion,
which gets quite expensive if the guest is spamming fork()
* Disallow virtualizing legacy LBRs if architectural LBRs are available,
the two are mutually exclusive in hardware
* Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
after KVM_RUN, similar to CPUID features
* Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES
* Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
x86 AMD:
* Add support for virtual NMIs
* Fixes for edge cases related to virtual interrupts
x86 Intel:
* Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
not being reported due to userspace not opting in via prctl()
* Fix a bug in emulation of ENCLS in compatibility mode
* Allow emulation of NOP and PAUSE for L2
* AMX selftests improvements
* Misc cleanups
MIPS:
* Constify MIPS's internal callbacks (a leftover from the hardware enabling
rework that landed in 6.3)
Generic:
* Drop unnecessary casts from "void *" throughout kvm_main.c
* Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct
size by 8 bytes on 64-bit kernels by utilizing a padding hole
Documentation:
* Fix goof introduced by the conversion to rST
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRNExkUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNyjwf+MkzDael9y9AsOZoqhEZ5OsfQYJ32
Im5ZVYsPRU2K5TuoWql6meIihgclCj1iIU32qYHa2F1WYt2rZ72rJp+HoY8b+TaI
WvF0pvNtqQyg3iEKUBKPA4xQ6mj7RpQBw86qqiCHmlfNt0zxluEGEPxH8xrWcfhC
huDQ+NUOdU7fmJ3rqGitCvkUbCuZNkw3aNPR8dhU8RAWrwRzP2hBOmdxIeo81WWY
XMEpJSijbGpXL9CvM0Jz9nOuMJwZwCCBGxg1vSQq0xTfLySNMxzvWZC2GFaBjucb
j0UOQ7yE0drIZDVhd3sdNslubXXU6FcSEzacGQb9aigMUon3Tem9SHi7Kw==
=S2Hq
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"s390:
- More phys_to_virt conversions
- Improvement of AP management for VSIE (nested virtualization)
ARM64:
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features being
moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one. This
last part allows the NV timer code to be implemented on top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
x86:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
enabled, and by giving the guest control of CR0.WP when EPT is
enabled on VMX (VMX-only because SVM doesn't support per-bit
controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
return as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Avoid unnecessary writes+flushes when the guest is only adding new
PTEs
- Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
optimizations when emulating invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
single A/D bit using a LOCK AND instead of XCHG, and skip all of
the "handle changed SPTE" overhead associated with writing the
entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid
having to walk (potentially) all descriptors during insertion and
deletion, which gets quite expensive if the guest is spamming
fork()
- Disallow virtualizing legacy LBRs if architectural LBRs are
available, the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably
PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features
- Overhaul the vmx_pmu_caps selftest to better validate
PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- AMD SVM:
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
- Intel AMX:
- Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
XTILE_DATA is not being reported due to userspace not opting in
via prctl()
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- AMX selftests improvements
- Misc cleanups
MIPS:
- Constify MIPS's internal callbacks (a leftover from the hardware
enabling rework that landed in 6.3)
Generic:
- Drop unnecessary casts from "void *" throughout kvm_main.c
- Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
struct size by 8 bytes on 64-bit kernels by utilizing a padding
hole
Documentation:
- Fix goof introduced by the conversion to rST"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
KVM: s390: pci: fix virtual-physical confusion on module unload/load
KVM: s390: vsie: clarifications on setting the APCB
KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
KVM: selftests: Test the PMU event "Instructions retired"
KVM: selftests: Copy full counter values from guest in PMU event filter test
KVM: selftests: Use error codes to signal errors in PMU event filter test
KVM: selftests: Print detailed info in PMU event filter asserts
KVM: selftests: Add helpers for PMC asserts in PMU event filter test
KVM: selftests: Add a common helper for the PMU event filter guest code
KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
KVM: arm64: vhe: Drop extra isb() on guest exit
KVM: arm64: vhe: Synchronise with page table walker on MMU update
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
KVM: arm64: nvhe: Synchronise with page table walker on TLBI
KVM: arm64: Handle 32bit CNTPCTSS traps
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
KVM: arm64: vgic: Don't acquire its_lock before config_lock
KVM: selftests: Add test to verify KVM's supported XCR0
...
Including:
- Convert to platform remove callback returning void
- Extend changing default domain to normal group
- Intel VT-d updates:
- Remove VT-d virtual command interface and IOASID
- Allow the VT-d driver to support non-PRI IOPF
- Remove PASID supervisor request support
- Various small and misc cleanups
- ARM SMMU updates:
- Device-tree binding updates:
* Allow Qualcomm GPU SMMUs to accept relevant clock properties
* Document Qualcomm 8550 SoC as implementing an MMU-500
* Favour new "qcom,smmu-500" binding for Adreno SMMUs
- Fix S2CR quirk detection on non-architectural Qualcomm SMMU
implementations
- Acknowledge SMMUv3 PRI queue overflow when consuming events
- Document (in a comment) why ATS is disabled for bypass streams
- AMD IOMMU updates:
- 5-level page-table support
- NUMA awareness for memory allocations
- Unisoc driver: Support for reattaching an existing domain
- Rockchip driver: Add missing set_platform_dma_ops callback
- Mediatek driver: Adjust the dma-ranges
- Various other small fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmRONeAACgkQK/BELZcB
GuPmpw/8C9ruxQ0JU5rcDBXQGvos4gMmxlbELMrBpbbiTtdb35xchpKfdhnECGIF
k2SrrcF40R/S82SyzNU/eZtGKirtcXvGFraUFgu/QdCcnnqpRHs+IJMXX2NJP+it
+0wO1uiInt3CN1ERcR4F31cDKiWjDG8bvQVE5LIyiy4KrIU5ld2G91Fkaa0R13Au
6H+/wKkcUC6OyaGE6wPx474xBkapT20vj5AIQuAWisXJJR0wbBon1sUTo/IRKsU+
IkNxH0W+1PNImJ+crAdf/nkOlyqoChY4ww6cm07LrOsBLIsX5bCqXfL4HvKthElD
MEgk2SN5kfjfR5Vf29W4hZVM1CT8VbhO41I7OzaZ6X6RU2PXoldPKlgKtZGeSKn1
9bcMpSgB0BtbttvBevSkxTo5KHFozXS2DG3DFoMB3yFMme8Th0LrhBZ9oB7NIPNw
ntMo4K75vviC6Vvzjy4Anj/+y+Zm3W6wDDP7F12O6WZLkK5s4hrSsHUm/MQnnKQP
muJlG870RnSl73xUQZe3cuBxktXuJ3EHqqYIPE0npzvauu8hhWcis3opf2Y+U2s8
aBCCIgp5kTKqjHLh2e4lNCKZf1/b/dhxRcRBQhpAIb8YsjMlIJyM+G8Jz6K6gBga
5Ld+68UQ3oHJwoLV1HCFN8jbpQ9KZn1s9+h3yrYjRAcLNiFb3nU=
=OvTo
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Convert to platform remove callback returning void
- Extend changing default domain to normal group
- Intel VT-d updates:
- Remove VT-d virtual command interface and IOASID
- Allow the VT-d driver to support non-PRI IOPF
- Remove PASID supervisor request support
- Various small and misc cleanups
- ARM SMMU updates:
- Device-tree binding updates:
* Allow Qualcomm GPU SMMUs to accept relevant clock properties
* Document Qualcomm 8550 SoC as implementing an MMU-500
* Favour new "qcom,smmu-500" binding for Adreno SMMUs
- Fix S2CR quirk detection on non-architectural Qualcomm SMMU
implementations
- Acknowledge SMMUv3 PRI queue overflow when consuming events
- Document (in a comment) why ATS is disabled for bypass streams
- AMD IOMMU updates:
- 5-level page-table support
- NUMA awareness for memory allocations
- Unisoc driver: Support for reattaching an existing domain
- Rockchip driver: Add missing set_platform_dma_ops callback
- Mediatek driver: Adjust the dma-ranges
- Various other small fixes and cleanups
* tag 'iommu-updates-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (82 commits)
iommu: Remove iommu_group_get_by_id()
iommu: Make iommu_release_device() static
iommu/vt-d: Remove BUG_ON in dmar_insert_dev_scope()
iommu/vt-d: Remove a useless BUG_ON(dev->is_virtfn)
iommu/vt-d: Remove BUG_ON in map/unmap()
iommu/vt-d: Remove BUG_ON when domain->pgd is NULL
iommu/vt-d: Remove BUG_ON in handling iotlb cache invalidation
iommu/vt-d: Remove BUG_ON on checking valid pfn range
iommu/vt-d: Make size of operands same in bitwise operations
iommu/vt-d: Remove PASID supervisor request support
iommu/vt-d: Use non-privileged mode for all PASIDs
iommu/vt-d: Remove extern from function prototypes
iommu/vt-d: Do not use GFP_ATOMIC when not needed
iommu/vt-d: Remove unnecessary checks in iopf disabling path
iommu/vt-d: Move PRI handling to IOPF feature path
iommu/vt-d: Move pfsid and ats_qdep calculation to device probe path
iommu/vt-d: Move iopf code from SVA to IOPF enabling path
iommu/vt-d: Allow SVA with device-specific IOPF
dmaengine: idxd: Add enable/disable device IOPF feature
arm64: dts: mt8186: Add dma-ranges for the parent "soc" node
...
Implement target specific support for local_try_cmpxchg()
and local_cmpxchg() using typed C wrappers that call their
_local counterpart and provide additional checking of
their input arguments.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230405141710.3551-4-ubizjak@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
- Remove diagnostics and adjust config for CSD lock diagnostics
- Add a generic IPI-sending tracepoint, as currently there's no easy
way to instrument IPI origins: it's arch dependent and for some
major architectures it's not even consistently available.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK438RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jJ5Q/5AZ0HGpyqwdFK8GmGznyu5qjP5HwV9pPq
gZQScqSy4tZEeza4TFMi83CoXSg9uJ7GlYJqqQMKm78LGEPomnZtXXC7oWvTA9M5
M/jAvzytmvZloSCXV6kK7jzSejMHhag97J/BjTYhZYQpJ9T+hNC87XO6J6COsKr9
lPIYqkFrIkQNr6B0U11AQfFejRYP1ics2fnbnZL86G/zZAc6x8EveM3KgSer2iHl
KbrO+xcYyGY8Ef9P2F72HhEGFfM3WslpT1yzqR3sm4Y+fuMG0oW3qOQuMJx0ZhxT
AloterY0uo6gJwI0P9k/K4klWgz81Tf/zLb0eBAtY2uJV9Fo3YhPHuZC7jGPGAy3
JusW2yNYqc8erHVEMAKDUsl/1KN4TE2uKlkZy98wno+KOoMufK5MA2e2kPPqXvUi
Jk9RvFolnWUsexaPmCftti0OCv3YFiviVAJ/t0pchfmvvJA2da0VC9hzmEXpLJVF
25nBTV/1uAOrWvOpCyo3ElrC2CkQVkFmK5rXMDdvf6ib0Nid4vFcCkCSLVfu+ePB
11mi7QYro+CcnOug1K+yKogUDmsZgV/u1kUwgQzTIpZ05Kkb49gUiXw9L2RGcBJh
yoDoiI66KPR7PWQ2qBdQoXug4zfEEtWG0O9HNLB0FFRC3hu7I+HHyiUkBWs9jasK
PA5+V7HcQRk=
=Wp7f
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP cross-CPU function-call updates from Ingo Molnar:
- Remove diagnostics and adjust config for CSD lock diagnostics
- Add a generic IPI-sending tracepoint, as currently there's no easy
way to instrument IPI origins: it's arch dependent and for some major
architectures it's not even consistently available.
* tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
trace,smp: Trace all smp_function_call*() invocations
trace: Add trace_ipi_send_cpu()
sched, smp: Trace smp callback causing an IPI
smp: reword smp call IPI comment
treewide: Trace IPIs sent via smp_send_reschedule()
irq_work: Trace self-IPIs sent via arch_irq_work_raise()
smp: Trace IPIs sent via arch_send_call_function_ipi_mask()
sched, smp: Trace IPIs sent via send_call_function_single_ipi()
trace: Add trace_ipi_send_cpumask()
kernel/smp: Make csdlock_debug= resettable
locking/csd_lock: Remove per-CPU data indirection from CSD lock debugging
locking/csd_lock: Remove added data from CSD lock debugging
locking/csd_lock: Add Kconfig option for csd_debug default
- Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did
this inconsistently follow this new, common convention, and fix all the fallout
that objtool can now detect statically.
- Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity,
split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it.
- Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code.
- Generate ORC data for __pfx code
- Add more __noreturn annotations to various kernel startup/shutdown/panic functions.
- Misc improvements & fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK1x0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1ghxQ/+IkCynMYtdF5OG9YwbcGJqsPSfOPMEcEM
pUSFYg+gGPBDT/fJfcVSqvUtdnWbLC2kXt9yiswXz3X3J2nmNkBk5YKQftsNDcul
TmKeqIIAK51XTncpegKH0EGnOX63oZ9Vxa8CTPdDlb+YF23Km2FoudGRI9F5qbUd
LoraXqGYeiaeySkGyWmZVl6Uc8dIxnMkTN3H/oI9aB6TOrsi059hAtFcSaFfyemP
c4LqXXCH7k2baiQt+qaLZ8cuZVG/+K5r2N2cmjO5kmJc6ynIaFnfMe4XxZLjp5LT
/PulYI15bXkvSARKx5CRh/CDHMOx5Blw+ASO0RhWbdy0WH4ZhhcaVF5AeIpPW86a
1LBcz97rMp72WmvKgrJeVO1r9+ll4SI6/YKGJRsxsCMdP3hgFpqntXyVjTFNdTM1
0gH6H5v55x06vJHvhtTk8SR3PfMTEM2fRU5jXEOrGowoGifx+wNUwORiwj6LE3KQ
SKUdT19RNzoW3VkFxhgk65ThK1S7YsJUKRoac3YdhttpqqqtFV//erenrZoR4k/p
vzvKy68EQ7RCNyD5wNWNFe0YjeJl5G8gQ8bUm4Xmab7djjgz+pn4WpQB8yYKJLAo
x9dqQ+6eUbw3Hcgk6qQ9E+r/svbulnAL0AeALAWK/91DwnZ2mCzKroFkLN7napKi
fRho4CqzrtM=
=NwEV
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
- Mark arch_cpu_idle_dead() __noreturn, make all architectures &
drivers that did this inconsistently follow this new, common
convention, and fix all the fallout that objtool can now detect
statically
- Fix/improve the ORC unwinder becoming unreliable due to
UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK
and UNWIND_HINT_UNDEFINED to resolve it
- Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code
- Generate ORC data for __pfx code
- Add more __noreturn annotations to various kernel startup/shutdown
and panic functions
- Misc improvements & fixes
* tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
x86/hyperv: Mark hv_ghcb_terminate() as noreturn
scsi: message: fusion: Mark mpt_halt_firmware() __noreturn
x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
btrfs: Mark btrfs_assertfail() __noreturn
objtool: Include weak functions in global_noreturns check
cpu: Mark nmi_panic_self_stop() __noreturn
cpu: Mark panic_smp_self_stop() __noreturn
arm64/cpu: Mark cpu_park_loop() and friends __noreturn
x86/head: Mark *_start_kernel() __noreturn
init: Mark start_kernel() __noreturn
init: Mark [arch_call_]rest_init() __noreturn
objtool: Generate ORC data for __pfx code
x86/linkage: Fix padding for typed functions
objtool: Separate prefix code from stack validation code
objtool: Remove superfluous dead_end_function() check
objtool: Add symbol iteration helpers
objtool: Add WARN_INSN()
scripts/objdump-func: Support multiple functions
context_tracking: Fix KCSAN noinstr violation
objtool: Add stackleak instrumentation to uaccess safe list
...
to ARM's Top Byte Ignore and allows userspace to store metadata in some
bits of pointers without masking it out before use.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRK/WIACgkQaDWVMHDJ
krAL+RAAw33EhsWyYVkeAtYmYBKkGvlgeSDULtfJKe5bynJBTHkGKfM6RE9MSJIt
5fHWaConGh8HNpy0Us1sDvd/aWcWRm5h7ZcCVD+R4qrgh/vc7ULzM+elXe5jzr4W
cyuTckF2eW6SVrYg6fH5q+6Uy/moDtrdkLRvwRBf+AYeepB8gvSSH5XixKDNiVBE
pjNy1xXVZQokqD4tjsFelmLttyacR5OabiE/aeVNoFYf9yTwfnN8N3T6kwuOoS4l
Lp6NA+/0ux+oBlR+Is+JJG8Mxrjvz96yJGZYdR2YP5k3bMQtHAAjuq2w+GgqZm5i
j3/E6KQepEGaCfC+bHl68xy/kKx8ik+jMCEcBalCC25J3uxbLz41g6K3aI890wJn
+5ZtfcmoDUk9pnUyLxR8t+UjOSBFAcRSUE+FTjUH1qEGsMPK++9a4iLXz5vYVK1+
+YCt1u5LNJbkDxE8xVX3F5jkXh0G01SJsuUVAOqHSNfqSNmohFK8/omqhVRrRqoK
A7cYLtnOGiUXLnvjrwSxPNOzRrG+GAwqaw8gwOTaYogETWbTY8qsSCEVl204uYwd
m8io9rk2ZXUdDuha56xpBbPE0JHL9hJ2eKCuPkfvRgJT9YFyTh+e0UdX20k+nDjc
ang1S350o/Y0sus6rij1qS8AuxJIjHucG0GdgpZk3KUbcxoRLhI=
=qitk
-----END PGP SIGNATURE-----
Merge tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 LAM (Linear Address Masking) support from Dave Hansen:
"Add support for the new Linear Address Masking CPU feature.
This is similar to ARM's Top Byte Ignore and allows userspace to store
metadata in some bits of pointers without masking it out before use"
* tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside
x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
selftests/x86/lam: Add test cases for LAM vs thread creation
selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking
selftests/x86/lam: Add inherit test cases for linear-address masking
selftests/x86/lam: Add io_uring test cases for linear-address masking
selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
mm: Expose untagging mask in /proc/$PID/status
x86/mm: Provide arch_prctl() interface for LAM
x86/mm: Reduce untagged_addr() overhead for systems without LAM
x86/uaccess: Provide untagged_addr() and remove tags before address check
mm: Introduce untagged_addr_remote()
x86/mm: Handle LAM on context switch
x86: CPUID and CR3/CR4 flags for Linear Address Masking
x86: Allow atomic MM_CONTEXT flags setting
x86/mm: Rework address range check in get_user() and put_user()
assembly macro argument rather than a runtime register.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRKvncACgkQaDWVMHDJ
krDQsA/+OlDCexITqWgmd3rPN2ZUuyB+aV4MfoKppQuAuWD4I7vxD5qeqjRS2XTh
E6SSzp43zEVhVo6Kv3UvPR/Tr9edUGn2KzIWmqd1bOwhgbEfd898gzbWuRmK6i8t
qqweR1RMAL/COgPAlcrdpTLl2PCc9tLYpDnQ8WcAUqH4uoePpQyN3Za0J/dcKX7l
8XexOAaco4Wz3ylD9npPcLo9ytvohg+exJtCNldN1l2j5xXdA2fTqEJYaUMp/+Nd
Z1TTQ43QcT7dRknFojxdYfAkCqBfr8ccBAwV1mriahKWY/3xl35BqSeJVlma1tkm
UzkTY1CFwKYRk24C/oQK7OQMYnyJ7Q1RhSrd91lQWVjaTcI/3DPUKiKKdwFXDv4C
FUYvuJkanPVk3PyCZRvltdNvsXsifzx0RKZWLZ+3TQ2jtaMEDOzPgChq7a6WfpkQ
HQPuVoENHvyHdUycQhtELUsaJ3AdnOM87XiQDcbNNiaPiOLB9C8dhSWMKoPsMehO
oAiUQ7lW6po0lcELVSKib2ASVpXhOmlAxdRyZ50mhjrbpcxfBBGD3+KdFqZ4Gs1c
8UyrQbjVq07Lx2fvdizvDpIcr4M7z0xBAhJeIegC6z86XpJq5uvin+vOLzFAfe16
WGy6FiZtVpXp4fyqUY7GgQNqhk1b8h6EHKd9d/zCSPuH8/wT/6g=
=hvDm
-----END PGP SIGNATURE-----
Merge tag 'x86_tdx_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 tdx update from Dave Hansen:
"The original tdx hypercall assembly code took two flags in %RSI to
tweak its behavior at runtime. PeterZ recently axed one flag in commit
e80a48bade ("x86/tdx: Remove TDX_HCALL_ISSUE_STI").
Kill the other flag too and tweak the 'output' mode with an assembly
macro instead. This results in elimination of one push/pop pair and
overall easier to read assembly.
- Do conditional __tdx_hypercall() 'output' processing via an
assembly macro argument rather than a runtime register"
* tag 'x86_tdx_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tdx: Drop flags from __tdx_hypercall()
* Explicitly make some hardware constants part of the uabi
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRKkrkACgkQaDWVMHDJ
krDtDBAAhWbKRK1rJJsz2GuliF3/f/cZwcNxGG+QGrYBl2F2ilOrmVwNYME2TvHD
qQJHm8pU7vnDpnkZspqE0OoB6fbSa5qH3RfFhBFRziJFgN9mY0F0IJZeuH/EvJ/0
7gkRMA3Fs41EESbAWhUTakvC6u3L06SUpUH2W8ixAcawZu+g/FksDXxE+eVVPZaQ
Ztw17j6/m8W9bZ17HtyWK2vAepPlJhuXFPSAk7ox09ACwkqWAHO0/3RPcbc8HUZV
lDyYeDhRELG1pai14GhTixRcgkdn4nnnNDmn13xpuwkpOh7FeZL/SoDmXtJ71CrJ
I1YM1t9aB4ze2WDOo3mSKzU4efspGzAgIH26u19NQTmEp/9ppS+RaifXpt0r1yir
ygOXkgk8l2qZPxryyL9ROU6b9cnPzsP9k3mWTtNJiJrx0CL73lWkA5KORb/Ezdnj
kXAjTd4nUeCQJz+7PsnuvGqsT8/Dk1ugnHTu6Bn66U0hV0MNcx5G5m5HehDQBUmb
TllHGJSGt/1AXIfBZ1p7GSrgCaq3NTzWNmcFxHS3bpC/pyGwszmdDBIS/pODfBfp
0nG9cG8mte1KkhqjkSYTLtgarQEijs1NWrVnTUogg1kqtlvqZr8Zxun51YAW9Jt5
zCGoB6W7EWVfJZBMHmVX7a4g21650mgte3YoAAyAwMJFtZG14ng=
=GlmS
-----END PGP SIGNATURE-----
Merge tag 'x86_fpu_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Dave Hansen:
"There's no _actual_ kernel functionality here.
This expands the documentation around AMX support including some code
examples. The example code also exposed the fact that hardware
architecture constants as part of the ABI, but there's no easy place
that they get defined for apps. Adding them to a uabi header will
eventually make life easier for consumers of the ABI.
Summary:
- Improve AMX documentation along with example code
- Explicitly make some hardware constants part of the uabi"
* tag 'x86_fpu_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/x86: Explain the state component permission for guests
Documentation/x86: Add the AMX enabling example
x86/arch_prctl: Add AMX feature numbers as ABI constants
Documentation/x86: Explain the purpose for dynamic features
- updates to scripts/gdb from Glenn Washburn
- kexec cleanups from Bjorn Helgaas
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr+6wAKCRDdBJ7gKXxA
jn4NAP4u/hj/kR2dxYehcVLuQqJspCRZZBZlAReFJyHNQO6voAEAk0NN9rtG2+/E
r0G29CJhK+YL0W6mOs8O1yo9J1rZnAM=
=2CUV
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Mainly singleton patches all over the place.
Series of note are:
- updates to scripts/gdb from Glenn Washburn
- kexec cleanups from Bjorn Helgaas"
* tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits)
mailmap: add entries for Paul Mackerras
libgcc: add forward declarations for generic library routines
mailmap: add entry for Oleksandr
ocfs2: reduce ioctl stack usage
fs/proc: add Kthread flag to /proc/$pid/status
ia64: fix an addr to taddr in huge_pte_offset()
checkpatch: introduce proper bindings license check
epoll: rename global epmutex
scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry()
scripts/gdb: create linux/vfs.py for VFS related GDB helpers
uapi/linux/const.h: prefer ISO-friendly __typeof__
delayacct: track delays from IRQ/SOFTIRQ
scripts/gdb: timerlist: convert int chunks to str
scripts/gdb: print interrupts
scripts/gdb: raise error with reduced debugging information
scripts/gdb: add a Radix Tree Parser
lib/rbtree: use '+' instead of '|' for setting color.
proc/stat: remove arch_idle_time()
checkpatch: check for misuse of the link tags
checkpatch: allow Closes tags with links
...
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmRHJSgTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXjSOCAClsmFmyP320yAB74vQer5cSzxbIpFW
3qt/P3D8zABn0UxjjmD8+LTHuyB+72KANU6qQ9No6zdYs8yaA1vGX8j8UglWWHuj
fmaAD4DuZl+V+fmqDgHukgaPlhakmW0m5tJkR+TW3kCgnyrtvSWpXPoxUAe6CLvj
Kb/SPl6ylHRWlIAEZ51gy0Ipqxjvs5vR/h9CWpTmRMuZvxdWUro2Cm82wJgzXPqq
3eLbAzB29kLFEIIUpba9a/rif1yrWgVFlfpuENFZ+HUYuR78wrPB9evhwuPvhXd2
+f+Wk0IXORAJo8h7aaMMIr6bd4Lyn98GPgmS5YSe92HRIqjBvtYs3Dq8
=F6+n
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20230424' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- PCI passthrough for Hyper-V confidential VMs (Michael Kelley)
- Hyper-V VTL mode support (Saurabh Sengar)
- Move panic report initialization code earlier (Long Li)
- Various improvements and bug fixes (Dexuan Cui and Michael Kelley)
* tag 'hyperv-next-signed-20230424' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (22 commits)
PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg
Drivers: hv: move panic report code from vmbus to hv early init code
x86/hyperv: VTL support for Hyper-V
Drivers: hv: Kconfig: Add HYPERV_VTL_MODE
x86/hyperv: Make hv_get_nmi_reason public
x86/hyperv: Add VTL specific structs and hypercalls
x86/init: Make get/set_rtc_noop() public
x86/hyperv: Exclude lazy TLB mode CPUs from enlightened TLB flushes
x86/hyperv: Add callback filter to cpumask_to_vpset()
Drivers: hv: vmbus: Remove the per-CPU post_msg_page
clocksource: hyper-v: make sure Invariant-TSC is used if it is available
PCI: hv: Enable PCI pass-thru devices in Confidential VMs
Drivers: hv: Don't remap addresses that are above shared_gpa_boundary
hv_netvsc: Remove second mapping of send and recv buffers
Drivers: hv: vmbus: Remove second way of mapping ring buffers
Drivers: hv: vmbus: Remove second mapping of VMBus monitor pages
swiotlb: Remove bounce buffer remapping for Hyper-V
Driver: VMBus: Add Devicetree support
dt-bindings: bus: Add Hyper-V VMBus
Drivers: hv: vmbus: Convert acpi_device to more generic platform_device
...
The summary of the changes for this pull requests is:
* Song Liu's new struct module_memory replacement
* Nick Alcock's MODULE_LICENSE() removal for non-modules
* My cleanups and enhancements to reduce the areas where we vmalloc
module memory for duplicates, and the respective debug code which
proves the remaining vmalloc pressure comes from userspace.
Most of the changes have been in linux-next for quite some time except
the minor fixes I made to check if a module was already loaded
prior to allocating the final module memory with vmalloc and the
respective debug code it introduces to help clarify the issue. Although
the functional change is small it is rather safe as it can only *help*
reduce vmalloc space for duplicates and is confirmed to fix a bootup
issue with over 400 CPUs with KASAN enabled. I don't expect stable
kernels to pick up that fix as the cleanups would have also had to have
been picked up. Folks on larger CPU systems with modules will want to
just upgrade if vmalloc space has been an issue on bootup.
Given the size of this request, here's some more elaborate details
on this pull request.
The functional change change in this pull request is the very first
patch from Song Liu which replaces the struct module_layout with a new
struct module memory. The old data structure tried to put together all
types of supported module memory types in one data structure, the new
one abstracts the differences in memory types in a module to allow each
one to provide their own set of details. This paves the way in the
future so we can deal with them in a cleaner way. If you look at changes
they also provide a nice cleanup of how we handle these different memory
areas in a module. This change has been in linux-next since before the
merge window opened for v6.3 so to provide more than a full kernel cycle
of testing. It's a good thing as quite a bit of fixes have been found
for it.
Jason Baron then made dynamic debug a first class citizen module user by
using module notifier callbacks to allocate / remove module specific
dynamic debug information.
Nick Alcock has done quite a bit of work cross-tree to remove module
license tags from things which cannot possibly be module at my request
so to:
a) help him with his longer term tooling goals which require a
deterministic evaluation if a piece a symbol code could ever be
part of a module or not. But quite recently it is has been made
clear that tooling is not the only one that would benefit.
Disambiguating symbols also helps efforts such as live patching,
kprobes and BPF, but for other reasons and R&D on this area
is active with no clear solution in sight.
b) help us inch closer to the now generally accepted long term goal
of automating all the MODULE_LICENSE() tags from SPDX license tags
In so far as a) is concerned, although module license tags are a no-op
for non-modules, tools which would want create a mapping of possible
modules can only rely on the module license tag after the commit
8b41fc4454 ("kbuild: create modules.builtin without Makefile.modbuiltin
or tristate.conf"). Nick has been working on this *for years* and
AFAICT I was the only one to suggest two alternatives to this approach
for tooling. The complexity in one of my suggested approaches lies in
that we'd need a possible-obj-m and a could-be-module which would check
if the object being built is part of any kconfig build which could ever
lead to it being part of a module, and if so define a new define
-DPOSSIBLE_MODULE [0]. A more obvious yet theoretical approach I've
suggested would be to have a tristate in kconfig imply the same new
-DPOSSIBLE_MODULE as well but that means getting kconfig symbol names
mapping to modules always, and I don't think that's the case today. I am
not aware of Nick or anyone exploring either of these options. Quite
recently Josh Poimboeuf has pointed out that live patching, kprobes and
BPF would benefit from resolving some part of the disambiguation as
well but for other reasons. The function granularity KASLR (fgkaslr)
patches were mentioned but Joe Lawrence has clarified this effort has
been dropped with no clear solution in sight [1].
In the meantime removing module license tags from code which could never
be modules is welcomed for both objectives mentioned above. Some
developers have also welcomed these changes as it has helped clarify
when a module was never possible and they forgot to clean this up,
and so you'll see quite a bit of Nick's patches in other pull
requests for this merge window. I just picked up the stragglers after
rc3. LWN has good coverage on the motivation behind this work [2] and
the typical cross-tree issues he ran into along the way. The only
concrete blocker issue he ran into was that we should not remove the
MODULE_LICENSE() tags from files which have no SPDX tags yet, even if
they can never be modules. Nick ended up giving up on his efforts due
to having to do this vetting and backlash he ran into from folks who
really did *not understand* the core of the issue nor were providing
any alternative / guidance. I've gone through his changes and dropped
the patches which dropped the module license tags where an SPDX
license tag was missing, it only consisted of 11 drivers. To see
if a pull request deals with a file which lacks SPDX tags you
can just use:
./scripts/spdxcheck.py -f \
$(git diff --name-only commid-id | xargs echo)
You'll see a core module file in this pull request for the above,
but that's not related to his changes. WE just need to add the SPDX
license tag for the kernel/module/kmod.c file in the future but
it demonstrates the effectiveness of the script.
Most of Nick's changes were spread out through different trees,
and I just picked up the slack after rc3 for the last kernel was out.
Those changes have been in linux-next for over two weeks.
The cleanups, debug code I added and final fix I added for modules
were motivated by David Hildenbrand's report of boot failing on
a systems with over 400 CPUs when KASAN was enabled due to running
out of virtual memory space. Although the functional change only
consists of 3 lines in the patch "module: avoid allocation if module is
already present and ready", proving that this was the best we can
do on the modules side took quite a bit of effort and new debug code.
The initial cleanups I did on the modules side of things has been
in linux-next since around rc3 of the last kernel, the actual final
fix for and debug code however have only been in linux-next for about a
week or so but I think it is worth getting that code in for this merge
window as it does help fix / prove / evaluate the issues reported
with larger number of CPUs. Userspace is not yet fixed as it is taking
a bit of time for folks to understand the crux of the issue and find a
proper resolution. Worst come to worst, I have a kludge-of-concept [3]
of how to make kernel_read*() calls for modules unique / converge them,
but I'm currently inclined to just see if userspace can fix this
instead.
[0] https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/
[1] https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com
[2] https://lwn.net/Articles/927569/
[3] https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRG4m0SHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinQ2oP/0xlvKwJg6Ey8fHZF0qv8VOskE80zoLF
hMazU3xfqLA+1TQvouW1YBxt3jwS3t1Ehs+NrV+nY9Yzcm0MzRX/n3fASJVe7nRr
oqWWQU+voYl5Pw1xsfdp6C8IXpBQorpYby3Vp0MAMoZyl2W2YrNo36NV488wM9KC
jD4HF5Z6xpnPSZTRR7AgW9mo7FdAtxPeKJ76Bch7lH8U6omT7n36WqTw+5B1eAYU
YTOvrjRs294oqmWE+LeebyiOOXhH/yEYx4JNQgCwPdxwnRiGJWKsk5va0hRApqF/
WW8dIqdEnjsa84lCuxnmWgbcPK8cgmlO0rT0DyneACCldNlldCW1LJ0HOwLk9pea
p3JFAsBL7TKue4Tos6I7/4rx1ufyBGGIigqw9/VX5g0Iif+3BhWnqKRfz+p9wiMa
Fl7cU6u7yC68CHu1HBSisK16cYMCPeOnTSd89upHj8JU/t74O6k/ARvjrQ9qmNUt
c5U+OY+WpNJ1nXQydhY/yIDhFdYg8SSpNuIO90r4L8/8jRQYXNG80FDd1UtvVDuy
eq0r2yZ8C0XHSlOT9QHaua/tWV/aaKtyC/c0hDRrigfUrq8UOlGujMXbUnrmrWJI
tLJLAc7ePWAAoZXGSHrt0U27l029GzLwRdKqJ6kkDANVnTeOdV+mmBg9zGh3/Mp6
agiwdHUMVN7X
=56WK
-----END PGP SIGNATURE-----
Merge tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"The summary of the changes for this pull requests is:
- Song Liu's new struct module_memory replacement
- Nick Alcock's MODULE_LICENSE() removal for non-modules
- My cleanups and enhancements to reduce the areas where we vmalloc
module memory for duplicates, and the respective debug code which
proves the remaining vmalloc pressure comes from userspace.
Most of the changes have been in linux-next for quite some time except
the minor fixes I made to check if a module was already loaded prior
to allocating the final module memory with vmalloc and the respective
debug code it introduces to help clarify the issue. Although the
functional change is small it is rather safe as it can only *help*
reduce vmalloc space for duplicates and is confirmed to fix a bootup
issue with over 400 CPUs with KASAN enabled. I don't expect stable
kernels to pick up that fix as the cleanups would have also had to
have been picked up. Folks on larger CPU systems with modules will
want to just upgrade if vmalloc space has been an issue on bootup.
Given the size of this request, here's some more elaborate details:
The functional change change in this pull request is the very first
patch from Song Liu which replaces the 'struct module_layout' with a
new 'struct module_memory'. The old data structure tried to put
together all types of supported module memory types in one data
structure, the new one abstracts the differences in memory types in a
module to allow each one to provide their own set of details. This
paves the way in the future so we can deal with them in a cleaner way.
If you look at changes they also provide a nice cleanup of how we
handle these different memory areas in a module. This change has been
in linux-next since before the merge window opened for v6.3 so to
provide more than a full kernel cycle of testing. It's a good thing as
quite a bit of fixes have been found for it.
Jason Baron then made dynamic debug a first class citizen module user
by using module notifier callbacks to allocate / remove module
specific dynamic debug information.
Nick Alcock has done quite a bit of work cross-tree to remove module
license tags from things which cannot possibly be module at my request
so to:
a) help him with his longer term tooling goals which require a
deterministic evaluation if a piece a symbol code could ever be
part of a module or not. But quite recently it is has been made
clear that tooling is not the only one that would benefit.
Disambiguating symbols also helps efforts such as live patching,
kprobes and BPF, but for other reasons and R&D on this area is
active with no clear solution in sight.
b) help us inch closer to the now generally accepted long term goal
of automating all the MODULE_LICENSE() tags from SPDX license tags
In so far as a) is concerned, although module license tags are a no-op
for non-modules, tools which would want create a mapping of possible
modules can only rely on the module license tag after the commit
8b41fc4454 ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf").
Nick has been working on this *for years* and AFAICT I was the only
one to suggest two alternatives to this approach for tooling. The
complexity in one of my suggested approaches lies in that we'd need a
possible-obj-m and a could-be-module which would check if the object
being built is part of any kconfig build which could ever lead to it
being part of a module, and if so define a new define
-DPOSSIBLE_MODULE [0].
A more obvious yet theoretical approach I've suggested would be to
have a tristate in kconfig imply the same new -DPOSSIBLE_MODULE as
well but that means getting kconfig symbol names mapping to modules
always, and I don't think that's the case today. I am not aware of
Nick or anyone exploring either of these options. Quite recently Josh
Poimboeuf has pointed out that live patching, kprobes and BPF would
benefit from resolving some part of the disambiguation as well but for
other reasons. The function granularity KASLR (fgkaslr) patches were
mentioned but Joe Lawrence has clarified this effort has been dropped
with no clear solution in sight [1].
In the meantime removing module license tags from code which could
never be modules is welcomed for both objectives mentioned above. Some
developers have also welcomed these changes as it has helped clarify
when a module was never possible and they forgot to clean this up, and
so you'll see quite a bit of Nick's patches in other pull requests for
this merge window. I just picked up the stragglers after rc3. LWN has
good coverage on the motivation behind this work [2] and the typical
cross-tree issues he ran into along the way. The only concrete blocker
issue he ran into was that we should not remove the MODULE_LICENSE()
tags from files which have no SPDX tags yet, even if they can never be
modules. Nick ended up giving up on his efforts due to having to do
this vetting and backlash he ran into from folks who really did *not
understand* the core of the issue nor were providing any alternative /
guidance. I've gone through his changes and dropped the patches which
dropped the module license tags where an SPDX license tag was missing,
it only consisted of 11 drivers. To see if a pull request deals with a
file which lacks SPDX tags you can just use:
./scripts/spdxcheck.py -f \
$(git diff --name-only commid-id | xargs echo)
You'll see a core module file in this pull request for the above, but
that's not related to his changes. WE just need to add the SPDX
license tag for the kernel/module/kmod.c file in the future but it
demonstrates the effectiveness of the script.
Most of Nick's changes were spread out through different trees, and I
just picked up the slack after rc3 for the last kernel was out. Those
changes have been in linux-next for over two weeks.
The cleanups, debug code I added and final fix I added for modules
were motivated by David Hildenbrand's report of boot failing on a
systems with over 400 CPUs when KASAN was enabled due to running out
of virtual memory space. Although the functional change only consists
of 3 lines in the patch "module: avoid allocation if module is already
present and ready", proving that this was the best we can do on the
modules side took quite a bit of effort and new debug code.
The initial cleanups I did on the modules side of things has been in
linux-next since around rc3 of the last kernel, the actual final fix
for and debug code however have only been in linux-next for about a
week or so but I think it is worth getting that code in for this merge
window as it does help fix / prove / evaluate the issues reported with
larger number of CPUs. Userspace is not yet fixed as it is taking a
bit of time for folks to understand the crux of the issue and find a
proper resolution. Worst come to worst, I have a kludge-of-concept [3]
of how to make kernel_read*() calls for modules unique / converge
them, but I'm currently inclined to just see if userspace can fix this
instead"
Link: https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/ [0]
Link: https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com [1]
Link: https://lwn.net/Articles/927569/ [2]
Link: https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org [3]
* tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (121 commits)
module: add debugging auto-load duplicate module support
module: stats: fix invalid_mod_bytes typo
module: remove use of uninitialized variable len
module: fix building stats for 32-bit targets
module: stats: include uapi/linux/module.h
module: avoid allocation if module is already present and ready
module: add debug stats to help identify memory pressure
module: extract patient module check into helper
modules/kmod: replace implementation with a semaphore
Change DEFINE_SEMAPHORE() to take a number argument
module: fix kmemleak annotations for non init ELF sections
module: Ignore L0 and rename is_arm_mapping_symbol()
module: Move is_arm_mapping_symbol() to module_symbol.h
module: Sync code of is_arm_mapping_symbol()
scripts/gdb: use mem instead of core_layout to get the module address
interconnect: remove module-related code
interconnect: remove MODULE_LICENSE in non-modules
zswap: remove MODULE_LICENSE in non-modules
zpool: remove MODULE_LICENSE in non-modules
x86/mm/dump_pagetables: remove MODULE_LICENSE in non-modules
...
Here is the large set of driver core changes for 6.4-rc1.
Once again, a busy development cycle, with lots of changes happening in
the driver core in the quest to be able to move "struct bus" and "struct
class" into read-only memory, a task now complete with these changes.
This will make the future rust interactions with the driver core more
"provably correct" as well as providing more obvious lifetime rules for
all busses and classes in the kernel.
The changes required for this did touch many individual classes and
busses as many callbacks were changed to take const * parameters
instead. All of these changes have been submitted to the various
subsystem maintainers, giving them plenty of time to review, and most of
them actually did so.
Other than those changes, included in here are a small set of other
things:
- kobject logging improvements
- cacheinfo improvements and updates
- obligatory fw_devlink updates and fixes
- documentation updates
- device property cleanups and const * changes
- firwmare loader dependency fixes.
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5
LEGadNS38k5fs+73UaxV
=7K4B
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the large set of driver core changes for 6.4-rc1.
Once again, a busy development cycle, with lots of changes happening
in the driver core in the quest to be able to move "struct bus" and
"struct class" into read-only memory, a task now complete with these
changes.
This will make the future rust interactions with the driver core more
"provably correct" as well as providing more obvious lifetime rules
for all busses and classes in the kernel.
The changes required for this did touch many individual classes and
busses as many callbacks were changed to take const * parameters
instead. All of these changes have been submitted to the various
subsystem maintainers, giving them plenty of time to review, and most
of them actually did so.
Other than those changes, included in here are a small set of other
things:
- kobject logging improvements
- cacheinfo improvements and updates
- obligatory fw_devlink updates and fixes
- documentation updates
- device property cleanups and const * changes
- firwmare loader dependency fixes.
All of these have been in linux-next for a while with no reported
problems"
* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
device property: make device_property functions take const device *
driver core: update comments in device_rename()
driver core: Don't require dynamic_debug for initcall_debug probe timing
firmware_loader: rework crypto dependencies
firmware_loader: Strip off \n from customized path
zram: fix up permission for the hot_add sysfs file
cacheinfo: Add use_arch[|_cache]_info field/function
arch_topology: Remove early cacheinfo error message if -ENOENT
cacheinfo: Check cache properties are present in DT
cacheinfo: Check sib_leaf in cache_leaves_are_shared()
cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
cacheinfo: Add arm64 early level initializer implementation
cacheinfo: Add arch specific early level initializer
tty: make tty_class a static const structure
driver core: class: remove struct class_interface * from callbacks
driver core: class: mark the struct class in struct class_interface constant
driver core: class: make class_register() take a const *
driver core: class: mark class_release() as taking a const *
driver core: remove incorrect comment for device_create*
MIPS: vpe-cmp: remove module owner pointer from struct class usage.
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmRIKooUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxq7A/9G0sInrqvqH2I9/Set/FnmMfCtGDH
YcEjHYYxL+pztSiXTavDV+ib9iaut83oYtcV9p1bUMhJoZdKNZhrNdIGzRFSemI4
0/ShtklPzNEu6nPPL24CnEzgbrODBU56ZvzrIE/tShEoOjkKa1triBnOA/JMxYTL
cUwqDQlDkdpYniCgxy05QfcFZ0mmSOkbl7runGfTMTiUKKC3xSRiaW5YN9KZe3i7
G5YHu1VVCjeQdQSICHYwyFmkyiqosCoajQNp1IHBkWqSwilzyZMg0NWJobVSA7M/
mXXnzLtFcC60oT58/9MaggQwDTaSGDE8mG+sWv05bB2u5TQVyZEZqZ4c2FzmZIZT
WLZYLB6PFRW0zePEuMnVkSLS2npkX+aGaBv28bf88sjorpaYNG01uYijnLEceolQ
yBPFRN3bsRuOyHvYY/tiZX/BP7z/DS++XXwA8zQWZnYsXSlncJdwCNquV0xIwUt+
hij4/Yu7o9SgV1LbuwtkMFAn3C9Szc65Eer+IvRRdnMZYphjVHbA5F2msRFyiCeR
HxECtMQ1jBnVrpQAcBX1Sz+Vu5MrwCqzc2n6tvTQHDvVNjXfkG3NaFhxYPc1IL9Z
NJMeCKfK1qzw7TtbvWXCluTTIM9N/bNJXrJhQbjNY7V6IaBZY1QNYW0ZFfGgj6Gb
UUPgndidRy4/hzw=
=HPXl
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Resource management:
- Add pci_dev_for_each_resource() and pci_bus_for_each_resource()
iterators
PCIe native device hotplug:
- Fix AB-BA deadlock between reset_lock and device_lock
Power management:
- Wait longer for devices to become ready after resume (as we do for
reset) to accommodate Intel Titan Ridge xHCI devices
- Extend D3hot delay for NVIDIA HDA controllers to avoid
unrecoverable devices after a bus reset
Error handling:
- Clear PCIe Device Status after EDR since generic error recovery now
only clears it when AER is native
ASPM:
- Work around Chromebook firmware defect that clobbers Capability
list (including ASPM L1 PM Substates Cap) when returning from
D3cold to D0
Freescale i.MX6 PCIe controller driver:
- Install imprecise external abort handler only when DT indicates
PCIe support
Freescale Layerscape PCIe controller driver:
- Add ls1028a endpoint mode support
Qualcomm PCIe controller driver:
- Add SM8550 DT binding and driver support
- Add SDX55 DT binding and driver support
- Use bulk APIs for clocks of IP 1.0.0, 2.3.2, 2.3.3
- Use bulk APIs for reset of IP 2.1.0, 2.3.3, 2.4.0
- Add DT "mhi" register region for supported SoCs
- Expose link transition counts via debugfs to help debug low power
issues
- Support system suspend and resume; reduce interconnect bandwidth
and turn off clock and PHY if there are no active devices
- Enable async probe by default to reduce boot time
Miscellaneous:
- Sort controller Kconfig entries by vendor"
* tag 'pci-v6.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (56 commits)
PCI: xilinx: Drop obsolete dependency on COMPILE_TEST
PCI: mobiveil: Sort Kconfig entries by vendor
PCI: dwc: Sort Kconfig entries by vendor
PCI: Sort controller Kconfig entries by vendor
PCI: Use consistent controller Kconfig menu entry language
PCI: xilinx-nwl: Add 'Xilinx' to Kconfig prompt
PCI: hv: Add 'Microsoft' to Kconfig prompt
PCI: meson: Add 'Amlogic' to Kconfig prompt
PCI: Use of_property_present() for testing DT property presence
PCI/PM: Extend D3hot delay for NVIDIA HDA controllers
dt-bindings: PCI: qcom: Document msi-map and msi-map-mask properties
PCI: qcom: Add SM8550 PCIe support
dt-bindings: PCI: qcom: Add SM8550 compatible
PCI: qcom: Add support for SDX55 SoC
dt-bindings: PCI: qcom-ep: Fix the unit address used in example
dt-bindings: PCI: qcom: Add SDX55 SoC
dt-bindings: PCI: qcom: Update maintainers entry
PCI: qcom: Enable async probe by default
PCI: qcom: Add support for system suspend and resume
PCI/PM: Drop pci_bridge_wait_for_secondary_bus() timeout parameter
...
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPUs roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGuYgSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5kQUP/jV5Q8ZeVCzlf6ZCeAHnWX/Hahsv6i6H
ooNL8W6p8FI5xlYOWh8J02JpmLUrNWURCPqvr0oYLm4r1UlJ/OGjyuKB8d7SZ7z/
RaLN7tppMod527J+Qm3ptHQbTKAGHe4dEoiX46cuvTEcCxrsVykYltvfD1rNuSQA
VcaNJkkcHv/KuItUHLAuntCAiFvbD1gYNLfUAC7e0htGjLRLxg3+ugHEiFcJ3c6y
z4ged1toYLGD962jWSIgokFbivfUNZT25WlZjBliMa/E8+ckTAzmc67UJYvhNBOM
HyAHs0hp+XtSgfcCgNkI+WDrFXXgxa+QQcMFvRWacS3Hx6tgJoQ51FRMevmumn0O
zBPk3+BOquhknqb5NbmwRZoLExffo+86fFlDcgszzvV4Y/vBfp/XTsuJZCnaiMDZ
wdmJoF4mhRDtgt7yORltpjHqp3yRmLqMNUb3sxXLRA9D+edo9mr8SXujOnukmXoH
o/ZpEollTPUQ7od/uDIvDyosWvb65IbYwsKGdOanfBacVrxy5OPM38mPF7u9AyzD
Gn81H/OhwhpTSBAX7kLMGeK/QGkyIBEUM1levdmcAk0nKYQzHzsI7tMYfqwXuQSu
qKAcF+qtpOReWmb4KaJZ7c0HQIBQHOKQ6exXxnQJuLjnAHS0674NxMDkT5a1EGRL
Q9OPSTSYBMDC
=FOnk
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-vmx-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.4:
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- Misc cleanups
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGuLISHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5NOMQAKy1Od54yzQsIKyAZZJVfOEm7N5VLQgz
+jLilXgHd8dm/g0g/KVCDPFoZ/ut2Tf5Dn4WwyoPWOpgGsOyTwdDIJabf9rustkA
goZFcfUXz+P1nangTidrj6CFYgGmVS13Uu//H19X4bSzT+YifVevJ4QkRVElj9Mh
VBUeXppC/gMGBZ9tKEzl+AU3FwJ58cB88q4boovBFYiDdciv/fF86t02Lc+dCIX1
6hTcOAnjAcp3eJY0wPQJUAEScufDKcMf6tSrsB/yWXv9KB9ANXFNXry8/+lW/Ux/
oOUmUVdRXrrsRUqtYk9+KuMoIN7CL1SBV0RCm5ApqwqwnTVdHS+odHU3c2s7E/uU
QXIW4vwSne3W9Y4YApDgFjwDwmzY85dvblWlWBnR2LW2I3Or48xK+S8LpWG+lj6l
EDf7RzeqAipJ1qUq6qDYJlyg/YsyYlcoErtra423skg38HBWxQXdqkVIz3SYdKjA
0OcBQIRI28KzJDn1gU6P3Q0Wr/cKsx9EGy6+jWBhf4Yf3eHP7+3WUTrg/Up0q8ny
0j/+cbe5kBb6k2T9y2X6jm6TVbPV5FyMBOF/UxmqEbRLmxXjBe8tMnFwV+qN871I
gk5HTSIkX39GU9kNA3h5HoWjdNeRfhazKR9ZVrELVc1zjHnGLthXBPZbIAUsPPMx
vgM6jf8NwLXZ
=9xNX
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.4:
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
- Don't advertisze XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
not being reported due to userspace not opting in via prctl()
- Overhaul the AMX selftests to improve coverage and cleanup the test
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGt50SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5MskP/2PhSrdgHxCwfpqpdVe/q5OWwFuhn3wG
f5QKMpEBg4wJFeIE3eGJEaDlg776nWtWDNgUmqdjoZ8vyyadkPX9CV2Y2Hq0M7Tw
d0gKPjQrz2BavyDYoPNfs4pfshs4EvDTswBkhdAt8KTZhGZosJOywQIp61V3ePqr
1rDP6C4+CmwTRAK0f7egslyJ2pZXiUcvhITvzx8XhIAQh6nEK4gUZ/l3hLmg38kD
Af23kiLnP8lHUUx4BQtRAnTw0SZXJ8DcKtoFkzEH8mdj4g6EqXpxy48zuyZcqWVi
4XIFr+WECPsV5gdqWN9rMDqIG2ib+2heKDmcdUptcVuvr1ktv0reQybmgVck4CKX
fTAdu86/LBaQmIHwNHaNFPwdUby4QQZ8ajafPC62oc+B6N1lQg8bbCwnvO6KGlGl
FaQTnzaZq7ft4tfQRXOMu1AbLZLK7dIqJHHhxR3MkBkd4MAcZ1bVKkvlJLqsOKNw
TEsreXErY7AsegZK73Rn4IN/CJGBof5bZ2NIchmiN+0UfMsd9zGn66Als6oRNh4E
tRUhFONPIEmydy9UB50qe6b98ElB6R++opZbvkVW2hy8lMy3iJrCvUbOs1nx3wbn
cxvIuTfw/dAFf70S03/zudf7lYHs2wKV1rrIAebyTd4NnvWdVB8OaSHgZswMgVjb
UzzQfnQ+u9so
=BY10
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-selftests-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM selftests, and an AMX/XCR0 bugfix, for 6.4:
- Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is
not being reported due to userspace not opting in via prctl()
- Overhaul the AMX selftests to improve coverage and cleanup the test
- Misc cleanups
- Disallow virtualizing legacy LBRs if architectural LBRs are available,
the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better
validate PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- Misc cleanups and fixes
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGtd4SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5Z9kP/i3WZ40hevvQvB/5cEpxxmxYDwCYnnjM
hiQgK5jT4SrMTmVjLgkNdI2PogQoS4CX+GC7lcA9bvse84hjuPvgOflb2B+p2UQi
Ytbr9g/tfKNIpnKIk9mcPcSObN9vm2Kgt7n28rtPrHWj89eQzgc66eijqdpKBLxA
c3crVR8krwYAQK0tmzHq1+H6hB369YbHAHyTTRRI/bNWnqKblnvUbt0NL2aBusa9
rNMaOdRtinLpy2dmuX/b3japRB8QTnlf7zpPIF4cBEhbYXy5woClZpf1D2fCA6Er
XFbEoYawMVd9UeJYbW4z5yErLT83eYoGp4U0eFXWp6fvh8nZlgCGvBKE9g4mmqwj
aSLaTR5eVN2qlw6jXVeg3unCo8Eyl36AwYwve2L6sFmBvZvNV5iz2eQ7rrOe4oE3
dnTUaLQ8I2SVg04MbYmCq5W+frTL/I7kqNpbccL1Z3R5WO4y5gz63mug6NfLIvhR
t45TAIaifxBfcXQsBZM3v2KUK/xQrD3AbJmFKh54L2CKqiGaNWsMLX+6NZ7LZWgf
8rEqsVkkQDgF7z8eXai4TR26nYfSX6g9gDqtOH73L87aJ7PJk5cRoDWQ1sWs1e/l
4HA/L0Bo/3pnKAa0ZWxJOixmzqY49gNQf3dj8gt3jk3y2ijbAivshiSpPBmIxn0u
QLeOf/LGvipl
=m18F
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.4:
- Disallow virtualizing legacy LBRs if architectural LBRs are available,
the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES)
after KVM_RUN, and overhaul the vmx_pmu_caps selftest to better
validate PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- Misc cleanups and fixes
- Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
guest is only adding new PTEs
- Overhaul .sync_page() and .invlpg() to share the .sync_page()
implementation, i.e. utilize .sync_page()'s optimizations when emulating
invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
changed SPTE" overhead associated with writing the entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid having
to walk (potentially) all descriptors during insertion and deletion,
which gets quite expensive if the guest is spamming fork()
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGsvASHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5XnoP/0D8rQmrA0xPHK81zYS1E71tsR/itO/T
CQMSB4PhEqvcRUaWOuhLBRUW+noWzaOkjkMYK2uoPTdtme7v9+Ar7EtfrWYHrBWD
IxHCAymo3a5dQPUc3Nb77u6HjRAOokPSqSz5jE4qAjlniW09feruro2Phi+BTme4
JjxTc/7Oh0Fu26+mK7mJHiw3fV1x3YznnnRPrKGrVQes5L6ozNICkUZ6nvuJUVMk
lTNHNQbG8PqJZnfWG7VIKRn1vdfXwEfnvyucGVEqFfPLkOXqJHyqMVmIOtvsH7C5
l8j36+lBZwtFh2jk2EsXOTb6sS7l1MSvyHLlbaJaqqffP+77Hf1n0fROur0k9Yse
jJJejJWxZ/SvjMt/bOA+4ybGafZH0lt20DsDWnat5GSQ1EVT1CInN2p8OY8pdecR
QOJBqnNUOykC7/Pyad+IxTxwrOSNCYh+5aYG8AdGquZvNUEwjffVJqrmxDvklY8Z
DTYwGKgNY7NsP/dV0WYYElsAuHiKwiDZL15KftiQebO1fPcZDpTzDo83/8UMfGxh
yegngcNX9Qi7lWtLkUMy8A99UvejM0QrS/Zt8v1zjlQ8PjreZLLBWsNpe0ufIMRk
31ZAC2OS4Koi3wZ54tA7Z1Kh11meGhAk5Ti7sNke0rDqB9UMmj6UKw121cSRvW7q
W6O4U3YeGpKx
=zb4u
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.4:
- Tweak FNAME(sync_spte) to avoid unnecessary writes+flushes when the
guest is only adding new PTEs
- Overhaul .sync_page() and .invlpg() to share the .sync_page()
implementation, i.e. utilize .sync_page()'s optimizations when emulating
invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single
A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle
changed SPTE" overhead associated with writing the entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid having
to walk (potentially) all descriptors during insertion and deletion,
which gets quite expensive if the guest is spamming fork()
- Misc cleanups
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
and by giving the guest control of CR0.WP when EPT is enabled on VMX
(VMX-only because SVM doesn't support per-bit controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Misc cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmRGr2sSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5b80P/2ayACpc7iV2DysXkrxOdn1JmMu9BeHd
3oMb7bydf79LMNAO+NKPqVjo74yZ/Lh8UyufJGgF3HnSCdumx5Iklyx6/2PUHu/I
8xT1H7VlIGQMcNy0G4hMus34ZcafJl4y+BXgMEqEErLcy3n598UvFGJ+C0/4lnux
2Gk7dLASHq/mVVKReBM/kD4RhCVy5Venz6zkk9KbwDLHAmfejVK5bSqDYAnO1WtV
IBWetxlVyMZCnfPV2drhzgNVwiHvYvCaMBW+cUk5cH8Z2r0VZVDERmc1D4/rd04t
xs9lMk6CdNU7REQfblA0xMgeO/dNAXq5Fs4FfcM8OTBZU32KKafPhgW1uj2Sv+9l
nbb1XxZ7C0EcBhKVbUD6zRl05vjHwxlRgoi0yWUqERthFKNXHV42JJgaNn4fxDYS
tOBKBNkM9z6tCGN2aZv6GwhsEyY2y7oLdbZUGK9/FM3mF1VBASms1BTwokJXTxCD
pkOpAGeN5hxOlC4/wl6iHJTrz9oaJUj5E5kMD1oK6oQJgnnfqH0kVTG/ui/OUtJg
8N3amYO/d7InFvuE0f9R6TqZVhTN2QefHmNJaEldsmYp1NMI8Ep8JIhQKRA2LZVE
CGRxyrPj5CESerAItAI6tshEre5W8aScEzhpmd6HgHmahhQJsCEj+3q/J8FPWLG/
iQ3GnggrklfU
=qj7D
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.4' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.4:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled,
and by giving the guest control of CR0.WP when EPT is enabled on VMX
(VMX-only because SVM doesn't support per-bit controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return
as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Misc cleanups
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features
being moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one.
This last part allows the NV timer code to be implemented on
top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmRCZIwPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDoZ8P/ioXAdDbAE4hTuyD2YdKJ3IGWN3pg52Z7xc2
rBXXFrbK9+n9FEc3AVdHoGsRPDP0Ynl+apj+aB0Klr/Fl0KKqac+W0ARX9rn1mI1
HjeygFPaGnXjMUp0BjeSLS+g3b0gebELJ6R1QEe1/MIPb8Se7M1y3ZpMWdhe0PPL
vyzw3LZq2OAlLgWKZhAfhh03qdr2kqJxypYs6nMrcexfn8dXT78dsYKW1nXmqKcE
61Gg23MDPUoexYpUhm+ym5t8hltoI1di8faPmxEpaFzpSDyAg8V5vo6LiW9jn3cf
RX0Sikk1laiRAhVbbIFCKC148vFyKxum3scpKyb91Qc+sK1kmIcxvEqlc6SfG9je
+5ndZwAfXtW6SMSOyX8y5fXbee7M0sx3n3le9BNgwXfmLWg/GHXJ544dJgVIlf/e
0Z+8QnP1IUDfARR/b2FlW7A7XLzNHQzO379ekcAdUptbGwlS9CrW6SJ83QR7K6fB
bh0aSSELKsD7pX8wnNyNACvmz2zL12ITlDKdZWUr8MSxyTjgVy7s0BDsQT3sbrA1
1sH++RvUWfC2k7tVT3vjZFzUDlPw3bnZmo5YMWRTMbXEdr1V5rDw5F5IXit13KeT
8bk0hnJgnLmyoX2A17v5dkFMIKD7p13tqDRdfFcn0ru63HIKxgkS3ITkDmsAQELK
DHT7RBE0
=Bhta
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.4
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features
being moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one.
This last part allows the NV timer code to be implemented on
top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
API:
- Total usage stats now include all that returned error (instead of some).
- Remove maximum hash statesize limit.
- Add cloning support for hmac and unkeyed hashes.
- Demote BUG_ON in crypto_unregister_alg to a WARN_ON.
Algorithms:
- Use RIP-relative addressing on x86 to prepare for PIE build.
- Add accelerated AES/GCM stitched implementation on powerpc P10.
- Add some test vectors for cmac(camellia).
- Remove failure case where jent is unavailable outside of FIPS mode in drbg.
- Add permanent and intermittent health error checks in jitter RNG.
Drivers:
- Add support for 402xx devices in qat.
- Add support for HiSTB TRNG.
- Fix hash concurrency issues in stm32.
- Add OP-TEE firmware support in caam.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmRGCjcACgkQxycdCkmx
i6d6JA//ZmwgEqAKA8qWpHnNKZylTLqFhLxnKZwr4Hhp1KzManh/T9pepXiD2zAY
D92wU60v0hfGAazeUWQRmrIZxcjyd3b3Tr7WiFuNoZbkPsuXWZAoz8iHgMq69dqb
DXZhKJnlmVlcr+qTSk9MP8HODL5kU6Ug2pk+r8hL/WsBI+JGfZEXKcJhhMqYLYls
nl+NN4fkE5tgcTh2lp/9dQsQRylhESZuqb8L2wItQmripSbhPGwYf24I7B7xcGrn
o7X4XG//cQO6zQErgnOJOosIgJEEynW27CN4ZiHB8WhRAk0YLXydQBs6EjZgNA8H
EvZC/bIx2YOt8ngG99q4kRg4OgKp4c7UnV6l1pxuJWbIyXrFh4djxHdq9pTYr3UB
P3pVEX38Wu7U5Tfgy3y1QqZzsvrPjmnI3NQ8QBrcFzNRDan5K6nH4kQyk9Cv7LQm
GlE1JOThU5U2G33ZWKCluJUjVUCRceMWQYla1X5R4uWMCwSqRMpmx8Ib9QvbYlWe
iUI+RatLnlIobx+lgaC8mtij9dQddFjk6YwFYhQcD3Bl30DhTeIlbnOUY9YOTXps
H6V9X2inVUjyZr1uJ4a7rPdCUuzQxR6HWPyp6fXMlbLrEhL8e6c4/QbEoTubRQeS
WTtoIFt4ezd2SG6hI6dTCscgFc5EAyEMDD5GtQmJeyozu0Gqtpo=
=ITkW
-----END PGP SIGNATURE-----
Merge tag 'v6.4-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Total usage stats now include all that returned errors (instead of
just some)
- Remove maximum hash statesize limit
- Add cloning support for hmac and unkeyed hashes
- Demote BUG_ON in crypto_unregister_alg to a WARN_ON
Algorithms:
- Use RIP-relative addressing on x86 to prepare for PIE build
- Add accelerated AES/GCM stitched implementation on powerpc P10
- Add some test vectors for cmac(camellia)
- Remove failure case where jent is unavailable outside of FIPS mode
in drbg
- Add permanent and intermittent health error checks in jitter RNG
Drivers:
- Add support for 402xx devices in qat
- Add support for HiSTB TRNG
- Fix hash concurrency issues in stm32
- Add OP-TEE firmware support in caam"
* tag 'v6.4-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (139 commits)
i2c: designware: Add doorbell support for Mendocino
i2c: designware: Use PCI PSP driver for communication
powerpc: Move Power10 feature PPC_MODULE_FEATURE_P10
crypto: p10-aes-gcm - Remove POWER10_CPU dependency
crypto: testmgr - Add some test vectors for cmac(camellia)
crypto: cryptd - Add support for cloning hashes
crypto: cryptd - Convert hash to use modern init_tfm/exit_tfm
crypto: hmac - Add support for cloning
crypto: hash - Add crypto_clone_ahash/shash
crypto: api - Add crypto_clone_tfm
crypto: api - Add crypto_tfm_get
crypto: x86/sha - Use local .L symbols for code
crypto: x86/crc32 - Use local .L symbols for code
crypto: x86/aesni - Use local .L symbols for code
crypto: x86/sha256 - Use RIP-relative addressing
crypto: x86/ghash - Use RIP-relative addressing
crypto: x86/des3 - Use RIP-relative addressing
crypto: x86/crc32c - Use RIP-relative addressing
crypto: x86/cast6 - Use RIP-relative addressing
crypto: x86/cast5 - Use RIP-relative addressing
...
Move the implementation of fb_pgprotect() to fbdev.c and include
<asm/fb.h>. Fixes the following warning:
../arch/x86/video/fbdev.c:14:5: warning: no previous prototype for 'fb_is_primary_device' [-Wmissing-prototypes]
14 | int fb_is_primary_device(struct fb_info *info)
| ^~~~~~~~~~~~~~~~~~~~
Just including <asm/fb.h> results in a number of built-in errors
about undefined function. Moving fb_pgprotect() to the source file
avoids the required include statements in the header. The function
is only called occasionally from fb_mmap(), [1] so having it as static
inline had no benefit.
While at it, fix the codying style in fbdev.c.
Link: https://elixir.bootlin.com/linux/v6.3-rc7/source/drivers/video/fbdev/core/fbmem.c#L1404 # 1
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230424084751.14641-1-tzimmermann@suse.de
Highlights:
- AMD PMC and PMF drivers:
- Numerous bugfixes
- Intel Speed Select Technology (ISST):
- TPMI (Topology Aware Register and PM Capsule Interface) support
for ISST support on upcoming processor models
- Various other improvements / new hw support
- tools/intel-speed-select: TPMI support + other improvements
- Intel In Field Scan (IFS):
- Add Array Bist test support
- New drivers:
- intel_bytcrc_pwrsrc Crystal Cove PMIC pwrsrc / reset-reason driver
- lenovo-ymc Yoga Mode Control driver for reporting SW_TABLET_MODE
- msi-ec Driver for MSI laptop EC features like battery charging limits
- apple-gmux:
- Support for new MMIO based models (T2 Macs)
- Honor acpi_backlight= auto-detect-code + kernel cmdline option
to switch between gmux and apple_bl backlight drivers and remove
own custom handling for this
- x86-android-tablets: Refactor / cleanup + new hw support
- Miscellaneous other cleanups / fixes
The following is an automated git shortlog grouped by driver:
Add driver for Yoga Tablet Mode switch:
- Add driver for Yoga Tablet Mode switch
Add intel_bytcrc_pwrsrc driver:
- Add intel_bytcrc_pwrsrc driver
Add new msi-ec driver:
- Add new msi-ec driver
Documentation/ABI:
- Update IFS ABI doc
ISST:
- unlock on error path in tpmi_sst_init()
- Add suspend/resume callbacks
- Add SST-TF support via TPMI
- Add SST-BF support via TPMI
- Add SST-PP support via TPMI
- Add SST-CP support via TPMI
- Parse SST MMIO and update instance
- Enumerate TPMI SST and create framework
- Add support for MSR 0x54
- Add API version of the target
- Add IOCTL default callback
- Add TPMI target
Merge remote-tracking branch 'intel-speed-select/intel-sst' into review-hans:
- Merge remote-tracking branch 'intel-speed-select/intel-sst' into review-hans
Merge tag 'ib-pdx86-backlight-6.4' into review-hans:
- Merge tag 'ib-pdx86-backlight-6.4' into review-hans
Move ideapad ACPI helpers to a new header:
- Move ideapad ACPI helpers to a new header
acer-wmi:
- Convert to platform remove callback returning void
acerhdf:
- Remove unneeded semicolon
adv_swbutton:
- Convert to platform remove callback returning void
amilo-rfkill:
- Convert to platform remove callback returning void
apple-gmux:
- Fix iomem_base __iomem annotation
- return -EFAULT if copy fails
- Update apple_gmux_detect documentation
- Add acpi_video_get_backlight_type() check
- add debugfs interface
- support MMIO gmux on T2 Macs
- refactor gmux types
- use first bit to check switch state
backlight:
- apple_bl: Use acpi_video_get_backlight_type()
barco-p50-gpio:
- Convert to platform remove callback returning void
classmate:
- mark SPI related data as maybe unused
compal-laptop:
- Convert to platform remove callback returning void
dell:
- dell-smo8800: Convert to platform remove callback returning void
- dcdbas: Convert to platform remove callback returning void
dell-laptop:
- Register ctl-led for speaker-mute
hp:
- tc1100-wmi: Convert to platform remove callback returning void
- hp_accel: Convert to platform remove callback returning void
huawei-wmi:
- Convert to platform remove callback returning void
ideapad-laptop:
- Convert to platform remove callback returning void
intel:
- vbtn: Convert to platform remove callback returning void
- telemetry: pltdrv: Convert to platform remove callback returning void
- pmc: core: Convert to platform remove callback returning void
- mrfld_pwrbtn: Convert to platform remove callback returning void
- int3472: discrete: Convert to platform remove callback returning void
- int1092: intel_sar: Convert to platform remove callback returning void
- int0002_vgpio: Convert to platform remove callback returning void
- hid: Convert to platform remove callback returning void
- chtwc_int33fe: Convert to platform remove callback returning void
- chtdc_ti_pwrbtn: Convert to platform remove callback returning void
- bxtwc_tmu: Convert to platform remove callback returning void
intel-uncore-freq:
- Add client processors
mlxbf-bootctl:
- Add sysfs file for BlueField boot fifo
pcengines-apuv2:
- Drop platform:pcengines-apuv2 module-alias
platform/mellanox:
- add firmware reset support
platform/olpc:
- olpc-xo175-ec: Use SPI device ID data to bind device
platform/surface:
- aggregator_registry: Add support for tablet-mode switch on Surface Pro 9
- aggregator_tabletsw: Add support for Type-Cover posture source
- aggregator_tabletsw: Properly handle different posture source IDs
platform/x86/amd:
- pmc: provide user message where s0ix is not supported
- pmc: Remove __maybe_unused from amd_pmc_suspend_handler()
- pmc: Convert to platform remove callback returning void
- pmc: Fix memory leak in amd_pmc_stb_debugfs_open_v2()
- pmc: Move out of BIOS SMN pair for STB init
- pmc: Utilize SMN index 0 for driver probe
- pmc: Move idlemask check into `amd_pmc_idlemask_read`
- pmc: Don't dump data after resume from s0i3 on picasso
- pmc: Hide SMU version and program attributes for Picasso
- pmc: Don't try to read SMU version on Picasso
- pmf: core: Convert to platform remove callback returning void
- hsmp: Convert to platform remove callback returning void
platform/x86/amd/pmf:
- Move out of BIOS SMN pair for driver probe
platform/x86/intel:
- vsec: Use intel_vsec_dev_release() to simplify init() error cleanup
- vsec: Explicitly enable capabilities
platform/x86/intel/ifs:
- Update IFS doc
- Implement Array BIST test
- Sysfs interface for Array BIST
- Introduce Array Scan test to IFS
- IFS cleanup
- Reorganize driver data
- Separate ifs_pkg_auth from ifs_data
platform/x86/intel/pmc/mtl:
- Put GNA/IPU/VPU devices in D3
platform/x86/intel/pmt:
- Ignore uninitialized entries
- Add INTEL_PMT module namespace
platform/x86/intel/sdsi:
- Change mailbox timeout
samsung-q10:
- Convert to platform remove callback returning void
serial-multi-instantiate:
- Convert to platform remove callback returning void
sony:
- mark SPI related data as maybe unused
think-lmi:
- Remove unnecessary casts for attributes
- Remove custom kobject sysfs_ops
- Properly interpret return value of tlmi_setting
thinkpad_acpi:
- Fix Embedded Controller access on X380 Yoga
tools/power/x86/intel-speed-select:
- Update version
- Change TRL display for Emerald Rapids
- Identify Emerald Rapids
- Display AMX base frequency
- Use cgroup v2 isolation
- Add missing free cpuset
- Fix clos-max display with TPMI I/F
- Add cpu id check
- Avoid setting duplicate tdp level
- Remove cpu mask display for non-cpu power domain
- Hide invalid TRL level
- Display fact info for non-cpu power domain
- Show level 0 name for new api_version
- Prevent cpu clos config for non-cpu power domain
- Allow display non-cpu power domain info
- Display amx_p1 and cooling_type
- Display punit info
- Introduce TPMI interface support
- Get punit core mapping information
- Introduce api_version helper
- Support large clos_min/max
- Introduce is_debug_enabled()
- Allow api_version based platform callbacks
- Move send_mbox_cmd to isst-core-mbox.c
- Abstract adjust_uncore_freq
- Abstract read_pm_config
- Abstract clos_associate
- Abstract clos_get_assoc_status
- Abstract set_clos
- Abstract pm_get_clos
- Abstract pm_qos_config
- Abstract get_clos_information
- Abstract get_get_trls
- Enhance get_tdp_info
- Abstract get_uncore_p0_p1_info
- Abstract get_fact_info
- Abstract set_pbf_fact_status
- Remove isst_get_pbf_info_complete
- Abstract get_pbf_info
- Abstract set_tdp_level
- Abstract get_trl_bucket_info
- Abstract get_get_trl
- Abstract get_coremask_info
- Abstract get_tjmax_info
- Move code right before its caller
- Abstract get_pwr_info
- Abstract get_tdp_info
- Abstract get_ctdp_control
- Abstract get_config_levels
- Abstract is_punit_valid
- Introduce isst-core-mbox.c
- Always invoke isst_fill_platform_info
- Introduce isst_get_disp_freq_multiplier
- Move mbox functions to isst-core.c
- Improve isst_print_extended_platform_info
- Rename for_each_online_package_in_set
- Introduce support for multi-punit
- Introduce isst_is_punit_valid()
- Introduce punit to isst_id
- Follow TRL nameing for FACT info
- Unify TRL levels
wmi:
- Convert to platform remove callback returning void
x86-android-tablets:
- Add accelerometer support for Yoga Tablet 2 1050/830 series
- Add "yogabook-touch-kbd-digitizer-switch" pdev for Lenovo Yoga Book
- Add Wacom digitizer info for Lenovo Yoga Book
- Update Yoga Book HiDeep touchscreen comment
- Add Lenovo Yoga Book X90F/L data
- Share lp855x_platform_data between different models
- Use LP8557 in direct mode on both the Yoga 830 and the 1050
- Add depends on PMIC_OPREGION
- Lenovo Yoga Book match is for YB1-X91 models only
- Add LID switch support for Yoga Tablet 2 1050/830 series
- Add backlight ctrl for Lenovo Yoga Tab 3 Pro YT3-X90F
- Add touchscreen support for Lenovo Yoga Tab 3 Pro YT3-X90F
- Add support for the Dolby button on Peaq C1010
- Add gpio_keys support to x86_android_tablet_init()
- Move remaining tablets to other.c
- Move Lenovo tablets to their own file
- Move Asus tablets to their own file
- Move shared power-supply fw-nodes to a separate file
- Move DMI match table into its own dmi.c file
- Move core code into new core.c file
- Move into its own subdir
- Add Acer Iconia One 7 B1-750 data
x86/include/asm/msr-index.h:
- Add IFS Array test bits
xo1-rfkill:
- Convert to platform remove callback returning void
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEEuvA7XScYQRpenhd+kuxHeUQDJ9wFAmRGmK4UHGhkZWdvZWRl
QHJlZGhhdC5jb20ACgkQkuxHeUQDJ9yBCAf+PebzfccC2ABHq+nFGSok18beRtFf
fGs9NI21Mjdbhhy+KsKddgZceh7pbdcaIznuka3TZAK0UXcHRe30X3eoDvSCk9YW
Xj/Uf3ExsipNh1Ung+Q1qTWtzUw7XdJWqMZ5HxlUI2ZlmSTAIOyZBpSEPrK052oi
lAbSqrnB1DEh1qYV4Q7g71R82iAR791DAH1dsDZwC1Zb6KK6fxI/zQhw4JP1XSCs
htE5RFUzPWiXG2ou5t6Nteju/QqEaCoIS7z7ZK/SgWcLlPxeksxwso3obI/U8PvD
JMmMiY4VFzizuGqTZHiy/EtKXo1pq+fOcMEqSuaaDfcYgdHmLm0OIU12Ig==
=51xc
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver updates from Hans de Goede:
- AMD PMC and PMF drivers:
- Numerous bugfixes
- Intel Speed Select Technology (ISST):
- TPMI (Topology Aware Register and PM Capsule Interface) support
for ISST support on upcoming processor models
- Various other improvements / new hw support
- tools/intel-speed-select: TPMI support + other improvements
- Intel In Field Scan (IFS):
- Add Array Bist test support
- New drivers:
- intel_bytcrc_pwrsrc Crystal Cove PMIC pwrsrc / reset-reason driver
- lenovo-ymc Yoga Mode Control driver for reporting SW_TABLET_MODE
- msi-ec Driver for MSI laptop EC features like battery charging limits
- apple-gmux:
- Support for new MMIO based models (T2 Macs)
- Honor acpi_backlight= auto-detect-code + kernel cmdline option
to switch between gmux and apple_bl backlight drivers and remove
own custom handling for this
- x86-android-tablets: Refactor / cleanup + new hw support
- Miscellaneous other cleanups / fixes
* tag 'platform-drivers-x86-v6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (178 commits)
platform/x86: x86-android-tablets: Add accelerometer support for Yoga Tablet 2 1050/830 series
platform/x86: x86-android-tablets: Add "yogabook-touch-kbd-digitizer-switch" pdev for Lenovo Yoga Book
platform/x86: x86-android-tablets: Add Wacom digitizer info for Lenovo Yoga Book
platform/x86: x86-android-tablets: Update Yoga Book HiDeep touchscreen comment
platform/x86: thinkpad_acpi: Fix Embedded Controller access on X380 Yoga
platform/x86/intel/sdsi: Change mailbox timeout
platform/x86/intel/pmt: Ignore uninitialized entries
platform/x86: amd: pmc: provide user message where s0ix is not supported
platform/x86/amd: pmc: Fix memory leak in amd_pmc_stb_debugfs_open_v2()
mlxbf-bootctl: Add sysfs file for BlueField boot fifo
platform/x86: amd: pmc: Remove __maybe_unused from amd_pmc_suspend_handler()
platform/x86/intel/pmc/mtl: Put GNA/IPU/VPU devices in D3
platform/x86/amd: pmc: Move out of BIOS SMN pair for STB init
platform/x86/amd: pmc: Utilize SMN index 0 for driver probe
platform/x86/amd: pmc: Move idlemask check into `amd_pmc_idlemask_read`
platform/x86/amd: pmc: Don't dump data after resume from s0i3 on picasso
platform/x86/amd: pmc: Hide SMU version and program attributes for Picasso
platform/x86/amd: pmc: Don't try to read SMU version on Picasso
platform/x86/amd/pmf: Move out of BIOS SMN pair for driver probe
platform/x86: intel-uncore-freq: Add client processors
...
ACPI:
* Improve error reporting when failing to manage SDEI on AGDI device
removal
Assembly routines:
* Improve register constraints so that the compiler can make use of
the zero register instead of moving an immediate #0 into a GPR
* Allow the compiler to allocate the registers used for CAS
instructions
CPU features and system registers:
* Cleanups to the way in which CPU features are identified from the
ID register fields
* Extend system register definition generation to handle Enum types
when defining shared register fields
* Generate definitions for new _EL2 registers and add new fields
for ID_AA64PFR1_EL1
* Allow SVE to be disabled separately from SME on the kernel
command-line
Tracing:
* Support for "direct calls" in ftrace, which enables BPF tracing
for arm64
Kdump:
* Don't bother unmapping the crashkernel from the linear mapping,
which then allows us to use huge (block) mappings and reduce
TLB pressure when a crashkernel is loaded.
Memory management:
* Try again to remove data cache invalidation from the coherent DMA
allocation path
* Simplify the fixmap code by mapping at page granularity
* Allow the kfence pool to be allocated early, preventing the rest
of the linear mapping from being forced to page granularity
Perf and PMU:
* Move CPU PMU code out to drivers/perf/ where it can be reused
by the 32-bit ARM architecture when running on ARMv8 CPUs
* Fix race between CPU PMU probing and pKVM host de-privilege
* Add support for Apple M2 CPU PMU
* Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event
dynamically, depending on what the CPU actually supports
* Minor fixes and cleanups to system PMU drivers
Stack tracing:
* Use the XPACLRI instruction to strip PAC from pointers, rather
than rolling our own function in C
* Remove redundant PAC removal for toolchains that handle this in
their builtins
* Make backtracing more resilient in the face of instrumentation
Miscellaneous:
* Fix single-step with KGDB
* Remove harmless warning when 'nokaslr' is passed on the kernel
command-line
* Minor fixes and cleanups across the board
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmRChcwQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNCgBCADFvkYY9ESztSnd3EpiMbbAzgRCQBiA5H7U
F2Wc+hIWgeAeUEttSH22+F16r6Jb0gbaDvsuhtN2W/rwQhKNbCU0MaUME05MPmg2
AOp+RZb2vdT5i5S5dC6ZM6G3T6u9O78LBWv2JWBdd6RIybamEn+RL00ep2WAduH7
n1FgTbsKgnbScD2qd4K1ejZ1W/BQMwYulkNpyTsmCIijXM12lkzFlxWnMtky3uhR
POpawcIZzXvWI02QAX+SIdynGChQV3VP+dh9GuFbt7ASigDEhgunvfUYhZNSaqf4
+/q0O8toCtmQJBUhF0DEDSB5T8SOz5v9CKxKuwfaX6Trq0ixFQpZ
=78L9
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"ACPI:
- Improve error reporting when failing to manage SDEI on AGDI device
removal
Assembly routines:
- Improve register constraints so that the compiler can make use of
the zero register instead of moving an immediate #0 into a GPR
- Allow the compiler to allocate the registers used for CAS
instructions
CPU features and system registers:
- Cleanups to the way in which CPU features are identified from the
ID register fields
- Extend system register definition generation to handle Enum types
when defining shared register fields
- Generate definitions for new _EL2 registers and add new fields for
ID_AA64PFR1_EL1
- Allow SVE to be disabled separately from SME on the kernel
command-line
Tracing:
- Support for "direct calls" in ftrace, which enables BPF tracing for
arm64
Kdump:
- Don't bother unmapping the crashkernel from the linear mapping,
which then allows us to use huge (block) mappings and reduce TLB
pressure when a crashkernel is loaded.
Memory management:
- Try again to remove data cache invalidation from the coherent DMA
allocation path
- Simplify the fixmap code by mapping at page granularity
- Allow the kfence pool to be allocated early, preventing the rest of
the linear mapping from being forced to page granularity
Perf and PMU:
- Move CPU PMU code out to drivers/perf/ where it can be reused by
the 32-bit ARM architecture when running on ARMv8 CPUs
- Fix race between CPU PMU probing and pKVM host de-privilege
- Add support for Apple M2 CPU PMU
- Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event
dynamically, depending on what the CPU actually supports
- Minor fixes and cleanups to system PMU drivers
Stack tracing:
- Use the XPACLRI instruction to strip PAC from pointers, rather than
rolling our own function in C
- Remove redundant PAC removal for toolchains that handle this in
their builtins
- Make backtracing more resilient in the face of instrumentation
Miscellaneous:
- Fix single-step with KGDB
- Remove harmless warning when 'nokaslr' is passed on the kernel
command-line
- Minor fixes and cleanups across the board"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege
arm64: kexec: include reboot.h
arm64: delete dead code in this_cpu_set_vectors()
arm64/cpufeature: Use helper macro to specify ID register for capabilites
drivers/perf: hisi: add NULL check for name
drivers/perf: hisi: Remove redundant initialized of pmu->name
arm64/cpufeature: Consistently use symbolic constants for min_field_value
arm64/cpufeature: Pull out helper for CPUID register definitions
arm64/sysreg: Convert HFGITR_EL2 to automatic generation
ACPI: AGDI: Improve error reporting for problems during .remove()
arm64: kernel: Fix kernel warning when nokaslr is passed to commandline
perf/arm-cmn: Fix port detection for CMN-700
arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step
arm64: move PAC masks to <asm/pointer_auth.h>
arm64: use XPACLRI to strip PAC
arm64: avoid redundant PAC stripping in __builtin_return_address()
arm64/sme: Fix some comments of ARM SME
arm64/signal: Alloc tpidr2 sigframe after checking system_supports_tpidr2()
arm64/signal: Use system_supports_tpidr2() to check TPIDR2
arm64/idreg: Don't disable SME when disabling SVE
...
These are various cleanups, fixing a number of uapi header files to no
longer reference CONFIG_* symbols, and one patch that introduces the
new CONFIG_HAS_IOPORT symbol for architectures that provide working
inb()/outb() macros, as a preparation for adding driver dependencies
on those in the following release.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmRG8IkACgkQYKtH/8kJ
Uid15Q/9E/neIIEqEk6IvtyhUicrJiIZUM0rGoYtWXiz75ggk6Kx9+3I+j8zIQ/E
kf2TzAG7q9Md7nfTDFLr4FSr0IcNDj+VG4nYxUyDHdKGcARO+g9Kpdvscxip3lgU
Rw5w74Gyd30u4iUKGS39OYuxcCgl9LaFjMA9Gh402Oiaoh+OYLmgQS9h/goUD5KN
Nd+AoFvkdbnHl0/SpxthLRyL5rFEATBmAY7apYViPyMvfjS3gfDJwXJR9jkKgi6X
Qs4t8Op8BA3h84dCuo6VcFqgAJs2Wiq3nyTSUnkF8NxJ2RFTpeiVgfsLOzXHeDgz
SKDB4Lp14o3mlyZyj00MWq1uMJRRetUgNiVb6iHOoKQ/E4demBdh+mhIFRybjM5B
XNTWFcg9PWFCMa4W9jnLfZBc881X4+7T+qUF8I0W/1AbRJUmyGj8HO6jLceC4yGD
UYLn5oFPM6OWXHp6DqJrCr9Yw8h6fuviQZFEbl/ARlgVGt+J4KbYweJYk8DzfX6t
PZIj8LskOqyIpRuC2oDA1PHxkaJ1/z+N5oRBHq1uicSh4fxY5HW7HnyzgF08+R3k
cf+fjAhC3TfGusHkBwQKQJvpxrxZjPuvYXDZ0GxTvNKJRB8eMeiTm1n41E5oTVwQ
swSblSCjZj/fMVVPXLcjxEW4SBNWRxa9Lz3tIPXb3RheU10Lfy8=
=H3k4
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"These are various cleanups, fixing a number of uapi header files to no
longer reference CONFIG_* symbols, and one patch that introduces the
new CONFIG_HAS_IOPORT symbol for architectures that provide working
inb()/outb() macros, as a preparation for adding driver dependencies
on those in the following release"
* tag 'asm-generic-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
Kconfig: introduce HAS_IOPORT option and select it as necessary
scripts: Update the CONFIG_* ignore list in headers_install.sh
pktcdvd: Remove CONFIG_CDROM_PKTCDVD_WCACHE from uapi header
Move bp_type_idx to include/linux/hw_breakpoint.h
Move ep_take_care_of_epollwakeup() to fs/eventpoll.c
Move COMPAT_ATM_ADDPARTY to net/atm/svc.c
- Fix the incorrect handling of atomic offset updates in
reserve_eilvt_offset()
The check for the return value of atomic_cmpxchg() is not compared
against the old value, it is compared against the new value, which
makes it two round on success.
Convert it to atomic_try_cmpxchg() which does the right thing.
- Handle IO/APIC less systems correctly
When IO/APIC is not advertised by ACPI then the computation of the lower
bound for dynamically allocated interrupts like MSI goes wrong.
This lower bound is used to exclude the IO/APIC legacy GSI space as that
must stay reserved for the legacy interrupts.
In case that the system, e.g. VM, does not advertise an IO/APIC the
lower bound stays at 0.
0 is an invalid interrupt number except for the legacy timer interrupt
on x86. The return value is unchecked in the core code, so it ends up
to allocate interrupt number 0 which is subsequently considered to be
invalid by the caller, e.g. the MSI allocation code.
A similar problem was already cured for device tree based systems years
ago, but that missed - or did not envision - the zero IO/APIC case.
Consolidate the zero check and return the provided "from" argument to the
core code call site, which is guaranteed to be greater than 0.
- Simplify the X2APIC cluster CPU mask logic for CPU hotplug
Per cluster CPU masks are required for X2APIC in cluster mode to
determine the correct cluster for a target CPU when calculating the
destination for IPIs
These masks are established when CPUs are borught up. The first CPU in a
cluster must allocate a new cluster CPU mask. As this happens during the
early startup of a CPU, where memory allocations cannot be done, the
mask has to be allocated by the control CPU.
The current implementation allocates a clustermask just in case and if
the to be brought up CPU is the first in a cluster the CPU takes over
this allocation from a global pointer.
This works nicely in the fully serialized CPU bringup scenario which is
used today, but would fail completely for parallel bringup of CPUs.
The cluster association of a CPU can be computed from the APIC ID which
is enumerated by ACPI/MADT.
So the cluster CPU masks can be preallocated and associated upfront and
the upcoming CPUs just need to set their corresponding bit.
Aside of preparing for parallel bringup this is a valuable
simplification on its own.
- Remove global variables which control the early startup of secondary
CPUs on 64-bit
The only information which is needed by a starting CPU is the Linux CPU
number. The CPU number allows it to retrieve the rest of the required
data from already existing per CPU storage.
So instead of initial_stack, early_gdt_desciptor and initial_gs provide
a new variable smpboot_control which contains the Linux CPU number for
now. The starting CPU can retrieve and compute all required information
for startup from there.
Aside of being a cleanup, this is also preparing for parallel CPU
bringup, where starting CPUs will look up their Linux CPU number via the
APIC ID, when smpboot_control has the corresponding control bit set.
- Make cc_vendor globally accesible
Subsequent parallel bringup changes require access to cc_vendor because
confidental computing platforms need special treatment in the early
startup phase vs. CPUID and APCI ID readouts.
The change makes cc_vendor global and provides stub accessors in case
that CONFIG_ARCH_HAS_CC_PLATFORM is not set.
This was merged from the x86/cc branch in anticipation of further
parallel bringup commits which require access to cc_vendor. Due to late
discoveries of fundamental issue with those patches these commits never
happened.
The merge commit is unfortunately in the middle of the APIC commits so
unraveling it would have required a rebase or revert. As the parallel
bringup seems to be well on its way for 6.5 this would be just pointless
churn. As the commit does not contain any functional change it's not a
risk to keep it.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRGuAwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRzSEADEx1sVkd2yrLcTYdpjdKbbUaDJ6lR0
DXxIP3+ApGHmV9l9yIh+/5C2oEJsiUfFf1vdh6ajv5iXpksCKzcUzkW5g3w7nM36
CSpULpFjwvaq8TIo0o1PIhAbo/yIMMzJVDs8R0reCnWgGAWZoW/a9Ndcvcicd0an
pQAlkw3FD5r92mcMlKPNWFoui1AkScGEV02zJ7884MAukmBZwD8Jd+gE6eQC9GKa
9hyJiB77st1URl+a0cPsPYvv8RLVuVcljWsh2edyvxgovIO56+BoEjbrgRSF6cqQ
Bhzo//3KgbUJ1y+YqH01aKZzY0hRpbAi2Rew4RBKcBKwCGd2qltUQG0LFNxAtV83
RsC573wSCGSCGO5Xb1RVXih5is+9YqMqitJNWvEc15jjOA9nwoLc80axP11v42f9
Xl4iGHQTWVGdxT4H22NH7UCuRlGg38vAx+In2HGpN/e57q2ighESjiGuqQAQpLel
pbOeJtQ/D2xXVKcCap4T/P/2x5ls7bsc76MWJBMcYC3pRgJ5M7ZHw7wTw0IAty4x
xCfR1bsRVEAhrE9r/odgNipXjBJu+CdGBAupNEIiRyq1QiwUKtMTayasRGUlbYO6
vrieHKqoflzRVg2M9Bgm3oI28X27FzZHWAZJW2oJ2Wnn2jL5kuRJa1nEykqo8pEP
j6rjnScRVvdpIw==
=IQWG
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
- Fix the incorrect handling of atomic offset updates in
reserve_eilvt_offset()
The check for the return value of atomic_cmpxchg() is not compared
against the old value, it is compared against the new value, which
makes it two round on success.
Convert it to atomic_try_cmpxchg() which does the right thing.
- Handle IO/APIC less systems correctly
When IO/APIC is not advertised by ACPI then the computation of the
lower bound for dynamically allocated interrupts like MSI goes wrong.
This lower bound is used to exclude the IO/APIC legacy GSI space as
that must stay reserved for the legacy interrupts.
In case that the system, e.g. VM, does not advertise an IO/APIC the
lower bound stays at 0.
0 is an invalid interrupt number except for the legacy timer
interrupt on x86. The return value is unchecked in the core code, so
it ends up to allocate interrupt number 0 which is subsequently
considered to be invalid by the caller, e.g. the MSI allocation code.
A similar problem was already cured for device tree based systems
years ago, but that missed - or did not envision - the zero IO/APIC
case.
Consolidate the zero check and return the provided "from" argument to
the core code call site, which is guaranteed to be greater than 0.
- Simplify the X2APIC cluster CPU mask logic for CPU hotplug
Per cluster CPU masks are required for X2APIC in cluster mode to
determine the correct cluster for a target CPU when calculating the
destination for IPIs
These masks are established when CPUs are borught up. The first CPU
in a cluster must allocate a new cluster CPU mask. As this happens
during the early startup of a CPU, where memory allocations cannot be
done, the mask has to be allocated by the control CPU.
The current implementation allocates a clustermask just in case and
if the to be brought up CPU is the first in a cluster the CPU takes
over this allocation from a global pointer.
This works nicely in the fully serialized CPU bringup scenario which
is used today, but would fail completely for parallel bringup of
CPUs.
The cluster association of a CPU can be computed from the APIC ID
which is enumerated by ACPI/MADT.
So the cluster CPU masks can be preallocated and associated upfront
and the upcoming CPUs just need to set their corresponding bit.
Aside of preparing for parallel bringup this is a valuable
simplification on its own.
- Remove global variables which control the early startup of secondary
CPUs on 64-bit
The only information which is needed by a starting CPU is the Linux
CPU number. The CPU number allows it to retrieve the rest of the
required data from already existing per CPU storage.
So instead of initial_stack, early_gdt_desciptor and initial_gs
provide a new variable smpboot_control which contains the Linux CPU
number for now. The starting CPU can retrieve and compute all
required information for startup from there.
Aside of being a cleanup, this is also preparing for parallel CPU
bringup, where starting CPUs will look up their Linux CPU number via
the APIC ID, when smpboot_control has the corresponding control bit
set.
- Make cc_vendor globally accesible
Subsequent parallel bringup changes require access to cc_vendor
because confidental computing platforms need special treatment in the
early startup phase vs. CPUID and APCI ID readouts.
The change makes cc_vendor global and provides stub accessors in case
that CONFIG_ARCH_HAS_CC_PLATFORM is not set.
This was merged from the x86/cc branch in anticipation of further
parallel bringup commits which require access to cc_vendor. Due to
late discoveries of fundamental issue with those patches these
commits never happened.
The merge commit is unfortunately in the middle of the APIC commits
so unraveling it would have required a rebase or revert. As the
parallel bringup seems to be well on its way for 6.5 this would be
just pointless churn. As the commit does not contain any functional
change it's not a risk to keep it.
* tag 'x86-apic-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioapic: Don't return 0 from arch_dynirq_lower_bound()
x86/apic: Fix atomic update of offset in reserve_eilvt_offset()
x86/coco: Export cc_vendor
x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()
x86/smpboot: Remove initial_gs
x86/smpboot: Remove early_gdt_descr on 64-bit
x86/smpboot: Remove initial_stack on 64-bit
x86/apic/x2apic: Allow CPU cluster_mask to be populated in parallel
- Improve the VDSO build time checks to cover all dynamic relocations
VDSO does not allow dynamic relcations, but the build time check is
incomplete and fragile.
It's based on architectures specifying the relocation types to search
for and does not handle R_*_NONE relocation entries correctly.
R_*_NONE relocations are injected by some GNU ld variants if they fail
to determine the exact .rel[a]/dyn_size to cover trailing zeros.
R_*_NONE relocations must be ignored by dynamic loaders, so they
should be ignored in the build time check too.
Remove the architecture specific relocation types to check for and
validate strictly that no other relocations than R_*_NONE end up
in the VSDO .so file.
- Prefer signal delivery to the current thread for
CLOCK_PROCESS_CPUTIME_ID based posix-timers
Such timers prefer to deliver the signal to the main thread of a
process even if the context in which the timer expires is the current
task. This has the downside that it might wake up an idle thread.
As there is no requirement or guarantee that the signal has to be
delivered to the main thread, avoid this by preferring the current
task if it is part of the thread group which shares sighand.
This not only avoids waking idle threads, it also distributes the
signal delivery in case of multiple timers firing in the context
of different threads close to each other better.
- Align the tick period properly (again)
For a long time the tick was starting at CLOCK_MONOTONIC zero, which
allowed users space applications to either align with the tick or to
place a periodic computation so that it does not interfere with the
tick. The alignement of the tick period was more by chance than by
intention as the tick is set up before a high resolution clocksource is
installed, i.e. timekeeping is still tick based and the tick period
advances from there.
The early enablement of sched_clock() broke this alignement as the time
accumulated by sched_clock() is taken into account when timekeeping is
initialized. So the base value now(CLOCK_MONOTONIC) is not longer a
multiple of tick periods, which breaks applications which relied on
that behaviour.
Cure this by aligning the tick starting point to the next multiple of
tick periods, i.e 1000ms/CONFIG_HZ.
- A set of NOHZ fixes and enhancements
- Cure the concurrent writer race for idle and IO sleeptime statistics
The statitic values which are exposed via /proc/stat are updated from
the CPU local idle exit and remotely by cpufreq, but that happens
without any form of serialization. As a consequence sleeptimes can be
accounted twice or worse.
Prevent this by restricting the accumulation writeback to the CPU
local idle exit and let the remote access compute the accumulated
value.
- Protect idle/iowait sleep time with a sequence count
Reading idle/iowait sleep time, e.g. from /proc/stat, can race with
idle exit updates. As a consequence the readout may result in random
and potentially going backwards values.
Protect this by a sequence count, which fixes the idle time
statistics issue, but cannot fix the iowait time problem because
iowait time accounting races with remote wake ups decrementing the
remote runqueues nr_iowait counter. The latter is impossible to fix,
so the only way to deal with that is to document it properly and to
remove the assertion in the selftest which triggers occasionally due
to that.
- Restructure struct tick_sched for better cache layout
- Some small cleanups and a better cache layout for struct tick_sched
- Implement the missing timer_wait_running() callback for POSIX CPU timers
For unknown reason the introduction of the timer_wait_running() callback
missed to fixup posix CPU timers, which went unnoticed for almost four
years.
While initially only targeted to prevent livelocks between a timer
deletion and the timer expiry function on PREEMPT_RT enabled kernels, it
turned out that fixing this for mainline is not as trivial as just
implementing a stub similar to the hrtimer/timer callbacks.
The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled systems
there is a livelock issue independent of RT.
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU timers
out from hard interrupt context to task work, which is handled before
returning to user space or to a VM. The expiry mechanism moves the
expired timers to a stack local list head with sighand lock held. Once
sighand is dropped the task can be preempted and a task which wants to
delete a timer will spin-wait until the expiry task is scheduled back
in. In the worst case this will end up in a livelock when the preempting
task and the expiry task are pinned on the same CPU.
The timer wheel has a timer_wait_running() mechanism for RT, which uses
a per CPU timer-base expiry lock which is held by the expiry code and the
task waiting for the timer function to complete blocks on that lock.
This does not work in the same way for posix CPU timers as there is no
timer base and expiry for process wide timers can run on any task
belonging to that process, but the concept of waiting on an expiry lock
can be used too in a slightly different way.
Add a per task mutex to struct posix_cputimers_work, let the expiry task
hold it accross the expiry function and let the deleting task which
waits for the expiry to complete block on the mutex.
In the non-contended case this results in an extra mutex_lock()/unlock()
pair on both sides.
This avoids spin-waiting on a task which is scheduled out, prevents the
livelock and cures the problem for RT and !RT systems.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRGrj4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZhdEAC/lwfDWCnTXHC8ExQQRDIVNyXmDlLb
EHB8ZY7Wc4gNZ8UEXEOLOXJHMG9bsbtPGctVewJwRGnXZWKVhpPwQba6kCRycyX0
0J6l5DlvUaGGrpoOzOZwgETRmtIZE9tEArZR8xlfRScYd93a7yLhwIjO8JaV9vKs
IQpAQMeJ/ysp6gHrS59qakYfoHU/ERUAu3Tk4GqHUtPtcyz3nX3eTlLWV8LySqs+
00qr2yc0bQFUFoKzTCxtM8lcEi9ja9SOj1rw28348O+BXE4d0HC12Ie7eU/CDN2Y
OAlWYxVjy4LMh24LDrRQKTzoVqx9MXDx2g+09B3t8NK5LgeS+EJIjujDhZF147/H
5y906nplZUKa8BiZW5Rpm/HKH8tFI80T9XWSQCRBeMgTEJyRyRU1yASAwO4xw+dY
Dn3tGmFGymcV/72o4ic9JFKQd8cTSxPjEJS3qqzMkEAtyI/zPBmKxj/Tce50OH40
6FSZq1uU21ZQzszwSHISwgFtNr75laUSK4Z1te5OhPOOz+C7O9YqHvqS/1jwhPj2
tMd8X17fRW3UTUBlBj+zqxqiEGBl/Yk2AvKrJIXGUtfWYCtjMJ7ieCf0kZ7NSVJx
9ewubA0gqseMD783YomZsy8LLtMKnhclJeslUOVb1oKs1q/WF1R/k6qjy9vUwYaB
nIJuHl8mxSetag==
=SVnj
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timers and timekeeping updates from Thomas Gleixner:
- Improve the VDSO build time checks to cover all dynamic relocations
VDSO does not allow dynamic relocations, but the build time check is
incomplete and fragile.
It's based on architectures specifying the relocation types to search
for and does not handle R_*_NONE relocation entries correctly.
R_*_NONE relocations are injected by some GNU ld variants if they
fail to determine the exact .rel[a]/dyn_size to cover trailing zeros.
R_*_NONE relocations must be ignored by dynamic loaders, so they
should be ignored in the build time check too.
Remove the architecture specific relocation types to check for and
validate strictly that no other relocations than R_*_NONE end up in
the VSDO .so file.
- Prefer signal delivery to the current thread for
CLOCK_PROCESS_CPUTIME_ID based posix-timers
Such timers prefer to deliver the signal to the main thread of a
process even if the context in which the timer expires is the current
task. This has the downside that it might wake up an idle thread.
As there is no requirement or guarantee that the signal has to be
delivered to the main thread, avoid this by preferring the current
task if it is part of the thread group which shares sighand.
This not only avoids waking idle threads, it also distributes the
signal delivery in case of multiple timers firing in the context of
different threads close to each other better.
- Align the tick period properly (again)
For a long time the tick was starting at CLOCK_MONOTONIC zero, which
allowed users space applications to either align with the tick or to
place a periodic computation so that it does not interfere with the
tick. The alignement of the tick period was more by chance than by
intention as the tick is set up before a high resolution clocksource
is installed, i.e. timekeeping is still tick based and the tick
period advances from there.
The early enablement of sched_clock() broke this alignement as the
time accumulated by sched_clock() is taken into account when
timekeeping is initialized. So the base value now(CLOCK_MONOTONIC) is
not longer a multiple of tick periods, which breaks applications
which relied on that behaviour.
Cure this by aligning the tick starting point to the next multiple of
tick periods, i.e 1000ms/CONFIG_HZ.
- A set of NOHZ fixes and enhancements:
* Cure the concurrent writer race for idle and IO sleeptime
statistics
The statitic values which are exposed via /proc/stat are updated
from the CPU local idle exit and remotely by cpufreq, but that
happens without any form of serialization. As a consequence
sleeptimes can be accounted twice or worse.
Prevent this by restricting the accumulation writeback to the CPU
local idle exit and let the remote access compute the accumulated
value.
* Protect idle/iowait sleep time with a sequence count
Reading idle/iowait sleep time, e.g. from /proc/stat, can race
with idle exit updates. As a consequence the readout may result
in random and potentially going backwards values.
Protect this by a sequence count, which fixes the idle time
statistics issue, but cannot fix the iowait time problem because
iowait time accounting races with remote wake ups decrementing
the remote runqueues nr_iowait counter. The latter is impossible
to fix, so the only way to deal with that is to document it
properly and to remove the assertion in the selftest which
triggers occasionally due to that.
* Restructure struct tick_sched for better cache layout
* Some small cleanups and a better cache layout for struct
tick_sched
- Implement the missing timer_wait_running() callback for POSIX CPU
timers
For unknown reason the introduction of the timer_wait_running()
callback missed to fixup posix CPU timers, which went unnoticed for
almost four years.
While initially only targeted to prevent livelocks between a timer
deletion and the timer expiry function on PREEMPT_RT enabled kernels,
it turned out that fixing this for mainline is not as trivial as just
implementing a stub similar to the hrtimer/timer callbacks.
The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled
systems there is a livelock issue independent of RT.
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU
timers out from hard interrupt context to task work, which is handled
before returning to user space or to a VM. The expiry mechanism moves
the expired timers to a stack local list head with sighand lock held.
Once sighand is dropped the task can be preempted and a task which
wants to delete a timer will spin-wait until the expiry task is
scheduled back in. In the worst case this will end up in a livelock
when the preempting task and the expiry task are pinned on the same
CPU.
The timer wheel has a timer_wait_running() mechanism for RT, which
uses a per CPU timer-base expiry lock which is held by the expiry
code and the task waiting for the timer function to complete blocks
on that lock.
This does not work in the same way for posix CPU timers as there is
no timer base and expiry for process wide timers can run on any task
belonging to that process, but the concept of waiting on an expiry
lock can be used too in a slightly different way.
Add a per task mutex to struct posix_cputimers_work, let the expiry
task hold it accross the expiry function and let the deleting task
which waits for the expiry to complete block on the mutex.
In the non-contended case this results in an extra
mutex_lock()/unlock() pair on both sides.
This avoids spin-waiting on a task which is scheduled out, prevents
the livelock and cures the problem for RT and !RT systems
* tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-cpu-timers: Implement the missing timer_wait_running callback
selftests/proc: Assert clock_gettime(CLOCK_BOOTTIME) VS /proc/uptime monotonicity
selftests/proc: Remove idle time monotonicity assertions
MAINTAINERS: Remove stale email address
timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()
timers/nohz: Add a comment about broken iowait counter update race
timers/nohz: Protect idle/iowait sleep time under seqcount
timers/nohz: Only ever update sleeptime from idle exit
timers/nohz: Restructure and reshuffle struct tick_sched
tick/common: Align tick period with the HZ tick.
selftests/timers/posix_timers: Test delivery of signals across threads
posix-timers: Prefer delivery of signals to the current thread
vdso: Improve cmd_vdso_check to check all dynamic relocations
SEV-SNP vTOM guest on Hyper-V. A vTOM guest basically splits the
address space in two parts: encrypted and unencrypted. The use case
being running unmodified guests on the Hyper-V confidential computing
hypervisor
- Double-buffer messages between the guest and the hardware PSP device
so that no partial buffers are copied back'n'forth and thus potential
message integrity and leak attacks are possible
- Name the return value the sev-guest driver returns when the hw PSP
device hasn't been called, explicitly
- Cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmRGl8gACgkQEsHwGGHe
VUoEDhAAiw4+2nZR7XUJ7pewlXG7AJJZsVIpzzcF6Gyymn0LFCyMnP7O3snmFqzz
aik0q2LzWrmDQ3Nmmzul0wtdsuW7Nik6BP9oF3WnB911+gGbpXyNWZ8EhOPNzkUR
9D8Sp6f0xmqNE3YuzEpanufiDswgUxi++DRdmIRAs1TTh4bfUFWZcib1pdwoqSmR
oS3UfVwVZ4Ee2Qm1f3n3XQ0FUpsjWeARPExUkLEvd8XeonTP+6aGAdggg9MnPcsl
3zpSmOpuZ6VQbDrHxo3BH9HFuIUOd6S9PO++b9F6WxNPGEMk7fHa7ahOA6HjhgVz
5Da3BN16OS9j64cZsYHMPsBcd+ja1YmvvZGypsY0d6X4d3M1zTPW+XeLbyb+VFBy
SvA7z+JuxtLKVpju65sNiJWw8ZDTSu+eEYNDeeGLvAj3bxtclJjcPdMEPdzxmC5K
eAhmRmiFuVM4nXMAR6cspVTsxvlTHFtd5gdm6RlRnvd7aV77Zl1CLzTy8IHTVpvI
t7XTbtjEjYc0pI6cXXptHEOnBLjXUMPcqgGFgJYEauH6EvrxoWszUZD0tS3Hw80A
K+Rwnc70ubq/PsgZcF4Ayer1j49z1NPfk5D4EA7/ChN6iNhQA8OqHT1UBrHAgqls
2UAwzE2sQZnjDvGZghlOtFIQUIhwue7m93DaRi19EOdKYxVjV6U=
=ZAw9
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Add the necessary glue so that the kernel can run as a confidential
SEV-SNP vTOM guest on Hyper-V. A vTOM guest basically splits the
address space in two parts: encrypted and unencrypted. The use case
being running unmodified guests on the Hyper-V confidential computing
hypervisor
- Double-buffer messages between the guest and the hardware PSP device
so that no partial buffers are copied back'n'forth and thus potential
message integrity and leak attacks are possible
- Name the return value the sev-guest driver returns when the hw PSP
device hasn't been called, explicitly
- Cleanups
* tag 'x86_sev_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hyperv: Change vTOM handling to use standard coco mechanisms
init: Call mem_encrypt_init() after Hyper-V hypercall init is done
x86/mm: Handle decryption/re-encryption of bss_decrypted consistently
Drivers: hv: Explicitly request decrypted in vmap_pfn() calls
x86/hyperv: Reorder code to facilitate future work
x86/ioremap: Add hypervisor callback for private MMIO mapping in coco VM
x86/sev: Change snp_guest_issue_request()'s fw_err argument
virt/coco/sev-guest: Double-buffer messages
crypto: ccp: Get rid of __sev_platform_init_locked()'s local function pointer
crypto: ccp - Name -1 return value as SEV_RET_NO_FW_CALL
-fzero-call-used-regs builds from zeroing live registers because
paravirt hides the CALLs from the compiler so latter doesn't know
there's a CALL in the first place
- Merge two paravirt callbacks into one, as their functionality is
identical
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmRGjhwACgkQEsHwGGHe
VUpRUA//QduAqsDoIJ1U/y7JxbLriFR9X9rCY4u5pq2ZSroQZ861eBeUCJnlc+MQ
+zlAsmGkLpb9b5be85vMByz++rO4ZTfBVamqJORD9Zj99RY0F7ym/HYXK4CP6J+E
IrgMTTLPd8kMH/5Pb/tiNXOuKnNw99MdKhE5CPKGnvtM7eSzLrIuN9sHAUb9SMuV
l+TWYh4vkf8+XfzSpp5WYaCgyDBN8tMWD3cBeLTljT3OEOh9vIQYWjRliKQyxjWG
FJ8BnL8Nx+3kDkRjHyK4/h0P0KQYB6hnRSOrZyaae2H3N7uSMQbcLuRC6aXz1amm
9AKoubhzx/A5hwGx8jKtGuLCkEtSakdcbiF0l3gek3Auecxcg6x8W+cCNvpq8FGV
DJ349RPqR7TlKJwyvPp7dHRozVrY2sdbWZILxLhKDvAoOR4F927dt9+A96glc5dP
VTnrlptj1vX+dSkKgKRTmPUKbsXM2h003qTiAUVzjMP0PcKUKknpBhz7kLQ3gpFc
7rxyjHWANQJpY39WHvuIv+pzVUodrUGioA1LcEisx8FCM/iAIoejLi+ybbRMyc/2
NN3TMxoEl3RIQCOFgsM8NxAvOL9P6+82NiM+0v0TgzszMlso7RzbjBeaaWRtxX+O
82p9mTLDQuxESkA0HEwoTQa/xfO51zCi+SeLfhFO6A4s93Sjjb0=
=Eu5f
-----END PGP SIGNATURE-----
Merge tag 'x86_paravirt_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt updates from Borislav Petkov:
- Convert a couple of paravirt callbacks to asm to prevent
'-fzero-call-used-regs' builds from zeroing live registers because
paravirt hides the CALLs from the compiler so latter doesn't know
there's a CALL in the first place
- Merge two paravirt callbacks into one, as their functionality is
identical
* tag 'x86_paravirt_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/paravirt: Convert simple paravirt functions to asm
x86/paravirt: Merge activate_mm() and dup_mmap() callbacks
- Finally use a CPUID bit for split lock detection instead of
enumerating every model
- Make sure automatic IBRS is set on AMD, even though the AP bringup
code does that now by replicating the MSR which contains the switch
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmRGiUkACgkQEsHwGGHe
VUrjgw/7BnRvmgdSJg//TwlCnbnYCHbUzPbCfnMK8W6C5OvoRR+VYxeu3DoI/dsx
xW2lMR/Svf30orB3EQTnpOBNa3PPbDlQvqInM+bQ/TYb5F6yIAnRkQhD9OaIQkeM
CwX68pPcEPXCY+Ds2RmV6K2UvzIG5vVeYg6O36FVYUvON1tHFadEAT//lAMVspOs
HBbhEOpu6/zHoKr53cduT2P9i7SAjCIjPRSMpuIfCd3RNcjwqWEXCyXxNad6LrTc
Nd+xNjUpcRecl2bR41bIrpTfGGaU2XOJI2GiFfH/mBP8WNSP4Npp3LQVI35bDwLP
VYr2IRGySxTerLSV2v8UwBSYw/hVltq5TkHyqjNaQB5JKbhxnH67GLV2LeOxawGz
OogxcPF7RrVmr/c3ji4FE/QQlTbHczIRaSjNOYupHNNcQP5NrxVHWCNZRKX8ljh1
Ah1G3s5vEVigzgqnMX8ey4xBpMtL4bilT2mMwh5hY2XMY3QjgrXLg+73VkvBkM6Y
MjreNrUoGSC7Qw39rXtUfgRBMCB16CfFSsxPS4Isu6JwlNpJzOOifVdTdE4flOrd
HR0ac776WKO9KJrPvnxYNf5mHRWkUWPS7t04BvkHuzOzxQQHz51A0xwh7td0kZA9
vozSbxKE91sH0XD73x/H/EA/TGpWwYq7DQIYJOxCu1juq1ku7lM=
=QquZ
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu model updates from Borislav Petkov:
- Add Emerald Rapids to the list of Intel models supporting PPIN
- Finally use a CPUID bit for split lock detection instead of
enumerating every model
- Make sure automatic IBRS is set on AMD, even though the AP bringup
code does that now by replicating the MSR which contains the switch
* tag 'x86_cpu_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add Xeon Emerald Rapids to list of CPUs that support PPIN
x86/split_lock: Enumerate architectural split lock disable bit
x86/CPU/AMD: Make sure EFER[AIBRSE] is set
an objtool fix and use proper size for a bitmap
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmRGhG8ACgkQEsHwGGHe
VUr9+xAAs5D7hCYfKVsdNiiWh+pBRnrnOjfCcRb0cvIIyE4DKvLbGJoTjsRN7WuT
1ExnnjGL9CJHyqnJQj8/M0AdNgh28fOpJInzE3k7cDckZHQQp4cYDLQ2x9uAWvVl
lNmOKTmVXq97hZw6maSTm4iFWDTzbdLveIETjlVWvjWVomkm9KI6/3HZN2qjzxJP
IeCZ20lpZAg94/rMf6FqhuY1gaSa2yeXqz+wO19A5LBNhRHm60bFS2h48GiACxsV
JD5jDPVsTozAGxyNxKe1DerzH4NQCBax9bzjW7TvAGqNLPamLPg5npGdrAg9SdD8
yQ6F9TiUSU1jJfh3NqA7TxOcCtSr36xUrDJaOiMnVr68qi6kBnFsQ+Hxx3NwvCsU
6304wESm+j1rJ1DwtKOrguIVZ+nI+s6I/ubki4wjxa7zZZqZ7/daNM/3j9Wl7Vng
pd/augpcPR5+FNmU2Zq47ZK3kgqxjpEFpByjOChYclHGWZ4Jk717K7kf7CD424WM
VU590ffXLQCN/pcPkDo4Rxj5LVkaXqocWfOfr5uB0XIPjP+wjsjtJF+Mi/phO23O
dWFzI00GJZqKMehV07eCKaGoxVko9P/FxG8WdxLfw4BfmaOQGBB0O4m5tRvyPdjB
Ezm0ZUbuLl3zmHmfM1ZCPRkZ532I/IAF88VcIygmYRUr78w6mfw=
=e2hz
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Just cleanups and fixes this time around: make threshold_ktype const,
an objtool fix and use proper size for a bitmap
* tag 'ras_core_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/MCE/AMD: Use an u64 for bank_map
x86/mce: Always inline old MCA stubs
x86/MCE/AMD: Make kobj_type structure constant
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYCQAAKCRBZ7Krx/gZQ
64FdAQDZ2hTDyZEWPt486dWYPYpiKyaGFXSXDGo7wgP0fiwxXQEA/mROKb6JqYw6
27mZ9A7qluT8r3AfTTQ0D+Yse/dr4AM=
=GA9W
-----END PGP SIGNATURE-----
Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fget updates from Al Viro:
"fget() to fdget() conversions"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fuse_dev_ioctl(): switch to fdget()
cgroup_get_from_fd(): switch to fdget_raw()
bpf: switch to fdget_raw()
build_mount_idmapped(): switch to fdget()
kill the last remaining user of proc_ns_fget()
SVM-SEV: convert the rest of fget() uses to fdget() in there
convert sgx_set_attribute() to fdget()/fdput()
convert setns(2) to fdget()/fdput()
still a fair amount going on, including:
- Reorganizing the architecture-specific documentation under
Documentation/arch. This makes the structure match the source directory
and helps to clean up the mess that is the top-level Documentation
directory a bit. This work creates the new directory and moves x86 and
most of the less-active architectures there. The current plan is to move
the rest of the architectures in 6.5, with the patches going through the
appropriate subsystem trees.
- Some more Spanish translations and maintenance of the Italian
translation.
- A new "Kernel contribution maturity model" document from Ted.
- A new tutorial on quickly building a trimmed kernel from Thorsten.
Plus the usual set of updates and fixes.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmRGze0PHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5Y/VsH/RyWqinorRVFZmHqRJMRhR0j7hE2pAgK5prE
dGXYVtHHNQ+25thNaqhZTOLYFbSX6ii2NG7sLRXmyOTGIZrhUCFFXCHkuq4ZUypR
gJpMUiKQVT4dhln3gIZ0k09NSr60gz8UTcq895N9UFpUdY1SCDhbCcLc4uXTRajq
NrdgFaHWRkPb+gBRbXOExYm75DmCC6Ny5AyGo2rXfItV//ETjWIJVQpJhlxKrpMZ
3LgpdYSLhEFFnFGnXJ+EAPJ7gXDi2Tg5DuPbkvJyFOTouF3j4h8lSS9l+refMljN
xNRessv+boge/JAQidS6u8F2m2ESSqSxisv/0irgtKIMJwXaoX4=
=1//8
-----END PGP SIGNATURE-----
Merge tag 'docs-6.4' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"Commit volume in documentation is relatively low this time, but there
is still a fair amount going on, including:
- Reorganize the architecture-specific documentation under
Documentation/arch
This makes the structure match the source directory and helps to
clean up the mess that is the top-level Documentation directory a
bit. This work creates the new directory and moves x86 and most of
the less-active architectures there.
The current plan is to move the rest of the architectures in 6.5,
with the patches going through the appropriate subsystem trees.
- Some more Spanish translations and maintenance of the Italian
translation
- A new "Kernel contribution maturity model" document from Ted
- A new tutorial on quickly building a trimmed kernel from Thorsten
Plus the usual set of updates and fixes"
* tag 'docs-6.4' of git://git.lwn.net/linux: (47 commits)
media: Adjust column width for pdfdocs
media: Fix building pdfdocs
docs: clk: add documentation to log which clocks have been disabled
docs: trace: Fix typo in ftrace.rst
Documentation/process: always CC responsible lists
docs: kmemleak: adjust to config renaming
ELF: document some de-facto PT_* ABI quirks
Documentation: arm: remove stih415/stih416 related entries
docs: turn off "smart quotes" in the HTML build
Documentation: firmware: Clarify firmware path usage
docs/mm: Physical Memory: Fix grammar
Documentation: Add document for false sharing
dma-api-howto: typo fix
docs: move m68k architecture documentation under Documentation/arch/
docs: move parisc documentation under Documentation/arch/
docs: move ia64 architecture docs under Documentation/arch/
docs: Move arc architecture docs under Documentation/arch/
docs: move nios2 documentation under Documentation/arch/
docs: move openrisc documentation under Documentation/arch/
docs: move superh documentation under Documentation/arch/
...
o MAINTAINERS files additions and changes.
o Fix hotplug warning in nohz code.
o Tick dependency changes by Zqiang.
o Lazy-RCU shrinker fixes by Zqiang.
o rcu-tasks stall reporting improvements by Neeraj.
o Initial changes for renaming of k[v]free_rcu() to its new k[v]free_rcu_mightsleep()
name for robustness.
o Documentation Updates:
o Significant changes to srcu_struct size.
o Deadlock detection for srcu_read_lock() vs synchronize_srcu() from Boqun.
o rcutorture and rcu-related tool, which are targeted for v6.4 from Boqun's tree.
o Other misc changes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmQuBnIACgkQqA4nf2o4
5hACVRAAoXu7/gfh5Pjw9O4E4pCdPJKsZZVYrcrVGrq6NAxRn6M1SgurAdC5grj2
96x0waoGaiO82V0H5iJMcKdAVu67x9R8WaQ1JoxN75Efn8h9W4TguB87TV1gk0xS
eZ18b/CyEaM5mNb80DFFF4FLohy5737p/kNTMqXQdUyR1BsDl16iRMgjiBiFhNUx
yPo8Y2kC2U2OTbldZgaE7s9bQO3xxEcifx93sGWsAex/gx54FYNisiwSlCOSgOE+
XkYo/OKk8Xvr82tLVX8XQVEPCMJ+rxea8T5zSs8/alvsPq7gA8wW3y6fsoa3vUU/
+Gd+W+Q/OsONIDtp8rQAY1qsD0ScDpaR8052RSH0zTa7pj8HsQgE5PjZ+cJW0SEi
cKN+Oe8+ETqKald+xZ6PDf58O212VLrru3RpQWrOQcJ7fmKmfT4REK0RcbLgg4qT
CBgOo6eg+ub4pxq2y11LZJBNTv1/S7xAEzFE0kArew64KB2gyVud0VJRZVAJnEfe
93QQVDFrwK2bhgWQZ6J6IbTvGeQW0L93IibuaU6jhZPR283VtUIIvM7vrOylN7Fq
4jsae0T7YGYfKUhgTpm7rCnm8A/D3Ni8MY0sKYYgDSyKmZUsnpI5wpx1xke4lwwV
ErrY46RCFa+k8wscc6iWfB4cGXyyFHyu+wtyg0KpFn5JAzcfz4A=
=Rgbj
-----END PGP SIGNATURE-----
Merge tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux
Pull RCU updates from Joel Fernandes:
- Updates and additions to MAINTAINERS files, with Boqun being added to
the RCU entry and Zqiang being added as an RCU reviewer.
I have also transitioned from reviewer to maintainer; however, Paul
will be taking over sending RCU pull-requests for the next merge
window.
- Resolution of hotplug warning in nohz code, achieved by fixing
cpu_is_hotpluggable() through interaction with the nohz subsystem.
Tick dependency modifications by Zqiang, focusing on fixing usage of
the TICK_DEP_BIT_RCU_EXP bitmask.
- Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
kernels, fixed by Zqiang.
- Improvements to rcu-tasks stall reporting by Neeraj.
- Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
increased robustness, affecting several components like mac802154,
drbd, vmw_vmci, tracing, and more.
A report by Eric Dumazet showed that the API could be unknowingly
used in an atomic context, so we'd rather make sure they know what
they're asking for by being explicit:
https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/
- Documentation updates, including corrections to spelling,
clarifications in comments, and improvements to the srcu_size_state
comments.
- Better srcu_struct cache locality for readers, by adjusting the size
of srcu_struct in support of SRCU usage by Christoph Hellwig.
- Teach lockdep to detect deadlocks between srcu_read_lock() vs
synchronize_srcu() contributed by Boqun.
Previously lockdep could not detect such deadlocks, now it can.
- Integration of rcutorture and rcu-related tools, targeted for v6.4
from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
module parameter, and more
- Miscellaneous changes, various code cleanups and comment improvements
* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
checkpatch: Error out if deprecated RCU API used
mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
rcu/trace: use strscpy() to instead of strncpy()
tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
...
Merge my x86 user copy updates branch.
This cleans up a lot of our x86 memory copy code, particularly for user
accesses. I've been pushing for microarchitectural support for good
memory copying and clearing for a long while, and it's been visible in
how the kernel has aggressively used 'rep movs' and 'rep stos' whenever
possible.
And that micro-architectural support has been improving over the years,
to the point where on modern CPU's the best option for a memory copy
that would become a function call (as opposed to being something that
can just be turned into individual 'mov' instructions) is now to inline
the string instruction sequence instead.
However, that only makes sense when we have the modern markers for this:
the x86 FSRM and FSRS capabilities ("Fast Short REP MOVS/STOS").
So this cleans up a lot of our historical code, gets rid of the legacy
marker use ("REP_GOOD" and "ERMS") from the memcpy/memset cases, and
replaces it with that modern reality. Note that REP_GOOD and ERMS end
up still being used by the known large cases (ie page copyin gand
clearing).
The reason much of this ends up being about user memory accesses is that
the normal in-kernel cases are done by the compiler (__builtin_memcpy()
and __builtin_memset()) and getting to the point where we can use our
instruction rewriting to inline those to be string instructions will
need some compiler support.
In contrast, the user accessor functions are all entirely controlled by
the kernel code, so we can change those arbitrarily.
Thanks to Borislav Petkov for feedback on the series, and Jens testing
some of this on micro-architectures I didn't personally have access to.
* x86-rep-insns:
x86: rewrite '__copy_user_nocache' function
x86: remove 'zerorest' argument from __copy_user_nocache()
x86: set FSRS automatically on AMD CPUs that have FSRM
x86: improve on the non-rep 'copy_user' function
x86: improve on the non-rep 'clear_user' function
x86: inline the 'rep movs' in user copies for the FSRM case
x86: move stac/clac from user copy routines into callers
x86: don't use REP_GOOD or ERMS for user memory clearing
x86: don't use REP_GOOD or ERMS for user memory copies
x86: don't use REP_GOOD or ERMS for small memory clearing
x86: don't use REP_GOOD or ERMS for small memory copies
Add missing clockticks and cas_count_* events for Intel SapphireRapids IMC
PMU. These events are useful to measure memory bandwidth.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20230419214241.2310385-1-eranian@google.com
* kvm-arm64/smccc-filtering:
: .
: SMCCC call filtering and forwarding to userspace, courtesy of
: Oliver Upton. From the cover letter:
:
: "The Arm SMCCC is rather prescriptive in regards to the allocation of
: SMCCC function ID ranges. Many of the hypercall ranges have an
: associated specification from Arm (FF-A, PSCI, SDEI, etc.) with some
: room for vendor-specific implementations.
:
: The ever-expanding SMCCC surface leaves a lot of work within KVM for
: providing new features. Furthermore, KVM implements its own
: vendor-specific ABI, with little room for other implementations (like
: Hyper-V, for example). Rather than cramming it all into the kernel we
: should provide a way for userspace to handle hypercalls."
: .
KVM: selftests: Fix spelling mistake "KVM_HYPERCAL_EXIT_SMC" -> "KVM_HYPERCALL_EXIT_SMC"
KVM: arm64: Test that SMC64 arch calls are reserved
KVM: arm64: Prevent userspace from handling SMC64 arch range
KVM: arm64: Expose SMC/HVC width to userspace
KVM: selftests: Add test for SMCCC filter
KVM: selftests: Add a helper for SMCCC calls with SMC instruction
KVM: arm64: Let errors from SMCCC emulation to reach userspace
KVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version
KVM: arm64: Introduce support for userspace SMCCC filtering
KVM: arm64: Add support for KVM_EXIT_HYPERCALL
KVM: arm64: Use a maple tree to represent the SMCCC filter
KVM: arm64: Refactor hvc filtering to support different actions
KVM: arm64: Start handling SMCs from EL1
KVM: arm64: Rename SMC/HVC call handler to reflect reality
KVM: arm64: Add vm fd device attribute accessors
KVM: arm64: Add a helper to check if a VM has ran once
KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL
Signed-off-by: Marc Zyngier <maz@kernel.org>
I didn't really want to do this, but as part of all the other changes to
the user copy loops, I've been looking at this horror.
I tried to clean it up multiple times, but every time I just found more
problems, and the way it's written, it's just too hard to fix them.
For example, the code is written to do quad-word alignment, and will use
regular byte accesses to get to that point. That's fairly simple, but
it means that any initial 8-byte alignment will be done with cached
copies.
However, the code then is very careful to do any 4-byte _tail_ accesses
using an uncached 4-byte write, and that was claimed to be relevant in
commit a82eee7424 ("x86/uaccess/64: Handle the caching of 4-byte
nocache copies properly in __copy_user_nocache()").
So if you do a 4-byte copy using that function, it carefully uses a
4-byte 'movnti' for the destination. But if you were to do a 12-byte
copy that is 4-byte aligned, it would _not_ do a 4-byte 'movnti'
followed by a 8-byte 'movnti' to keep it all uncached.
Instead, it would align the destination to 8 bytes using a
byte-at-a-time loop, and then do a 8-byte 'movnti' for the final 8
bytes.
The main caller that cares is __copy_user_flushcache(), which knows
about this insanity, and has odd cases for it all. But I just can't
deal with looking at this kind of "it does one case right, and another
related case entirely wrong".
And the code really wasn't fixable without hard drugs, which I try to
avoid.
So instead, rewrite it in a form that hopefully not only gets this
right, but is a bit more maintainable. Knock wood.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a lot of code here that hard-codes that the
data is a single page, and right now that seems to
be sufficient, but to make it easier to change this
in the future, add a new STUB_DATA_PAGES constant
and use it throughout the code.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Avoid cluttering up the kallsyms symbol table with entries that should
not end up in things like backtraces, as they have undescriptive and
generated identifiers.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Avoid cluttering up the kallsyms symbol table with entries that should
not end up in things like backtraces, as they have undescriptive and
generated identifiers.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Avoid cluttering up the kallsyms symbol table with entries that should
not end up in things like backtraces, as they have undescriptive and
generated identifiers.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer RIP-relative addressing where possible, which removes the need
for boot time relocation fixups.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer RIP-relative addressing where possible, which removes the need
for boot time relocation fixups.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer RIP-relative addressing where possible, which removes the need
for boot time relocation fixups.
Co-developed-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer RIP-relative addressing where possible, which removes the need
for boot time relocation fixups.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer RIP-relative addressing where possible, which removes the need
for boot time relocation fixups.
Co-developed-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer RIP-relative addressing where possible, which removes the need
for boot time relocation fixups.
Co-developed-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer RIP-relative addressing where possible, which removes the need
for boot time relocation fixups.
Co-developed-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Thomas Garnier <thgarnie@chromium.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer RIP-relative addressing where possible, which removes the need
for boot time relocation fixups.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer RIP-relative addressing where possible, which removes the need
for boot time relocation fixups. In the GCM case, we can get rid of the
oversized permutation array entirely while at it.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prefer RIP-relative addressing where possible, which removes the need
for boot time relocation fixups.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Include <asm-generic/fb.h> and set the required preprocessor tokens
correctly. x86 now implements its own set of fb helpers, but still
follows the overall pattern of the other <asm/fb.h> files.
v3:
* clarified commit message
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Helge Deller <deller@gmx.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20230417125651.25126-20-tzimmermann@suse.de
Every caller passes in zero, meaning they don't want any partial copy to
zero the remainder of the destination buffer.
Which is just as well, because the implementation of that function
didn't actually even look at that argument, and wasn't even aware it
existed, although some misleading comments did mention it still.
The 'zerorest' thing is a historical artifact of how "copy_from_user()"
worked, in that it would zero the rest of the kernel buffer that it
copied into.
That zeroing still exists, but it's long since been moved to generic
code, and the raw architecture-specific code doesn't do it. See
_copy_from_user() in lib/usercopy.c for this all.
However, while __copy_user_nocache() shares some history and superficial
other similarities with copy_from_user(), it is in many ways also very
different.
In particular, while the code makes it *look* similar to the generic
user copy functions that can copy both to and from user space, and take
faults on both reads and writes as a result, __copy_user_nocache() does
no such thing at all.
__copy_user_nocache() always copies to kernel space, and will never take
a page fault on the destination. What *can* happen, though, is that the
non-temporal stores take a machine check because one of the use cases is
for writing to stable memory, and any memory errors would then take
synchronous faults.
So __copy_user_nocache() does look a lot like copy_from_user(), but has
faulting behavior that is more akin to our old copy_in_user() (which no
longer exists, but copied from user space to user space and could fault
on both source and destination).
And it very much does not have the "zero the end of the destination
buffer", since a problem with the destination buffer is very possibly
the very source of the partial copy.
So this whole thing was just a confusing historical artifact from having
shared some code with a completely different function with completely
different use cases.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
So Intel introduced the FSRS ("Fast Short REP STOS") CPU capability bit,
because they seem to have done the (much simpler) REP STOS optimizations
separately and later than the REP MOVS one.
In contrast, when AMD introduced support for FSRM ("Fast Short REP
MOVS"), in the Zen 3 core, it appears to have improved the REP STOS case
at the same time, and since the FSRS bit was added by Intel later, it
doesn't show up on those AMD Zen 3 cores.
And now that we made use of FSRS for the "rep stos" conditional, that
made those AMD machines unnecessarily slower. The Intel situation where
"rep movs" is fast, but "rep stos" isn't, is just odd. The 'stos' case
is a lot simpler with no aliasing, no mutual alignment issues, no
complicated cases.
So this just sets FSRS automatically when FSRM is available on AMD
machines, to get back all the nice REP STOS goodness in Zen 3.
Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The old 'copy_user_generic_unrolled' function was oddly implemented for
largely historical reasons: it had been largely based on the uncached
copy case, which has some other concerns.
For example, the __copy_user_nocache() function uses 'movnti' for the
destination stores, and those want the destination to be aligned. In
contrast, the regular copy function doesn't really care, and trying to
align things only complicates matters.
Also, like the clear_user function, the copy function had some odd
handling of the repeat counts, complicating the exception handling for
no really good reason. So as with clear_user, just write it to keep all
the byte counts in the %rcx register, exactly like the 'rep movs'
functionality that this replaces.
Unlike a real 'rep movs', we do allow for this to trash a few temporary
registers to not have to unnecessarily save/restore registers on the
stack.
And like the clearing case, rename this to what it now clearly is:
'rep_movs_alternative', and make it one coherent function, so that it
shows up as such in profiles (instead of the odd split between
"copy_user_generic_unrolled" and "copy_user_short_string", the latter of
which was not about strings at all, and which was shared with the
uncached case).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The old version was oddly written to have the repeat count in multiple
registers. So instead of taking advantage of %rax being zero, it had
some sub-counts in it. All just for a "single word clearing" loop,
which isn't even efficient to begin with.
So get rid of those games, and just keep all the state in the same
registers we got it in (and that we should return things in). That not
only makes this act much more like 'rep stos' (which this function is
replacing), but makes it much easier to actually do the obvious loop
unrolling.
Also rename the function from the now nonsensical 'clear_user_original'
to what it now clearly is: 'rep_stos_alternative'.
End result: if we don't have a fast 'rep stosb', at least we can have a
fast fallback for it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This does the same thing for the user copies as commit 0db7058e8e
("x86/clear_user: Make it faster") did for clear_user(). In other
words, it inlines the "rep movs" case when X86_FEATURE_FSRM is set,
avoiding the function call entirely.
In order to do that, it makes the calling convention for the out-of-line
case ("copy_user_generic_unrolled") match the 'rep movs' calling
convention, although it does also end up clobbering a number of
additional registers.
Also, to simplify code sharing in the low-level assembly with the
__copy_user_nocache() function (that uses the normal C calling
convention), we end up with a kind of mixed return value for the
low-level asm code: it will return the result in both %rcx (to work as
an alternative for the 'rep movs' case), _and_ in %rax (for the nocache
case).
We could avoid this by wrapping __copy_user_nocache() callers in an
inline asm, but since the cost is just an extra register copy, it's
probably not worth it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is preparatory work for inlining the 'rep movs' case, but also a
cleanup. The __copy_user_nocache() function was mis-used by the rdma
code to do uncached kernel copies that don't actually want user copies
at all, and as a result doesn't want the stac/clac either.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The modern target to use is FSRS (Fast Short REP STOS), and the other
cases should only be used for bigger areas (ie mainly things like page
clearing).
Note! This changes the conditional for the inlining from FSRM ("fast
short rep movs") to FSRS ("fast short rep stos").
We'll have a separate fixup for AMD microarchitectures that have a good
'rep stosb' yet do not set the new Intel-specific FSRS bit (because FSRM
was there first).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The modern target to use is FSRM (Fast Short REP MOVS), and the other
cases should only be used for bigger areas (ie mainly things like page
clearing).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The modern target to use is FSRS (Fast Short REP STOS), and the other
cases should only be used for bigger areas (ie mainly things like page
clearing).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The modern target to use is FSRM (Fast Short REP MOVS), and the other
cases should only be used for bigger areas (ie mainly things like page
copying and clearing).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP config option to
indicate devdax and hugetlb vmemmap optimization support. Hence rename
that to a generic ARCH_WANT_OPTIMIZE_VMEMMAP
Link: https://lkml.kernel.org/r/20230412050025.84346-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Tarun Sahu <tsahu@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fundamentally semaphores are a counted primitive, but
DEFINE_SEMAPHORE() does not expose this and explicitly creates a
binary semaphore.
Change DEFINE_SEMAPHORE() to take a number argument and use that in the
few places that open-coded it using __SEMAPHORE_INITIALIZER().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[mcgrof: add some tribal knowledge about why some folks prefer
binary sempahores over mutexes]
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Virtual Trust Levels (VTL) helps enable Hyper-V Virtual Secure Mode (VSM)
feature. VSM is a set of hypervisor capabilities and enlightenments
offered to host and guest partitions which enable the creation and
management of new security boundaries within operating system software.
VSM achieves and maintains isolation through VTLs.
Add early initialization for Virtual Trust Levels (VTL). This includes
initializing the x86 platform for VTL and enabling boot support for
secondary CPUs to start in targeted VTL context. For now, only enable
the code for targeted VTL level as 2.
When starting an AP at a VTL other than VTL0, the AP must start directly
in 64-bit mode, bypassing the usual 16-bit -> 32-bit -> 64-bit mode
transition sequence that occurs after waking up an AP with SIPI whose
vector points to the 16-bit AP startup trampoline code.
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
Link: https://lore.kernel.org/r/1681192532-15460-6-git-send-email-ssengar@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Add structs and hypercalls required to enable VTL support on x86.
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
Link: https://lore.kernel.org/r/1681192532-15460-3-git-send-email-ssengar@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Make get/set_rtc_noop() to be public so that they can be used
in other modules as well.
Co-developed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/1681192532-15460-2-git-send-email-ssengar@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The usage of the BIT() macro in inline asm code was introduced in 6.3 by
the commit in the Fixes tag. However, this macro uses "1UL" for integer
constant suffixes in its shift operation, while gas before 2.28 does not
support the "L" suffix after a number, and gas before 2.27 does not
support the "U" suffix, resulting in build errors such as the following
with such versions:
./arch/x86/include/asm/uaccess_64.h:124: Error: found 'L', expected: ')'
./arch/x86/include/asm/uaccess_64.h:124: Error: junk at end of line,
first unrecognized character is `L'
However, the currently minimal binutils version the kernel supports is
2.25.
There's a single use of this macro here, revert to (1 << 0) that works
with such older binutils.
As an additional info, the binutils PRs which add support for those
suffixes are:
https://sourceware.org/bugzilla/show_bug.cgi?id=19910https://sourceware.org/bugzilla/show_bug.cgi?id=20732
[ bp: Massage and extend commit message. ]
Fixes: 5d1dd961e7 ("x86/alternatives: Add alt_instr.flags")
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/lkml/a9aae568-3046-306c-bd71-92c1fc8eeddc@linux.alibaba.com/
In the case where page tables are not freed, native_flush_tlb_multi()
does not do a remote TLB flush on CPUs in lazy TLB mode because the
CPU will flush itself at the next context switch. By comparison, the
Hyper-V enlightened TLB flush does not exclude CPUs in lazy TLB mode
and so performs unnecessary flushes.
If we're not freeing page tables, add logic to test for lazy TLB
mode when adding CPUs to the input argument to the Hyper-V TLB
flush hypercall. Exclude lazy TLB mode CPUs so the behavior
matches native_flush_tlb_multi() and the unnecessary flushes are
avoided. Handle both the <=64 vCPU case and the _ex case for >64
vCPUs.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1679922967-26582-3-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
When copying CPUs from a Linux cpumask to a Hyper-V VPset,
cpumask_to_vpset() currently has a "_noself" variant that doesn't copy
the current CPU to the VPset. Generalize this variant by replacing it
with a "_skip" variant having a callback function that is invoked for
each CPU to decide if that CPU should be copied. Update the one caller
of cpumask_to_vpset_noself() to use the new "_skip" variant instead.
No functional change.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1679922967-26582-2-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
For PCI pass-thru devices in a Confidential VM, Hyper-V requires
that PCI config space be accessed via hypercalls. In normal VMs,
config space accesses are trapped to the Hyper-V host and emulated.
But in a confidential VM, the host can't access guest memory to
decode the instruction for emulation, so an explicit hypercall must
be used.
Add functions to make the new MMIO read and MMIO write hypercalls.
Update the PCI config space access functions to use the hypercalls
when such use is indicated by Hyper-V flags. Also, set the flag to
allow the Hyper-V PCI driver to be loaded and used in a Confidential
VM (a.k.a., "Isolation VM"). The driver has previously been hardened
against a malicious Hyper-V host[1].
[1] https://lore.kernel.org/all/20220511223207.3386-2-parri.andrea@gmail.com/
Co-developed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://lore.kernel.org/r/1679838727-87310-13-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
With the vTOM bit now treated as a protection flag and not part of
the physical address, avoid remapping physical addresses with vTOM set
since technically such addresses aren't valid. Use ioremap_cache()
instead of memremap() to ensure that the mapping provides decrypted
access, which will correctly set the vTOM bit as a protection flag.
While this change is not required for correctness with the current
implementation of memremap(), for general code hygiene it's better to
not depend on the mapping functions doing something reasonable with
a physical address that is out-of-range.
While here, fix typos in two error messages.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Link: https://lore.kernel.org/r/1679838727-87310-12-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
With changes to how Hyper-V guest VMs flip memory between private
(encrypted) and shared (decrypted), creating a second kernel virtual
mapping for shared memory is no longer necessary. Everything needed
for the transition to shared is handled by set_memory_decrypted().
As such, remove the code to create and manage the second
mapping for the pre-allocated send and recv buffers. This mapping
is the last user of hv_map_memory()/hv_unmap_memory(), so delete
these functions as well. Finally, hv_map_memory() is the last
user of vmap_pfn() in Hyper-V guest code, so remove the Kconfig
selection of VMAP_PFN.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Link: https://lore.kernel.org/r/1679838727-87310-11-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
With changes to how Hyper-V guest VMs flip memory between private
(encrypted) and shared (decrypted), creating a second kernel virtual
mapping for shared memory is no longer necessary. Everything needed
for the transition to shared is handled by set_memory_decrypted().
As such, remove swiotlb_unencrypted_base and the associated
code.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/1679838727-87310-8-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Merge the following 6 patches from tip/x86/sev, which are taken from
Michael Kelley's series [0]. The rest of Michael's series depend on
them.
x86/hyperv: Change vTOM handling to use standard coco mechanisms
init: Call mem_encrypt_init() after Hyper-V hypercall init is done
x86/mm: Handle decryption/re-encryption of bss_decrypted consistently
Drivers: hv: Explicitly request decrypted in vmap_pfn() calls
x86/hyperv: Reorder code to facilitate future work
x86/ioremap: Add hypervisor callback for private MMIO mapping in coco VM
0: https://lore.kernel.org/linux-hyperv/1679838727-87310-1-git-send-email-mikelley@microsoft.com/
boot is done, in order to prevent a crash
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQ76ZYACgkQEsHwGGHe
VUouEA//UfGRp9FysHX80FRIb9ZBwdVYZJDdf6CKaSqVQ+JLlT2cDZN3XrogtGG9
1mxyRhmLXuh6D944o4NQIOsz4odXh9oNqNtIvqHz37YcMeJEtZlm/y6mKOJ4X8+E
zbjlNyEsMvnqN5Y0Dr+fOSKgulhYLBOhFa+rgPqJqPOQ2/Ug4pCxJUrBhHxonPXQ
rNjjG1kWyHWlGTh1CGtfESiYZUOuC5Ag4hahbB+VLMyW5iUALmUIfGnwpciuwEFx
zNE0EgHW8al4e+UfH+Pa5mdyawMfur26l/EmA1yNEDbXH1uQNR1EUllD17ybfDnq
MxicIGjpY8Cv1DZQzIAi1JFRVGcYyTIPGfn7P7pzF8C4Q9aOaOLr1MZhA8OA3D3o
/Gb5Jn09uLKjn5u3axKjgs8pVAAmb3xWURLXxBILC2Mv9S7m1pgsbTmz3JmX3kpt
6nqFXvxt6ZZ6xTLC5GM8RRyed4/gsPX3O7LuamjtSaswZqh0DnVXU64meShLzTph
z2u5rTPOXeYxQE2C87bh0ghU/G2mywrH1wsK59OhDSzsKcJYD5HkG66aBFyCAkoV
e0a565BC/lUn+6DjCK4ldTKv+QSbtwMyUg2iJxh/bhENROEI/0t/sUe2WYxQJCKn
UaEIsJn0Cj4/iyAROt1nFRnZknYZgeGu587+GCBwtFL+asRyUhs=
=Lrs2
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.3_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
- Drop __init annotation from two rtc functions which get called after
boot is done, in order to prevent a crash
* tag 'x86_urgent_for_v6.3_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/rtc: Remove __init for runtime functions
- Drop debug info from purgatory objects again
- Document that kernel.org provides prebuilt LLVM toolchains
- Give up handling untracked files for source package builds
- Avoid creating corrupted cpio when KBUILD_BUILD_TIMESTAMP is given
with a pre-epoch data.
- Change panic_show_mem() to a macro to handle variable-length argument
- Compress tarballs on-the-fly again
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmQ7zuUVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsG3/AP/12Y/zmiCY/R2NQmhYAGt05K9I5R
M0mQbdNLlzmXfrXPEAUVWbUcEOZ1jFxqkXJwhf1kYHmJ+2MeS/n1d4A+4x4eGXDE
a2WfB/G8FP5+T6+u/U/LDfMqVpbuAfBkZXu5QeceUlHyTY1Q1BAO9PUvNu+Df2P7
i1SBEDVzaN3WE1ZOc3pjQf0tIQybyM/x9EjGrh4DpBVTQYpmmgxpCdyzVr7yJvQ3
LFHy+OtX2WTqtgULUunH776pp59EGgoKLDSLck+wG7bdmYm5y/13uVGc1iBfGBAX
RUV8qaP1ijB0BHZZnzsppUjZshSeS97sOUVv6hiwRdWgr0ISfZuVYN0E2YlhBcFV
UUidlk1l1VOytM8/EDrYyHnTvmHm+glMp1FxRR48ymZr1PqVUxQcad0lPClylp1b
Xc50C2wkFwa5a8RkY0aIihrVpnbHBSiPVHvaF01kFwNUor+VASpanR/xtTr4b88x
OK9aImRII15CxoOZdWtvut4c0OHw4sbyzmCuXM/nyS6c5+yroM/QZZs+c2ja/QEv
QNlDW54JzU6u+JE4O7W/gH3mqKH8ytL7Z5hmVECiiCYWp77IBCE8B+3dXVGfdGUg
Wy8MgtvZMW0ZAseqBD6VmVXkQIizUAgpJvkJy7R6YTw52P6sT2P2WAcWUTi2FQTD
FpUEf8eMi1UTDQgG
=bFZ1
-----END PGP SIGNATURE-----
Merge tag 'kbuild-fixes-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Drop debug info from purgatory objects again
- Document that kernel.org provides prebuilt LLVM toolchains
- Give up handling untracked files for source package builds
- Avoid creating corrupted cpio when KBUILD_BUILD_TIMESTAMP is given
with a pre-epoch data.
- Change panic_show_mem() to a macro to handle variable-length argument
- Compress tarballs on-the-fly again
* tag 'kbuild-fixes-v6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: do not create intermediate *.tar for tar packages
kbuild: do not create intermediate *.tar for source tarballs
kbuild: merge cmd_archive_linux and cmd_archive_perf
init/initramfs: Fix argument forwarding to panic() in panic_show_mem()
initramfs: Check negative timestamp to prevent broken cpio archive
kbuild: give up untracked files for source package builds
Documentation/llvm: Add a note about prebuilt kernel.org toolchains
purgatory: fix disabling debug info
vga_default_device really is supposed to cover all corners, at least
for x86. Additionally checking for rom shadowing should be redundant,
because the bios/fw only does that for the boot vga device.
If this turns out to be wrong, then most likely that's a special case
which should be added to the vgaarb code, not replicated all over.
Patch motived by changes to the aperture helpers, which also have this
open code in a bunch of places, and which are all removed in a
clean-up series. This function here is just for selecting the default
fbdev device for fbcon, but I figured for consistency it might be good
to throw this patch in on top.
Note that the shadow rom check predates vgaarb, which was only wired
up in commit 88674088d1 ("x86: Use vga_default_device() when
determining whether an fb is primary"). That patch doesn't explain why
we still fall back to the shadow rom check.
v4:
- fix commit message style (i.e., commit 1234 ("..."))
- fix Daniel's S-o-b address
v5:
- add back an S-o-b tag with Daniel's Intel address
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Helge Deller <deller@gmx.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Javier Martinez Canillas <javierm@redhat.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230406132109.32050-9-tzimmermann@suse.de
Annotate the function prototype and definition as noreturn to prevent
objtool warnings like:
vmlinux.o: warning: objtool: hyperv_init+0x55c: unreachable instruction
Also, as per Josh's suggestion, add it to the global_noreturns list.
As a comparison, an objdump output without the annotation:
[...]
1b63: mov $0x1,%esi
1b68: xor %edi,%edi
1b6a: callq ffffffff8102f680 <hv_ghcb_terminate>
1b6f: jmpq ffffffff82f217ec <hyperv_init+0x9c> # unreachable
1b74: cmpq $0xffffffffffffffff,-0x702a24(%rip)
[...]
Now, after adding the __noreturn to the function prototype:
[...]
17df: callq ffffffff8102f6d0 <hv_ghcb_negotiate_protocol>
17e4: test %al,%al
17e6: je ffffffff82f21bb9 <hyperv_init+0x469>
[...] <many insns>
1bb9: mov $0x1,%esi
1bbe: xor %edi,%edi
1bc0: callq ffffffff8102f680 <hv_ghcb_terminate>
1bc5: nopw %cs:0x0(%rax,%rax,1) # end of function
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/32453a703dfcf0d007b473c9acbf70718222b74b.1681342859.git.jpoimboe@kernel.org
Since commit 8b41fc4454 ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.
So remove it in the files in this commit, none of which can be built as
modules.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-modules@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Now this can no longer be built as a module, drop all remaining
module-related code as well.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-modules@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Since commit 8b41fc4454 ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.
So remove it in the files in this commit, none of which can be built as
modules.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-modules@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
set_rtc_noop(), get_rtc_noop() are after booting, therefore their __init
annotation is wrong.
A crash was observed on an x86 platform where CMOS RTC is unused and
disabled via device tree. set_rtc_noop() was invoked from ntp:
sync_hw_clock(), although CONFIG_RTC_SYSTOHC=n, however sync_cmos_clock()
doesn't honour that.
Workqueue: events_power_efficient sync_hw_clock
RIP: 0010:set_rtc_noop
Call Trace:
update_persistent_clock64
sync_hw_clock
Fix this by dropping the __init annotation from set/get_rtc_noop().
Fixes: c311ed6183 ("x86/init: Allow DT configured systems to disable RTC at boot time")
Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nokia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/59f7ceb1-446b-1d3d-0bc8-1f0ee94b1e18@nokia.com
arch_dynirq_lower_bound() is invoked by the core interrupt code to
retrieve the lowest possible Linux interrupt number for dynamically
allocated interrupts like MSI.
The x86 implementation uses this to exclude the IO/APIC GSI space.
This works correctly as long as there is an IO/APIC registered, but
returns 0 if not. This has been observed in VMs where the BIOS does
not advertise an IO/APIC.
0 is an invalid interrupt number except for the legacy timer interrupt
on x86. The return value is unchecked in the core code, so it ends up
to allocate interrupt number 0 which is subsequently considered to be
invalid by the caller, e.g. the MSI allocation code.
The function has already a check for 0 in the case that an IO/APIC is
registered, as ioapic_dynirq_base is 0 in case of device tree setups.
Consolidate this and zero check for both ioapic_dynirq_base and gsi_top,
which is used in the case that no IO/APIC is registered.
Fixes: 3e5bedc2c2 ("x86/apic: Fix arch_dynirq_lower_bound() bug for DT enabled machines")
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1679988604-20308-1-git-send-email-ssengar@linux.microsoft.com
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmQ1qFYUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxRJA/+JnDrZ8Ag8sIGgI7meuqWcLe3dHXQ
7KaOGM/MP0ir8qJwiobuon2ml1lEU+iWZEcpFH9b9b5iurAfbvZkPhFAWZzJG3GW
B6qEX5zaK4tNJHC8byK4pSat2pJLg8Q6xwA6c3H9264tj4Dc83dg2Y8hWWozbU2l
xFaawRexCus7Xzd2VEfskwOpHQpSYkWCDtTFclIt1iZeelj6an/0FVdvYqvvIXi+
QZjS80tMeZIcKJq06023KJX36lZHf0RGzYzQTyr7asiNnXBnQ1DfYIASwx1Nf6Y3
9PpvoDVbUGnLDAB8THcOM9j9UOrhlbOJ+4b60J0b4nsT0ORn4rCrN05K0DIIXZi0
CsP3qR2l04xYZFElWrvnA30KS76G7vq0OeAGpQsE0nB1PgydXuiXk5lniQhsrOuj
xRiY7YeDFwiNzpjWsVtqkANvl6tA53ZUn+AMEPfIpdYbjnbLcSSF/qY3WCeKoftx
KUg6fVOoXbbSctK5rXQnj2yhwQZSkm3DrG9Ol2i3E524vcMKs28QBBuUN+gtL3TP
aD6nYP2Q0gItYAQCMtJHCBzfstrO5l9JSLM6xUHR/GwiaLCC9+QRdLJFcAxZvZJi
buGs09B2Xz8AbgQ3AejEw7UXk+4JP19ZLyznBhLaaQQO9YOvDYlcn3Xw1QGvJGY/
qsYKElgrLWSz+50=
=2gjK
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Provide pci_msix_can_alloc_dyn() stub when CONFIG_PCI_MSI unset to
avoid build errors (Reinette Chatre)
- Quirk AMD XHCI controller that loses MSI-X state in D3hot to avoid
broken USB after hotplug or suspend/resume (Basavaraj Natikar)
- Fix use-after-free in pci_bus_release_domain_nr() (Rob Herring)
* tag 'pci-v6.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: Fix use-after-free in pci_bus_release_domain_nr()
x86/PCI: Add quirk for AMD XHCI controller that loses MSI-X state in D3hot
PCI/MSI: Provide missing stub for pci_msix_can_alloc_dyn()
On Google Coral and Reef family Chromebooks with Intel Apollo Lake SoC,
firmware clobbers the header of the L1 PM Substates capability and the
previous capability when returning from D3cold to D0.
Save those headers at enumeration-time and restore them at resume.
[bhelgaas: The main benefit is to make the lspci output after resume
correct. Apparently there's little or no effect on power consumption.]
Link: https://lore.kernel.org/linux-pci/CAFJ_xbq0cxcH-cgpXLU4Mjk30+muWyWm1aUZGK7iG53yaLBaQg@mail.gmail.com/T/#u
Link: https://lore.kernel.org/r/20230411160213.4453-1-ron.lee@intel.com
Signed-off-by: Ron Lee <ron.lee@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Filter out XTILE_CFG from the supported XCR0 reported to userspace if the
current process doesn't have access to XTILE_DATA. Attempting to set
XTILE_CFG in XCR0 will #GP if XTILE_DATA is also not set, and so keeping
XTILE_CFG as supported results in explosions if userspace feeds
KVM_GET_SUPPORTED_CPUID back into KVM and the guest doesn't sanity check
CPUID.
Fixes: 445ecdf79b ("kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID")
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Aaron Lewis <aaronlewis@google.com>
Tested-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20230405004520.421768-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a helper, kvm_get_filtered_xcr0(), to dedup code that needs to account
for XCR0 features that require explicit opt-in on a per-process basis. In
addition to documenting when KVM should/shouldn't consult
xstate_get_guest_group_perm(), the helper will also allow sanitizing the
filtered XCR0 to avoid enumerating architecturally illegal XCR0 values,
e.g. XTILE_CFG without XTILE_DATA.
No functional changes intended.
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
[sean: rename helper, move to x86.h, massage changelog]
Reviewed-by: Aaron Lewis <aaronlewis@google.com>
Tested-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20230405004520.421768-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>