Commit Graph

11984 Commits

Author SHA1 Message Date
David Woodhouse 310bc39546 KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock leaf node lock
In commit 14243b3871 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN
and event channel delivery") the clever version of me left some helpful
notes for those who would come after him:

       /*
        * For the irqfd workqueue, using the main kvm->lock mutex is
        * fine since this function is invoked from kvm_set_irq() with
        * no other lock held, no srcu. In future if it will be called
        * directly from a vCPU thread (e.g. on hypercall for an IPI)
        * then it may need to switch to using a leaf-node mutex for
        * serializing the shared_info mapping.
        */
       mutex_lock(&kvm->lock);

In commit 2fd6df2f2b ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
the other version of me ran straight past that comment without reading it,
and introduced a potential deadlock by taking vcpu->mutex and kvm->lock
in the wrong order.

Solve this as originally suggested, by adding a leaf-node lock in the Xen
state rather than using kvm->lock for it.

Fixes: 2fd6df2f2b ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20230111180651.14394-4-dwmw2@infradead.org>
[Rebase, add docs. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-11 17:45:58 -05:00
Jason A. Donenfeld 72bb8f8cc0 x86/insn: Avoid namespace clash by separating instruction decoder MMIO type from MMIO trace type
Both <linux/mmiotrace.h> and <asm/insn-eval.h> define various MMIO_ enum constants,
whose namespace overlaps.

Rename the <asm/insn-eval.h> ones to have a INSN_ prefix, so that the headers can be
used from the same source file.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230101162910.710293-2-Jason@zx2c4.com
2023-01-03 18:46:06 +01:00
Linus Torvalds 4f292c4de4 New Feature:
* Randomize the per-cpu entry areas
 Cleanups:
 * Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open
   coding it
 * Move to "native" set_memory_rox() helper
 * Clean up pmd_get_atomic() and i386-PAE
 * Remove some unused page table size macros
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOc53UACgkQaDWVMHDJ
 krCUHw//SGZ+La0hLZLAiAiZTXLZZHpYkOmg1Oj1+11qSU11uZzTFqDpauhaKpRS
 cJCSh+D+RXe5e2ipgt0+Zl0hESLt7pJf8258OE4ra0DL/IlyO9uqruAs9Kn3eRS/
 Fk76nG8gdEU+JKJqpG02GqOLslYQuIy96n9hpuj1x25b614+uezPfC7S4XEat0NT
 MbJQ+jnVDf16aJIJkzT+iSwhubDVeh+bSHeO0SSCzX23WLUqDeg5NvlyxoCHGbBh
 UpUTWggV/0pYAkBKRHToeJs8qTWREwuuH/8JGewpe9A0tjdB5wyZfNL2PuracweN
 9MauXC3T5f0+Ca4yIIaPq1fF7Ny/PR2dBFihk27rOD0N7tjaZxNwal2pB1sZcmvZ
 +PAokjyTPVH5ZXjkMYGGAUe1jyjwr2+TgFSZxhTnDuGtyVQiY4pihGKOifLCX6tv
 x6khvYeTBw7wfaDRtKEAf+2kLHYn+71HszHP/8bNKX9T03h+Zf0i1wdZu5xbM5Gc
 VK2wR7bCC+UftJJYG0pldcHg2qaF19RBHK2tLwp7zngUv7lTbkKfkgKjre73KV2a
 D4b76lrqdUMo6UYwYdw7WtDyarZS4OVLq2DcNhwwMddBCaX8kyN5a4AqwQlZYJ0u
 dM+kuMofE8U3yMxmMhJimkZUsj09yLHIqfynY0jbAcU3nhKZZNY=
 =wwVF
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Dave Hansen:
 "New Feature:

   - Randomize the per-cpu entry areas

  Cleanups:

   - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it

   - Move to "native" set_memory_rox() helper

   - Clean up pmd_get_atomic() and i386-PAE

   - Remove some unused page table size macros"

* tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  x86/mm: Ensure forced page table splitting
  x86/kasan: Populate shadow for shared chunk of the CPU entry area
  x86/kasan: Add helpers to align shadow addresses up and down
  x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
  x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
  x86/mm: Recompute physical address for every page of per-CPU CEA mapping
  x86/mm: Rename __change_page_attr_set_clr(.checkalias)
  x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
  x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
  x86/mm: Add a few comments
  x86/mm: Fix CR3_ADDR_MASK
  x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
  mm: Convert __HAVE_ARCH_P..P_GET to the new style
  mm: Remove pointless barrier() after pmdp_get_lockless()
  x86/mm/pae: Get rid of set_64bit()
  x86_64: Remove pointless set_64bit() usage
  x86/mm/pae: Be consistent with pXXp_get_and_clear()
  x86/mm/pae: Use WRITE_ONCE()
  x86/mm/pae: Don't (ab)use atomic64
  mm/gup: Fix the lockless PMD access
  ...
2022-12-17 14:06:53 -06:00
Linus Torvalds 8fa590bf34 ARM64:
* Enable the per-vcpu dirty-ring tracking mechanism, together with an
   option to keep the good old dirty log around for pages that are
   dirtied by something other than a vcpu.
 
 * Switch to the relaxed parallel fault handling, using RCU to delay
   page table reclaim and giving better performance under load.
 
 * Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option,
   which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d:
   "Fix a number of issues with MTE, such as races on the tags being
   initialised vs the PG_mte_tagged flag as well as the lack of support
   for VM_SHARED when KVM is involved.  Patches from Catalin Marinas and
   Peter Collingbourne").
 
 * Merge the pKVM shadow vcpu state tracking that allows the hypervisor
   to have its own view of a vcpu, keeping that state private.
 
 * Add support for the PMUv3p5 architecture revision, bringing support
   for 64bit counters on systems that support it, and fix the
   no-quite-compliant CHAIN-ed counter support for the machines that
   actually exist out there.
 
 * Fix a handful of minor issues around 52bit VA/PA support (64kB pages
   only) as a prefix of the oncoming support for 4kB and 16kB pages.
 
 * Pick a small set of documentation and spelling fixes, because no
   good merge window would be complete without those.
 
 s390:
 
 * Second batch of the lazy destroy patches
 
 * First batch of KVM changes for kernel virtual != physical address support
 
 * Removal of a unused function
 
 x86:
 
 * Allow compiling out SMM support
 
 * Cleanup and documentation of SMM state save area format
 
 * Preserve interrupt shadow in SMM state save area
 
 * Respond to generic signals during slow page faults
 
 * Fixes and optimizations for the non-executable huge page errata fix.
 
 * Reprogram all performance counters on PMU filter change
 
 * Cleanups to Hyper-V emulation and tests
 
 * Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest
   running on top of a L1 Hyper-V hypervisor)
 
 * Advertise several new Intel features
 
 * x86 Xen-for-KVM:
 
 ** Allow the Xen runstate information to cross a page boundary
 
 ** Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
 
 ** Add support for 32-bit guests in SCHEDOP_poll
 
 * Notable x86 fixes and cleanups:
 
 ** One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
 
 ** Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
    years back when eliminating unnecessary barriers when switching between
    vmcs01 and vmcs02.
 
 ** Clean up vmread_error_trampoline() to make it more obvious that params
    must be passed on the stack, even for x86-64.
 
 ** Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
    of the current guest CPUID.
 
 ** Fudge around a race with TSC refinement that results in KVM incorrectly
    thinking a guest needs TSC scaling when running on a CPU with a
    constant TSC, but no hardware-enumerated TSC frequency.
 
 ** Advertise (on AMD) that the SMM_CTL MSR is not supported
 
 ** Remove unnecessary exports
 
 Generic:
 
 * Support for responding to signals during page faults; introduces
   new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
 
 Selftests:
 
 * Fix an inverted check in the access tracking perf test, and restore
   support for asserting that there aren't too many idle pages when
   running on bare metal.
 
 * Fix build errors that occur in certain setups (unsure exactly what is
   unique about the problematic setup) due to glibc overriding
   static_assert() to a variant that requires a custom message.
 
 * Introduce actual atomics for clear/set_bit() in selftests
 
 * Add support for pinning vCPUs in dirty_log_perf_test.
 
 * Rename the so called "perf_util" framework to "memstress".
 
 * Add a lightweight psuedo RNG for guest use, and use it to randomize
   the access pattern and write vs. read percentage in the memstress tests.
 
 * Add a common ucall implementation; code dedup and pre-work for running
   SEV (and beyond) guests in selftests.
 
 * Provide a common constructor and arch hook, which will eventually be
   used by x86 to automatically select the right hypercall (AMD vs. Intel).
 
 * A bunch of added/enabled/fixed selftests for ARM64, covering memslots,
   breakpoints, stage-2 faults and access tracking.
 
 * x86-specific selftest changes:
 
 ** Clean up x86's page table management.
 
 ** Clean up and enhance the "smaller maxphyaddr" test, and add a related
    test to cover generic emulation failure.
 
 ** Clean up the nEPT support checks.
 
 ** Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
 
 ** Fix an ordering issue in the AMX test introduced by recent conversions
    to use kvm_cpu_has(), and harden the code to guard against similar bugs
    in the future.  Anything that tiggers caching of KVM's supported CPUID,
    kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
    the caching occurs before the test opts in via prctl().
 
 Documentation:
 
 * Remove deleted ioctls from documentation
 
 * Clean up the docs for the x86 MSR filter.
 
 * Various fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt
 KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF
 mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex
 yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii
 Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW
 MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA==
 =QAYX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM64:

   - Enable the per-vcpu dirty-ring tracking mechanism, together with an
     option to keep the good old dirty log around for pages that are
     dirtied by something other than a vcpu.

   - Switch to the relaxed parallel fault handling, using RCU to delay
     page table reclaim and giving better performance under load.

   - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
     option, which multi-process VMMs such as crosvm rely on (see merge
     commit 382b5b87a97d: "Fix a number of issues with MTE, such as
     races on the tags being initialised vs the PG_mte_tagged flag as
     well as the lack of support for VM_SHARED when KVM is involved.
     Patches from Catalin Marinas and Peter Collingbourne").

   - Merge the pKVM shadow vcpu state tracking that allows the
     hypervisor to have its own view of a vcpu, keeping that state
     private.

   - Add support for the PMUv3p5 architecture revision, bringing support
     for 64bit counters on systems that support it, and fix the
     no-quite-compliant CHAIN-ed counter support for the machines that
     actually exist out there.

   - Fix a handful of minor issues around 52bit VA/PA support (64kB
     pages only) as a prefix of the oncoming support for 4kB and 16kB
     pages.

   - Pick a small set of documentation and spelling fixes, because no
     good merge window would be complete without those.

  s390:

   - Second batch of the lazy destroy patches

   - First batch of KVM changes for kernel virtual != physical address
     support

   - Removal of a unused function

  x86:

   - Allow compiling out SMM support

   - Cleanup and documentation of SMM state save area format

   - Preserve interrupt shadow in SMM state save area

   - Respond to generic signals during slow page faults

   - Fixes and optimizations for the non-executable huge page errata
     fix.

   - Reprogram all performance counters on PMU filter change

   - Cleanups to Hyper-V emulation and tests

   - Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
     guest running on top of a L1 Hyper-V hypervisor)

   - Advertise several new Intel features

   - x86 Xen-for-KVM:

      - Allow the Xen runstate information to cross a page boundary

      - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured

      - Add support for 32-bit guests in SCHEDOP_poll

   - Notable x86 fixes and cleanups:

      - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

      - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
        a few years back when eliminating unnecessary barriers when
        switching between vmcs01 and vmcs02.

      - Clean up vmread_error_trampoline() to make it more obvious that
        params must be passed on the stack, even for x86-64.

      - Let userspace set all supported bits in MSR_IA32_FEAT_CTL
        irrespective of the current guest CPUID.

      - Fudge around a race with TSC refinement that results in KVM
        incorrectly thinking a guest needs TSC scaling when running on a
        CPU with a constant TSC, but no hardware-enumerated TSC
        frequency.

      - Advertise (on AMD) that the SMM_CTL MSR is not supported

      - Remove unnecessary exports

  Generic:

   - Support for responding to signals during page faults; introduces
     new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks

  Selftests:

   - Fix an inverted check in the access tracking perf test, and restore
     support for asserting that there aren't too many idle pages when
     running on bare metal.

   - Fix build errors that occur in certain setups (unsure exactly what
     is unique about the problematic setup) due to glibc overriding
     static_assert() to a variant that requires a custom message.

   - Introduce actual atomics for clear/set_bit() in selftests

   - Add support for pinning vCPUs in dirty_log_perf_test.

   - Rename the so called "perf_util" framework to "memstress".

   - Add a lightweight psuedo RNG for guest use, and use it to randomize
     the access pattern and write vs. read percentage in the memstress
     tests.

   - Add a common ucall implementation; code dedup and pre-work for
     running SEV (and beyond) guests in selftests.

   - Provide a common constructor and arch hook, which will eventually
     be used by x86 to automatically select the right hypercall (AMD vs.
     Intel).

   - A bunch of added/enabled/fixed selftests for ARM64, covering
     memslots, breakpoints, stage-2 faults and access tracking.

   - x86-specific selftest changes:

      - Clean up x86's page table management.

      - Clean up and enhance the "smaller maxphyaddr" test, and add a
        related test to cover generic emulation failure.

      - Clean up the nEPT support checks.

      - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.

      - Fix an ordering issue in the AMX test introduced by recent
        conversions to use kvm_cpu_has(), and harden the code to guard
        against similar bugs in the future. Anything that tiggers
        caching of KVM's supported CPUID, kvm_cpu_has() in this case,
        effectively hides opt-in XSAVE features if the caching occurs
        before the test opts in via prctl().

  Documentation:

   - Remove deleted ioctls from documentation

   - Clean up the docs for the x86 MSR filter.

   - Various fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
  KVM: x86: Add proper ReST tables for userspace MSR exits/flags
  KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
  KVM: arm64: selftests: Align VA space allocator with TTBR0
  KVM: arm64: Fix benign bug with incorrect use of VA_BITS
  KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
  KVM: x86: Advertise that the SMM_CTL MSR is not supported
  KVM: x86: remove unnecessary exports
  KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
  tools: KVM: selftests: Convert clear/set_bit() to actual atomics
  tools: Drop "atomic_" prefix from atomic test_and_set_bit()
  tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
  KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
  perf tools: Use dedicated non-atomic clear/set bit helpers
  tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
  KVM: arm64: selftests: Enable single-step without a "full" ucall()
  KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
  KVM: Remove stale comment about KVM_REQ_UNHALT
  KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
  KVM: Reference to kvm_userspace_memory_region in doc and comments
  KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
  ...
2022-12-15 11:12:21 -08:00
Kirill A. Shutemov 80d0969aa7 x86/mm: Fix CR3_ADDR_MASK
The mask must not include bits above physical address mask. These bits
are reserved and can be used for other things. Bits 61 and 62 are used
for Linear Address Masking.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/all/20221109165140.9137-2-kirill.shutemov%40linux.intel.com
2022-12-15 10:37:28 -08:00
Pasha Tatashin 82328227db x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
Other architectures and the common mm/ use P*D_MASK, and P*D_SIZE.
Remove the duplicated P*D_PAGE_MASK and P*D_PAGE_SIZE which are only
used in x86/*.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/20220516185202.604654-1-tatashin@google.com
2022-12-15 10:37:27 -08:00
Peter Zijlstra d4a72e7fe6 x86/mm/pae: Get rid of set_64bit()
Recognise that set_64bit() is a special case of our previously
introduced pxx_xchg64(), so use that and get rid of set_64bit().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221022114425.233481884%40infradead.org
2022-12-15 10:37:27 -08:00
Peter Zijlstra 9ee850acd2 x86_64: Remove pointless set_64bit() usage
The use of set_64bit() in X86_64 only code is pretty pointless, seeing
how it's a direct assignment. Remove all this nonsense.

[nathanchance: unbreak irte]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221022114425.168036718%40infradead.org
2022-12-15 10:37:27 -08:00
Peter Zijlstra b7301f2010 x86/mm/pae: Be consistent with pXXp_get_and_clear()
Given that ptep_get_and_clear() uses cmpxchg8b, and that should be by
far the most common case, there's no point in having an optimized
variant for pmd/pud.

Introduce the pxx_xchg64() helper to implement the common logic once.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221022114425.103392961%40infradead.org
2022-12-15 10:37:27 -08:00
Peter Zijlstra f7bcd4617d x86/mm/pae: Use WRITE_ONCE()
Disallow write-tearing, that would be really unfortunate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221022114425.038102604%40infradead.org
2022-12-15 10:37:27 -08:00
Peter Zijlstra 7a9b8bdb6a x86/mm/pae: Don't (ab)use atomic64
PAE implies CX8, write readable code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221022114424.971450128%40infradead.org
2022-12-15 10:37:27 -08:00
Peter Zijlstra 024d232ae4 mm: Fix pmd_read_atomic()
AFAICT there's no reason to do anything different than what we do for
PTEs. Make it so (also affects SH).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221022114424.711181252%40infradead.org
2022-12-15 10:37:27 -08:00
Peter Zijlstra fbfdec9989 x86/mm/pae: Make pmd_t similar to pte_t
Instead of mucking about with at least 2 different ways of fudging
it, do the same thing we do for pte_t.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221022114424.580310787%40infradead.org
2022-12-15 10:37:27 -08:00
Peter Zijlstra 60463628c9 x86/mm: Implement native set_memory_rox()
Provide a native implementation of set_memory_rox(), avoiding the
double set_memory_ro();set_memory_x(); calls.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2022-12-15 10:37:27 -08:00
Peter Zijlstra 97e3d26b5e x86/mm: Randomize per-cpu entry area
Seth found that the CPU-entry-area; the piece of per-cpu data that is
mapped into the userspace page-tables for kPTI is not subject to any
randomization -- irrespective of kASLR settings.

On x86_64 a whole P4D (512 GB) of virtual address space is reserved for
this structure, which is plenty large enough to randomize things a
little.

As such, use a straight forward randomization scheme that avoids
duplicates to spread the existing CPUs over the available space.

  [ bp: Fix le build. ]

Reported-by: Seth Jenkins <sethjenkins@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-12-15 10:37:26 -08:00
Andrey Ryabinin 3f148f3318 x86/kasan: Map shadow for percpu pages on demand
KASAN maps shadow for the entire CPU-entry-area:
  [CPU_ENTRY_AREA_BASE, CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE]

This will explode once the per-cpu entry areas are randomized since it
will increase CPU_ENTRY_AREA_MAP_SIZE to 512 GB and KASAN fails to
allocate shadow for such big area.

Fix this by allocating KASAN shadow only for really used cpu entry area
addresses mapped by cea_map_percpu_pages()

Thanks to the 0day folks for finding and reporting this to be an issue.

[ dhansen: tweak changelog since this will get committed before peterz's
	   actual cpu-entry-area randomization ]

Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Yujie Liu <yujie.liu@intel.com>
Cc: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/r/202210241508.2e203c3d-yujie.liu@intel.com
2022-12-15 10:37:26 -08:00
Linus Torvalds 94a855111e - Add the call depth tracking mitigation for Retbleed which has
been long in the making. It is a lighterweight software-only fix for
 Skylake-based cores where enabling IBRS is a big hammer and causes a
 significant performance impact.
 
 What it basically does is, it aligns all kernel functions to 16 bytes
 boundary and adds a 16-byte padding before the function, objtool
 collects all functions' locations and when the mitigation gets applied,
 it patches a call accounting thunk which is used to track the call depth
 of the stack at any time.
 
 When that call depth reaches a magical, microarchitecture-specific value
 for the Return Stack Buffer, the code stuffs that RSB and avoids its
 underflow which could otherwise lead to the Intel variant of Retbleed.
 
 This software-only solution brings a lot of the lost performance back,
 as benchmarks suggest:
 
   https://lore.kernel.org/all/20220915111039.092790446@infradead.org/
 
 That page above also contains a lot more detailed explanation of the
 whole mechanism
 
 - Implement a new control flow integrity scheme called FineIBT which is
 based on the software kCFI implementation and uses hardware IBT support
 where present to annotate and track indirect branches using a hash to
 validate them
 
 - Other misc fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOZp5EACgkQEsHwGGHe
 VUrZFxAAvi/+8L0IYSK4mKJvixGbTFjxN/Swo2JVOfs34LqGUT6JaBc+VUMwZxdb
 VMTFIZ3ttkKEodjhxGI7oGev6V8UfhI37SmO2lYKXpQVjXXnMlv/M+Vw3teE38CN
 gopi+xtGnT1IeWQ3tc/Tv18pleJ0mh5HKWiW+9KoqgXj0wgF9x4eRYDz1TDCDA/A
 iaBzs56j8m/FSykZHnrWZ/MvjKNPdGlfJASUCPeTM2dcrXQGJ93+X2hJctzDte0y
 Nuiw6Y0htfFBE7xoJn+sqm5Okr+McoUM18/CCprbgSKYk18iMYm3ZtAi6FUQZS1A
 ua4wQCf49loGp15PO61AS5d3OBf5D3q/WihQRbCaJvTVgPp9sWYnWwtcVUuhMllh
 ZQtBU9REcVJ/22bH09Q9CjBW0VpKpXHveqQdqRDViLJ6v/iI6EFGmD24SW/VxyRd
 73k9MBGrL/dOf1SbEzdsnvcSB3LGzp0Om8o/KzJWOomrVKjBCJy16bwTEsCZEJmP
 i406m92GPXeaN1GhTko7vmF0GnkEdJs1GVCZPluCAxxbhHukyxHnrjlQjI4vC80n
 Ylc0B3Kvitw7LGJsPqu+/jfNHADC/zhx1qz/30wb5cFmFbN1aRdp3pm8JYUkn+l/
 zri2Y6+O89gvE/9/xUhMohzHsWUO7xITiBavewKeTP9GSWybWUs=
 =cRy1
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Borislav Petkov:

 - Add the call depth tracking mitigation for Retbleed which has been
   long in the making. It is a lighterweight software-only fix for
   Skylake-based cores where enabling IBRS is a big hammer and causes a
   significant performance impact.

   What it basically does is, it aligns all kernel functions to 16 bytes
   boundary and adds a 16-byte padding before the function, objtool
   collects all functions' locations and when the mitigation gets
   applied, it patches a call accounting thunk which is used to track
   the call depth of the stack at any time.

   When that call depth reaches a magical, microarchitecture-specific
   value for the Return Stack Buffer, the code stuffs that RSB and
   avoids its underflow which could otherwise lead to the Intel variant
   of Retbleed.

   This software-only solution brings a lot of the lost performance
   back, as benchmarks suggest:

       https://lore.kernel.org/all/20220915111039.092790446@infradead.org/

   That page above also contains a lot more detailed explanation of the
   whole mechanism

 - Implement a new control flow integrity scheme called FineIBT which is
   based on the software kCFI implementation and uses hardware IBT
   support where present to annotate and track indirect branches using a
   hash to validate them

 - Other misc fixes and cleanups

* tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
  x86/paravirt: Use common macro for creating simple asm paravirt functions
  x86/paravirt: Remove clobber bitmask from .parainstructions
  x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al
  x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit
  x86/Kconfig: Enable kernel IBT by default
  x86,pm: Force out-of-line memcpy()
  objtool: Fix weak hole vs prefix symbol
  objtool: Optimize elf_dirty_reloc_sym()
  x86/cfi: Add boot time hash randomization
  x86/cfi: Boot time selection of CFI scheme
  x86/ibt: Implement FineIBT
  objtool: Add --cfi to generate the .cfi_sites section
  x86: Add prefix symbols for function padding
  objtool: Add option to generate prefix symbols
  objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf
  objtool: Slice up elf_create_section_symbol()
  kallsyms: Revert "Take callthunks into account"
  x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces
  x86/retpoline: Fix crash printing warning
  x86/paravirt: Fix a !PARAVIRT build warning
  ...
2022-12-14 15:03:00 -08:00
Linus Torvalds e2ca6ba6ba MM patches for 6.2-rc1.
- More userfaultfs work from Peter Xu.
 
 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying.
 
 - Some filemap cleanups from Vishal Moola.
 
 - David Hildenbrand added the ability to selftest anon memory COW handling.
 
 - Some cpuset simplifications from Liu Shixin.
 
 - Addition of vmalloc tracing support by Uladzislau Rezki.
 
 - Some pagecache folioifications and simplifications from Matthew Wilcox.
 
 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it.
 
 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.  This series shold have been in the
   non-MM tree, my bad.
 
 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages.
 
 - DAMON cleanups and tuneups from SeongJae Park
 
 - Tony Luck fixed the handling of COW faults against poisoned pages.
 
 - Peter Xu utilized the PTE marker code for handling swapin errors.
 
 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient.
 
 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand.
 
 - zram support for multiple compression streams from Sergey Senozhatsky.
 
 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway.
 
 - Mel Gorman altered the page allocator so that local IRQs can remnain
   enabled during per-cpu page allocations.
 
 - Vishal Moola removed the try_to_release_page() wrapper.
 
 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache.
 
 - David Hildenbrand did some cleanup and repair work on KSM COW
   breaking.
 
 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend.
 
 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range().
 
 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
   Chen.
 
 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests.  Better, but still not perfect.
 
 - Christoph Hellwig has removed the .writepage() method from several
   filesystems.  They only need .writepages().
 
 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting.
 
 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines.
 
 - Many singleton patches, as usual.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA
 jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml
 CodAgiA51qwzId3GRytIo/tfWZSezgA=
 =d19R
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - More userfaultfs work from Peter Xu

 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying

 - Some filemap cleanups from Vishal Moola

 - David Hildenbrand added the ability to selftest anon memory COW
   handling

 - Some cpuset simplifications from Liu Shixin

 - Addition of vmalloc tracing support by Uladzislau Rezki

 - Some pagecache folioifications and simplifications from Matthew
   Wilcox

 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use
   it

 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.

   This series should have been in the non-MM tree, my bad

 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages

 - DAMON cleanups and tuneups from SeongJae Park

 - Tony Luck fixed the handling of COW faults against poisoned pages

 - Peter Xu utilized the PTE marker code for handling swapin errors

 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient

 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand

 - zram support for multiple compression streams from Sergey Senozhatsky

 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway

 - Mel Gorman altered the page allocator so that local IRQs can remnain
   enabled during per-cpu page allocations

 - Vishal Moola removed the try_to_release_page() wrapper

 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache

 - David Hildenbrand did some cleanup and repair work on KSM COW
   breaking

 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend

 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range()

 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
   Chen

 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests. Better, but still not perfect

 - Christoph Hellwig has removed the .writepage() method from several
   filesystems. They only need .writepages()

 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting

 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines

 - Many singleton patches, as usual

* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
  mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
  mm: mmu_gather: allow more than one batch of delayed rmaps
  mm: fix typo in struct pglist_data code comment
  kmsan: fix memcpy tests
  mm: add cond_resched() in swapin_walk_pmd_entry()
  mm: do not show fs mm pc for VM_LOCKONFAULT pages
  selftests/vm: ksm_functional_tests: fixes for 32bit
  selftests/vm: cow: fix compile warning on 32bit
  selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
  mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
  mm,thp,rmap: fix races between updates of subpages_mapcount
  mm: memcg: fix swapcached stat accounting
  mm: add nodes= arg to memory.reclaim
  mm: disable top-tier fallback to reclaim on proactive reclaim
  selftests: cgroup: make sure reclaim target memcg is unprotected
  selftests: cgroup: refactor proactive reclaim code to reclaim_until()
  mm: memcg: fix stale protection of reclaim target memcg
  mm/mmap: properly unaccount memory on mas_preallocate() failure
  omfs: remove ->writepage
  jfs: remove ->writepage
  ...
2022-12-13 19:29:45 -08:00
Linus Torvalds 82c72902c5 - Simplify paravirt patching machinery by removing the now unused
clobber mask
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYkNQACgkQEsHwGGHe
 VUqflhAAnZBE/UzP499WcdjvnnQWLORfjUzAyU157U/08iMCx5qF/Qw73SrQwjIX
 XPWiqi0gvLS/2eDK02UylBF5reNUuATNO5wavGb0ORUKmDytcklZSzoygunByvrG
 fco2DXw99qQ2xhZ4fU7UbVoEek7P7nxM33dMfZjeiGldPSiRKV8pmtTG4Gmq5Ihl
 gEI1MLKw75oTg3t/VJqPoENv+wIBH/i2Q+oFr7E9cUoO/3T94OsKgeuCf0f+tINY
 YQtiFYFQGLzMEQlPnGBj+2S7usocAs80RCjvRSMMXhuC6hmoGBibw2/u25s+QmY/
 yHYsmxQ+a4dEFiv/8jBAxJew/eQwiz0LSTteVS8/2cV4lhiNJHb8L2AV+xHms8Az
 fPTYGEX4GbykZPCemVI/y3saVbPOidGIoOUvOqML8LeQYURntgh1KleTmF5UweR+
 ZMwEbXOS0Ni7wsB1Tfj/A21eGOG8dFzhszZBz8QjhtQHgckeF/JtAsNWSVRdhRyz
 Z0Cm1+XPftNx2eus6T7i6y9Kj8QSKszsDs+oLFvhiO8BkeJhLn9w3IQXCsDjVeZU
 y0Qx3VakL4MK3Im+Ah/ZM6IQQuGR3ld8x1jQJkp3FAylmznqSjYl1MoUA5MZwnwv
 M3yNGcLFO5uIjBfvFwBQXH28FkSS7tL9OfmtDyrEk2ITILmGqBU=
 =LcBv
 -----END PGP SIGNATURE-----

Merge tag 'x86_paravirt_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 paravirt update from Borislav Petkov:

 - Simplify paravirt patching machinery by removing the now unused
   clobber mask

* tag 'x86_paravirt_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/paravirt: Remove clobber bitmask from .parainstructions
2022-12-13 15:09:40 -08:00
Linus Torvalds a70210f415 - Add support for multiple testing sequences to the Intel In-Field Scan
driver in order to be able to run multiple different test patterns.
 Rework things and remove the BROKEN dependency so that the driver can be
 enabled (Jithu Joseph)
 
 - Remove the subsys interface usage in the microcode loader because it
 is not really needed
 
 - A couple of smaller fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYjh8ACgkQEsHwGGHe
 VUpu8xAAhY7ywLcAoG9p3AaGiXpryFwnXFBah13o1rkgkJGRaG/eVjPJ4KUUjOQs
 Wo3WUHeeHwmFWq+F/OSRefNsptOLBQ3u/cSza9TDDjPoS3glO5cIFc34JqIItMTg
 L1GMB4LfmD1+9lYpM6Td11/Dluqf7EjeEdF4qDmCRZ5i4YNsaAlM4HtgATavNkYc
 6Bvsi1r7tv7tCNDAEYqEfsQLoc79Yca4W5s86HNIyrxtyk9RLrK75WvRkcpTSnK9
 SEpgpYwZy4iRTtZmePC7BqqbHfV6NoeuRqIMR73FrNK9pQuauGFMPkIx08Sgl3BW
 /YGpefleGBHhy6Dqa6rEPsYS9xHfhqYAde09zzECJWW4VSI0PuFKyfm67ep2O7q6
 zbV2DjxEZ+8kWeO9cDJPedEd8pXC8Ua7H+KNl00npdfNlkBaVR9ZRjX7ZVoiFMi8
 6SRmCr1MLngldSMkUr6cYiLpoXmRzM+7gnKhVzhO6yNa0eihYBAIZ5lei0n9Q01W
 Soxvec2KKeSZraNLoQH0MSndEJY4sqx6lPjlXgFT6gGHzgfQZTg+9INdaPK9gbI7
 tg5j1e0/1UyvWrxYxOdzThtRY1X7Y1QtdpQDcatkVOgR1uZi1CTDx1dxTrHP5jbZ
 7MSKn/8/T61beG6ujjif+pC8kOwNISLNDBBZGNzeLRyx8t9/6jQ=
 =Z2Nu
 -----END PGP SIGNATURE-----

Merge tag 'x86_microcode_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 microcode and IFS updates from Borislav Petkov:
 "The IFS (In-Field Scan) stuff goes through tip because the IFS driver
  uses the same structures and similar functionality as the microcode
  loader and it made sense to route it all through this branch so that
  there are no conflicts.

   - Add support for multiple testing sequences to the Intel In-Field
     Scan driver in order to be able to run multiple different test
     patterns. Rework things and remove the BROKEN dependency so that
     the driver can be enabled (Jithu Joseph)

   - Remove the subsys interface usage in the microcode loader because
     it is not really needed

   - A couple of smaller fixes and cleanups"

* tag 'x86_microcode_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/microcode/intel: Do not retry microcode reloading on the APs
  x86/microcode/intel: Do not print microcode revision and processor flags
  platform/x86/intel/ifs: Add missing kernel-doc entry
  Revert "platform/x86/intel/ifs: Mark as BROKEN"
  Documentation/ABI: Update IFS ABI doc
  platform/x86/intel/ifs: Add current_batch sysfs entry
  platform/x86/intel/ifs: Remove reload sysfs entry
  platform/x86/intel/ifs: Add metadata validation
  platform/x86/intel/ifs: Use generic microcode headers and functions
  platform/x86/intel/ifs: Add metadata support
  x86/microcode/intel: Use a reserved field for metasize
  x86/microcode/intel: Add hdr_type to intel_microcode_sanity_check()
  x86/microcode/intel: Reuse microcode_sanity_check()
  x86/microcode/intel: Use appropriate type in microcode_sanity_check()
  x86/microcode/intel: Reuse find_matching_signature()
  platform/x86/intel/ifs: Remove memory allocation from load path
  platform/x86/intel/ifs: Remove image loading during init
  platform/x86/intel/ifs: Return a more appropriate error code
  platform/x86/intel/ifs: Remove unused selection
  x86/microcode: Drop struct ucode_cpu_info.valid
  ...
2022-12-13 15:05:29 -08:00
Linus Torvalds 3ef3ace4e2 - Split MTRR and PAT init code to accomodate at least Xen PV and TDX
guests which do not get MTRRs exposed but only PAT. (TDX guests do not
 support the cache disabling dance when setting up MTRRs so they fall
 under the same category.) This is a cleanup work to remove all the ugly
 workarounds for such guests and init things separately (Juergen Gross)
 
 - Add two new Intel CPUs to the list of CPUs with "normal" Energy
 Performance Bias, leading to power savings
 
 - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern Centaur
 CPUs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYhIMACgkQEsHwGGHe
 VUpxug//ZKw3hYFroKhsULJi/e0j2nGARiSlJrJcFHl2vgh9yGvDsnYUyM/rgjgt
 cM3uCLbEG7nA6uhB3nupzaXZ8lBM1nU9kiEl/kjQ5oYf9nmJ48fLttvWGfxYN4s3
 kj5fYVhlOZpntQXIWrwxnPqghUysumMnZmBJeKYiYNNfkj62l3xU2Ni4Gnjnp02I
 9MmUhl7pj1aEyOQfM8rovy+wtYCg5WTOmXVlyVN+b9MwfYeK+stojvCZHxtJs9BD
 fezpJjjG+78xKUC7vVZXCh1p1N5Qvj014XJkVl9Hg0n7qizKFZRtqi8I769G2ptd
 exP8c2nDXKCqYzE8vK6ukWgDANQPs3d6Z7EqUKuXOCBF81PnMPSUMyNtQFGNM6Wp
 S5YSvFfCgUjp50IunOpvkDABgpM+PB8qeWUq72UFQJSOymzRJg/KXtE2X+qaMwtC
 0i6VLXfMddGcmqNKDppfGtCjq2W5VrNIIJedtAQQGyl+pl3XzZeNomhJpm/0mVfJ
 8UrlXZeXl/EUQ7qk40gC/Ash27pU9ZDx4CMNMy1jDIQqgufBjEoRIDSFqQlghmZq
 An5/BqMLhOMxUYNA7bRUnyeyxCBypetMdQt5ikBmVXebvBDmArXcuSNAdiy1uBFX
 KD8P3Y1AnsHIklxkLNyZRUy7fb4mgMFenUbgc0vmbYHbFl0C0pQ=
 =Zmgh
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

 - Split MTRR and PAT init code to accomodate at least Xen PV and TDX
   guests which do not get MTRRs exposed but only PAT. (TDX guests do
   not support the cache disabling dance when setting up MTRRs so they
   fall under the same category)

   This is a cleanup work to remove all the ugly workarounds for such
   guests and init things separately (Juergen Gross)

 - Add two new Intel CPUs to the list of CPUs with "normal" Energy
   Performance Bias, leading to power savings

 - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern
   Centaur CPUs

* tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  x86/mtrr: Make message for disabled MTRRs more descriptive
  x86/pat: Handle TDX guest PAT initialization
  x86/cpuid: Carve out all CPUID functionality
  x86/cpu: Switch to cpu_feature_enabled() for X86_FEATURE_XENPV
  x86/cpu: Remove X86_FEATURE_XENPV usage in setup_cpu_entry_area()
  x86/cpu: Drop 32-bit Xen PV guest code in update_task_stack()
  x86/cpu: Remove unneeded 64-bit dependency in arch_enter_from_user_mode()
  x86/cpufeatures: Add X86_FEATURE_XENPV to disabled-features.h
  x86/acpi/cstate: Optimize ARB_DISABLE on Centaur CPUs
  x86/mtrr: Simplify mtrr_ops initialization
  x86/cacheinfo: Switch cache_ap_init() to hotplug callback
  x86: Decouple PAT and MTRR handling
  x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()
  x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_init
  x86/mtrr: Get rid of __mtrr_enabled bool
  x86/mtrr: Simplify mtrr_bp_init()
  x86/mtrr: Remove set_all callback from struct mtrr_ops
  x86/mtrr: Disentangle MTRR init from PAT init
  x86/mtrr: Move cache control code to cacheinfo.c
  x86/mtrr: Split MTRR-specific handling from cache dis/enabling
  ...
2022-12-13 14:56:56 -08:00
Linus Torvalds 4eb77fa102 - Do some spring cleaning to the compressed boot code by moving the
EFI mixed-mode code to a separate compilation unit, the AMD memory
 encryption early code where it belongs and fixing up build dependencies.
 Make the deprecated EFI handover protocol optional with the goal of
 removing it at some point (Ard Biesheuvel)
 
 - Skip realmode init code on Xen PV guests as it is not needed there
 
 - Remove an old 32-bit PIC code compiler workaround
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYaiMACgkQEsHwGGHe
 VUrNVhAAk3lLagEsrBcQ24SnMMAyQvdKfRucn9fbs72jBCyWbDqXcE59qNgdbMS1
 3rIL+EJdF8jlm5K28GjRS1WSvwUyYbyFEfUcYfqZl9L/5PAl7PlG7nNQw7/gXnw+
 xS57w/Q3cONlo5LC0K2Zkbj/59RvDoBEs3nkhozkKR0npTDW/LK3Vl0zgKTkvqsV
 DzRIHhWsqSEvpdowbQmQCyqFh/pOoQlZkQwjYVA9+SaQYdH3Yo1dpLd5i9I9eVmJ
 dci/HDU+plwYYuZ1XhxwXr82PcdCUVYjJ/DTt9GkTVYq7u5EWx62puxTl+c+wbG2
 H1WBXuZHBGdzNMFdnb1k9RuLCaYdaxKTNlZh3FPMMDtkjtjKTl/olXTlFUYFgI6E
 FPv4hi15g6pMveS3K6YUAd0uGvpsjvLUZHPqMDVS2trhxLENQALc6Id/PwqzrQ1T
 FzfPYcDyFFwMM3MDuWc8ClwEDD9wr0Z4m4Aek/ca2r85AKEX8ZtTTlWZoI4E9A4B
 hEjUFnRhT/d6XLWwZqcOIKfwtbpKAjdsCN3ElFst8ogRFAXqW8luDoI4BRCkBC4p
 T4RHdij4afkuFjSAxBacazpaavtcCsDqXwBpeL4YN+4fA7+NokVZGiQVh/3S8BPn
 LlgIf6awFq6yQq7JyEGPdk+dWn5sknldixZ55m666ZLzSvQhvE8=
 =VGZx
 -----END PGP SIGNATURE-----

Merge tag 'x86_boot_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot updates from Borislav Petkov:
 "A  of early boot cleanups and fixes.

   - Do some spring cleaning to the compressed boot code by moving the
     EFI mixed-mode code to a separate compilation unit, the AMD memory
     encryption early code where it belongs and fixing up build
     dependencies. Make the deprecated EFI handover protocol optional
     with the goal of removing it at some point (Ard Biesheuvel)

   - Skip realmode init code on Xen PV guests as it is not needed there

   - Remove an old 32-bit PIC code compiler workaround"

* tag 'x86_boot_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Remove x86_32 PIC using %ebx workaround
  x86/boot: Skip realmode init code when running as Xen PV guest
  x86/efi: Make the deprecated EFI handover protocol optional
  x86/boot/compressed: Only build mem_encrypt.S if AMD_MEM_ENCRYPT=y
  x86/boot/compressed: Adhere to calling convention in get_sev_encryption_bit()
  x86/boot/compressed: Move startup32_check_sev_cbit() out of head_64.S
  x86/boot/compressed: Move startup32_check_sev_cbit() into .text
  x86/boot/compressed: Move startup32_load_idt() out of head_64.S
  x86/boot/compressed: Move startup32_load_idt() into .text section
  x86/boot/compressed: Pull global variable reference into startup32_load_idt()
  x86/boot/compressed: Avoid touching ECX in startup32_set_idt_entry()
  x86/boot/compressed: Simplify IDT/GDT preserve/restore in the EFI thunk
  x86/boot/compressed, efi: Merge multiple definitions of image_offset into one
  x86/boot/compressed: Move efi32_pe_entry() out of head_64.S
  x86/boot/compressed: Move efi32_entry out of head_64.S
  x86/boot/compressed: Move efi32_pe_entry into .text section
  x86/boot/compressed: Move bootargs parsing out of 32-bit startup code
  x86/boot/compressed: Move 32-bit entrypoint code into .text section
  x86/boot/compressed: Rename efi_thunk_64.S to efi-mixed.S
2022-12-13 14:45:29 -08:00
Linus Torvalds fc4c9f4504 EFI updates for v6.2:
- Refactor the zboot code so that it incorporates all the EFI stub
   logic, rather than calling the decompressed kernel as a EFI app.
 - Add support for initrd= command line option to x86 mixed mode.
 - Allow initrd= to be used with arbitrary EFI accessible file systems
   instead of just the one the kernel itself was loaded from.
 - Move some x86-only handling and manipulation of the EFI memory map
   into arch/x86, as it is not used anywhere else.
 - More flexible handling of any random seeds provided by the boot
   environment (i.e., systemd-boot) so that it becomes available much
   earlier during the boot.
 - Allow improved arch-agnostic EFI support in loaders, by setting a
   uniform baseline of supported features, and adding a generic magic
   number to the DOS/PE header. This should allow loaders such as GRUB or
   systemd-boot to reduce the amount of arch-specific handling
   substantially.
 - (arm64) Run EFI runtime services from a dedicated stack, and use it to
   recover from synchronous exceptions that might occur in the firmware
   code.
 - (arm64) Ensure that we don't allocate memory outside of the 48-bit
   addressable physical range.
 - Make EFI pstore record size configurable
 - Add support for decoding CXL specific CPER records
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmOTQ1cACgkQw08iOZLZ
 jyQRkAv+LqaZFWeVwhAQHiw/N3RnRM0nZHea6++D2p1y/ZbCpwv3pdLl2YHQ1KmW
 wDG9Nr4C1ITLtfy1YZKeYpwloQtq9S1GZDWnFpVv/hdo7L924eRAwIlxowWn1OnP
 ruxv2PaYXyb0plh1YD1f6E1BqrfUOtajET55Kxs9ZsxmnMtDpIX3NiYy4LKMBIZC
 +Eywt41M3uBX+wgmSujFBMVVJjhOX60WhUYXqy0RXwDKOyrz/oW5td+eotSCreB6
 FVbjvwQvUdtzn4s1FayOMlTrkxxLw4vLhsaUGAdDOHd3rg3sZT9Xh1HqFFD6nss6
 ZAzAYQ6BzdiV/5WSB9meJe+BeG1hjTNKjJI6JPO2lctzYJqlnJJzI6JzBuH9vzQ0
 dffLB8NITeEW2rphIh+q+PAKFFNbXWkJtV4BMRpqmzZ/w7HwupZbUXAzbWE8/5km
 qlFpr0kmq8GlVcbXNOFjmnQVrJ8jPYn+O3AwmEiVAXKZJOsMH0sjlXHKsonme9oV
 Sk71c6Em
 =JEXz
 -----END PGP SIGNATURE-----

Merge tag 'efi-next-for-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI updates from Ard Biesheuvel:
 "Another fairly sizable pull request, by EFI subsystem standards.

  Most of the work was done by me, some of it in collaboration with the
  distro and bootloader folks (GRUB, systemd-boot), where the main focus
  has been on removing pointless per-arch differences in the way EFI
  boots a Linux kernel.

   - Refactor the zboot code so that it incorporates all the EFI stub
     logic, rather than calling the decompressed kernel as a EFI app.

   - Add support for initrd= command line option to x86 mixed mode.

   - Allow initrd= to be used with arbitrary EFI accessible file systems
     instead of just the one the kernel itself was loaded from.

   - Move some x86-only handling and manipulation of the EFI memory map
     into arch/x86, as it is not used anywhere else.

   - More flexible handling of any random seeds provided by the boot
     environment (i.e., systemd-boot) so that it becomes available much
     earlier during the boot.

   - Allow improved arch-agnostic EFI support in loaders, by setting a
     uniform baseline of supported features, and adding a generic magic
     number to the DOS/PE header. This should allow loaders such as GRUB
     or systemd-boot to reduce the amount of arch-specific handling
     substantially.

   - (arm64) Run EFI runtime services from a dedicated stack, and use it
     to recover from synchronous exceptions that might occur in the
     firmware code.

   - (arm64) Ensure that we don't allocate memory outside of the 48-bit
     addressable physical range.

   - Make EFI pstore record size configurable

   - Add support for decoding CXL specific CPER records"

* tag 'efi-next-for-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (43 commits)
  arm64: efi: Recover from synchronous exceptions occurring in firmware
  arm64: efi: Execute runtime services from a dedicated stack
  arm64: efi: Limit allocations to 48-bit addressable physical region
  efi: Put Linux specific magic number in the DOS header
  efi: libstub: Always enable initrd command line loader and bump version
  efi: stub: use random seed from EFI variable
  efi: vars: prohibit reading random seed variables
  efi: random: combine bootloader provided RNG seed with RNG protocol output
  efi/cper, cxl: Decode CXL Error Log
  efi/cper, cxl: Decode CXL Protocol Error Section
  efi: libstub: fix efi_load_initrd_dev_path() kernel-doc comment
  efi: x86: Move EFI runtime map sysfs code to arch/x86
  efi: runtime-maps: Clarify purpose and enable by default for kexec
  efi: pstore: Add module parameter for setting the record size
  efi: xen: Set EFI_PARAVIRT for Xen dom0 boot on all architectures
  efi: memmap: Move manipulation routines into x86 arch tree
  efi: memmap: Move EFI fake memmap support into x86 arch tree
  efi: libstub: Undeprecate the command line initrd loader
  efi: libstub: Add mixed mode support to command line initrd loader
  efi: libstub: Permit mixed mode return types other than efi_status_t
  ...
2022-12-13 14:31:47 -08:00
Linus Torvalds 405b2fc663 Unification of regset and non-regset sides of ELF coredump
handling.  Collecting per-thread register values is the
 only thing that needs to be ifdefed there...
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZyNgAKCRBZ7Krx/gZQ
 63MZAQDZDE9Pk9EQ/3qOPNb2cuz8KSB3THUyotvustUUGPTUVAD/Ut1xD03jpWCY
 oQ6tM8dNyh3+Vsx6/XKNd1+pj6IgNQE=
 =XYkZ
 -----END PGP SIGNATURE-----

Merge tag 'pull-elfcore' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull elf coredumping updates from Al Viro:
 "Unification of regset and non-regset sides of ELF coredump handling.

  Collecting per-thread register values is the only thing that needs to
  be ifdefed there..."

* tag 'pull-elfcore' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [elf] get rid of get_note_info_size()
  [elf] unify regset and non-regset cases
  [elf][non-regset] use elf_core_copy_task_regs() for dumper as well
  [elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs argument)
  elf_core_copy_task_regs(): task_pt_regs is defined everywhere
  [elf][regset] simplify thread list handling in fill_note_info()
  [elf][regset] clean fill_note_info() a bit
  kill extern of vsyscall32_sysctl
  kill coredump_params->regs
  kill signal_pt_regs()
2022-12-12 18:18:34 -08:00
Linus Torvalds 268325bda5 Random number generator updates for Linux 6.2-rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
 A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
 bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
 RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
 WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
 waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
 ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
 /ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
 PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
 RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
 CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
 APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
 =QRhK
 -----END PGP SIGNATURE-----

Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull random number generator updates from Jason Donenfeld:

 - Replace prandom_u32_max() and various open-coded variants of it,
   there is now a new family of functions that uses fast rejection
   sampling to choose properly uniformly random numbers within an
   interval:

       get_random_u32_below(ceil) - [0, ceil)
       get_random_u32_above(floor) - (floor, U32_MAX]
       get_random_u32_inclusive(floor, ceil) - [floor, ceil]

   Coccinelle was used to convert all current users of
   prandom_u32_max(), as well as many open-coded patterns, resulting in
   improvements throughout the tree.

   I'll have a "late" 6.1-rc1 pull for you that removes the now unused
   prandom_u32_max() function, just in case any other trees add a new
   use case of it that needs to converted. According to linux-next,
   there may be two trivial cases of prandom_u32_max() reintroductions
   that are fixable with a 's/.../.../'. So I'll have for you a final
   conversion patch doing that alongside the removal patch during the
   second week.

   This is a treewide change that touches many files throughout.

 - More consistent use of get_random_canary().

 - Updates to comments, documentation, tests, headers, and
   simplification in configuration.

 - The arch_get_random*_early() abstraction was only used by arm64 and
   wasn't entirely useful, so this has been replaced by code that works
   in all relevant contexts.

 - The kernel will use and manage random seeds in non-volatile EFI
   variables, refreshing a variable with a fresh seed when the RNG is
   initialized. The RNG GUID namespace is then hidden from efivarfs to
   prevent accidental leakage.

   These changes are split into random.c infrastructure code used in the
   EFI subsystem, in this pull request, and related support inside of
   EFISTUB, in Ard's EFI tree. These are co-dependent for full
   functionality, but the order of merging doesn't matter.

 - Part of the infrastructure added for the EFI support is also used for
   an improvement to the way vsprintf initializes its siphash key,
   replacing an sleep loop wart.

 - The hardware RNG framework now always calls its correct random.c
   input function, add_hwgenerator_randomness(), rather than sometimes
   going through helpers better suited for other cases.

 - The add_latent_entropy() function has long been called from the fork
   handler, but is a no-op when the latent entropy gcc plugin isn't
   used, which is fine for the purposes of latent entropy.

   But it was missing out on the cycle counter that was also being mixed
   in beside the latent entropy variable. So now, if the latent entropy
   gcc plugin isn't enabled, add_latent_entropy() will expand to a call
   to add_device_randomness(NULL, 0), which adds a cycle counter,
   without the absent latent entropy variable.

 - The RNG is now reseeded from a delayed worker, rather than on demand
   when used. Always running from a worker allows it to make use of the
   CPU RNG on platforms like S390x, whose instructions are too slow to
   do so from interrupts. It also has the effect of adding in new inputs
   more frequently with more regularity, amounting to a long term
   transcript of random values. Plus, it helps a bit with the upcoming
   vDSO implementation (which isn't yet ready for 6.2).

 - The jitter entropy algorithm now tries to execute on many different
   CPUs, round-robining, in hopes of hitting even more memory latencies
   and other unpredictable effects. It also will mix in a cycle counter
   when the entropy timer fires, in addition to being mixed in from the
   main loop, to account more explicitly for fluctuations in that timer
   firing. And the state it touches is now kept within the same cache
   line, so that it's assured that the different execution contexts will
   cause latencies.

* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
  random: include <linux/once.h> in the right header
  random: align entropy_timer_state to cache line
  random: mix in cycle counter when jitter timer fires
  random: spread out jitter callback to different CPUs
  random: remove extraneous period and add a missing one in comments
  efi: random: refresh non-volatile random seed when RNG is initialized
  vsprintf: initialize siphash key using notifier
  random: add back async readiness notifier
  random: reseed in delayed work rather than on-demand
  random: always mix cycle counter in add_latent_entropy()
  hw_random: use add_hwgenerator_randomness() for early entropy
  random: modernize documentation comment on get_random_bytes()
  random: adjust comment to account for removed function
  random: remove early archrandom abstraction
  random: use random.trust_{bootloader,cpu} command line option only
  stackprotector: actually use get_random_canary()
  stackprotector: move get_random_canary() into stackprotector.h
  treewide: use get_random_u32_inclusive() when possible
  treewide: use get_random_u32_{above,below}() instead of manual loop
  treewide: use get_random_u32_below() instead of deprecated function
  ...
2022-12-12 16:22:22 -08:00
Linus Torvalds 287f037db5 Minor cleanups:
* Remove unnecessary arch_has_empty_bitmaps structure memory
  * Move rescrtl MSR defines into msr-index.h, like normal MSRs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOXYhsACgkQaDWVMHDJ
 krA75w//XmOC929XGMOY7WQL6IZlH62xsJbtb3BhmM24Ho7RHSNQGPD+ukArCb0u
 V/w50Q4crQrLsIxqWjXkyDQ7w66PvvsAIhYFBEV4kssRli9y173CzJQt/lQfUXL9
 T7vG5WY1n4f+vtvmZfwcFaGOPkZ5edp8v1y8Grk3r93ci2VDSk+yvEiq80c+JQoX
 ZnEYPxGPUpwAVuaysY8wkGCEc4Yln6gtTKzpVPXE18WAs82OeiCWBfldI/+95j3o
 /5r5asYQpD8bVhtLHi1mepkBAGbeVNWhSJVlOE9HdU9WnzCkNKn1ZXRuXSBlvTeq
 FPjg6vsBXuz8zQV4Dd3Jk3hWv3H/4sTWsgiyUFdHtz/VlE9M8NjGcE4caOgSuBqR
 2ovI/HwdvdYyiZwvNN0fXrnzEn1MliSXDgAscNuxzovJXqdTP2BpUj0SVlZdVs0U
 0xba5sZ5A6fh2SwKX7JQYYsEh4gudiixR+D2l5u7EUOiNyfw0DZgWi/ElpvX4ncy
 QvDIIqlm29A/VkJQAdSHJc0ew+w39M7f3VNfQviLXxudGFuhrg+kXlI1UYGcX/cH
 4LEjmE1KCymmq7v+7+zBrHwsVCxr5mi/CZnx+/4Y/2O+xOKJ1U7GQDXWzu/SC+aF
 tEwqDCldYKjqrfdkmuGXSt2YipkNOC2EBLY32mW7rtTDDIXPSro=
 =n2UE
 -----END PGP SIGNATURE-----

Merge tag 'x86_cache_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cache resource control updates from Dave Hansen:
 "These declare the resource control (rectrl) MSRs a bit more normally
  and clean up an unnecessary structure member:

   - Remove unnecessary arch_has_empty_bitmaps structure memory

   - Move rescrtl MSR defines into msr-index.h, like normal MSRs"

* tag 'x86_cache_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Move MSR defines into msr-index.h
  x86/resctrl: Remove arch_has_empty_bitmaps
2022-12-12 14:30:54 -08:00
Linus Torvalds a89ef2aa55 Add TDX guest attestation infrastructure and driver
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOXYmwACgkQaDWVMHDJ
 krD8hg/+J0hUTfljmlCctwGZyqVR3Y2E722wL9oTvbgYiUAtFrARzfPF0WNwvHi5
 Ywvod5hQ4unPoluthdVAD/uJqcPVhjIZ7CvNTGrS8J7ED5x5ydGLNWAL3Rn+9s6O
 xkz/DsV4zl+cPQ60XLsO+3Mc6RhwVs9DUthpUovl22epmgmRPCovkHWkvQsZajJq
 ceF/78ThfrkG4dDouaIXi1gsmKLLzU4KdHeBATMg0bgPQXFJZSGBCLaeJXWmLapq
 7N3SznUqDMn4Plr/IuP4XuMA6VTVojrakCcBmw5SGVqhkVWGM1/FMg7jHSQS7Z5V
 5uG7CkhTBqh17v9xKwDMPh34D51TLtNifA7jbecyL5155czFkj7BoSwEFINU/wCz
 agUO9NvK9j1chUnA2UGqGQigM3nWGZHMwaQjfgBWyq5gqF8HURUUrjx6XuunOfmB
 1byyrDu0g48u/zaQ/RpNfewz1ZY+WylDPcqOhYaVWF1PYThStML/VMBKpdsl1Ovw
 nytUdQsaBIjFHQdB+snizaF93+/0FG+FTGAlDnHYmey/8plL2LYuzrcDnDYnGEXa
 tN3HFd2lAi4JBLmvmgF39gH+BLXuKTLweIhwTXZTn91cfire3yxiXAnLd0tuptMP
 aXFddxKMdMpxTqzy2X+8gJjqCr2lZ9gZkxaPsWwrBM+xrJf0p2w=
 =JGnq
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 tdx updates from Dave Hansen:
 "This includes a single chunk of new functionality for TDX guests which
  allows them to talk to the trusted TDX module software and obtain an
  attestation report.

  This report can then be used to prove the trustworthiness of the guest
  to a third party and get access to things like storage encryption
  keys"

* tag 'x86_tdx_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests/tdx: Test TDX attestation GetReport support
  virt: Add TDX guest driver
  x86/tdx: Add a wrapper to get TDREPORT0 from the TDX Module
2022-12-12 14:27:49 -08:00
Linus Torvalds 2da68a77b9 * Introduce a new SGX feature (Asynchrounous Exit Notification)
for bare-metal enclaves and KVM guests to mitigate single-step
    attacks
  * Increase batching to speed up enclave release
  * Replace kmap/kunmap_atomic() calls
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOXYkEACgkQaDWVMHDJ
 krB5Og//Vn0oy0pGhda+LtHJgpa9/qPlzvoZCBxi/6SfLneadE5/g/q2KHbiCgVf
 sQ6SEZ0MiVc2SrQcA6CntMO+stJIHG4LqYutygfKDoxXHGzxotzvzTmRV7Qxfhj5
 LrPfl4cLWVO/jGDs0XQpOVFykKgdMcg1OjlnQYfriFiIiBkcClC7F0zYrOWAQWW0
 z+4h3mlWzyAcBdxrZ9qPVqBMbM3qVKQWeE4D9K2Edfgx1lhQBmvtRdYXTplk08tV
 DrfEkG5L189lrwlmbkKT5+pXSTmJqJzBoYyAGOH8n4Wb9aKLdagJErVg0ocXx8uV
 ngPFU5vmaZza7EZcQheu8iRfM+zQCrcVjBImrRLyQPgCeMBX7o75axYvu4/bvPkP
 3+1/JUL6/m738Fqom4wUKdeoJFw/HLGRyQ36yhZAEzH7wPv7/9Q1zpdxcypE6a+Q
 B7UGQNVXV9g5Ivhe44gZIKx/3VL7AthtyCQvhwGQzzm4jX2SwnQKNXy0iKlJr2iI
 LyREdYlJsRR1/wMdjnj2QqtnWPRZ5/rzl7bvWqiXa4xyvcgArrBowjMdZBttaItJ
 cVK5Aj2bvR3Yc/e9GtPoLvwU5IwtoXgUe1B4DsJtoFoUq7gUGZZcEd5uAYRAk7PX
 lyP2LQNxX5i150cxjlSYLLLTNmwvZQ+5PFq+V5+McKbAge8OD8g=
 =bIXL
 -----END PGP SIGNATURE-----

Merge tag 'x86_sgx_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 sgx updates from Dave Hansen:
 "The biggest deal in this series is support for a new hardware feature
  that allows enclaves to detect and mitigate single-stepping attacks.

  There's also a minor performance tweak and a little piece of the
  kmap_atomic() -> kmap_local() transition.

  Summary:

   - Introduce a new SGX feature (Asynchrounous Exit Notification) for
     bare-metal enclaves and KVM guests to mitigate single-step attacks

   - Increase batching to speed up enclave release

   - Replace kmap/kunmap_atomic() calls"

* tag 'x86_sgx_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Replace kmap/kunmap_atomic() calls
  KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest
  x86/sgx: Allow enclaves to use Asynchrounous Exit Notification
  x86/sgx: Reduce delay and interference of enclave release
2022-12-12 14:18:44 -08:00
Linus Torvalds 631aa74442 Updates for miscellaneous x86 areas:
- Reserve a new boot loader type for barebox which is usally used on ARM
     and MIPS, but can also be utilized as EFI payload on x86 to provide
     watchdog-supervised boot up.
 
   - Consolidate the native and compat 32bit signal handling code and split
     the 64bit version out into a separate source file
 
   - Switch the ESPFIX random usage to get_random_long().
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUvMQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoQmmD/9xVeaZbBInehnzbsZi4C4WyOMGUg4l
 AoZC0QSzp2hFZRwpbu4Df1Zh2VN5nItAhQUvNLfdZv9/GL5VkhO+J5fPEHUbtnQ8
 34TujaTHAssyib8uRFTAxxGSz3S2jPRrzUloZ71M+Whx7Fw7Fh8M/t8DmnvnaPtw
 uYbBmZd9mZ0Y7BVMoXh70V0nd21PN8a8qQhYRaUD7lyb1w6Tcfzag4J1DXFfP8Lm
 ovaf2AW3mgt+RmzIRNqP28weLt/VxFC38H/nZ9Jlc9npfnLTyGfwfOxE0CILfEo+
 cYYVbMaIN+vs5kJQaVbvEJvk7oumLC9CvwE6oIL8J0XOs8dbBHkbZPQYW0yVF1/m
 rXEd3LBSNhnZIF0aMUoJrBZAI++nGZo0izSu3eGwLZXSbWBVjlzPAqeBJQtqfQ/E
 j87IisQjkWeOOSNvBas1bURWa7Gy5QFRCxbJQFfAZjIHhg+fIwxrK0HlSqxUXqK5
 PRbc1LsWjUn9TspOC+mRIKrqAfetkohL7BGc+uuslH3uXiMQVAghg37+rSqvAjkn
 50d8XxqOd7aC0NOVn8BfxhMf85Ge7z/0r7JJcaLcRY7/CP6S3vTCAgbSjN4+WzfN
 sRu5W/m8oLuF8Q9DdgqtqiNrYezhoEKJHZsGoi/IGy6eAYjMxPX/Cl4YysdqV32N
 Z55ZeEBwg9KC1g==
 =AHdL
 -----END PGP SIGNATURE-----

Merge tag 'x86-misc-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 updates from Thomas Gleixner:
 "Updates for miscellaneous x86 areas:

   - Reserve a new boot loader type for barebox which is usally used on
     ARM and MIPS, but can also be utilized as EFI payload on x86 to
     provide watchdog-supervised boot up.

   - Consolidate the native and compat 32bit signal handling code and
     split the 64bit version out into a separate source file

   - Switch the ESPFIX random usage to get_random_long()"

* tag 'x86-misc-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/espfix: Use get_random_long() rather than archrandom
  x86/signal/64: Move 64-bit signal code to its own file
  x86/signal/32: Merge native and compat 32-bit signal code
  x86/signal: Add ABI prefixes to frame setup functions
  x86/signal: Merge get_sigframe()
  x86: Remove __USER32_DS
  signal/compat: Remove compat_sigset_t override
  x86/signal: Remove sigset_t parameter from frame setup functions
  x86/signal: Remove sig parameter from frame setup functions
  Documentation/x86/boot: Reserve type_of_loader=13 for barebox
2022-12-12 13:01:14 -08:00
Paolo Bonzini 9352e7470a Merge remote-tracking branch 'kvm/queue' into HEAD
x86 Xen-for-KVM:

* Allow the Xen runstate information to cross a page boundary

* Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured

* add support for 32-bit guests in SCHEDOP_poll

x86 fixes:

* One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

* Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
   years back when eliminating unnecessary barriers when switching between
   vmcs01 and vmcs02.

* Clean up the MSR filter docs.

* Clean up vmread_error_trampoline() to make it more obvious that params
  must be passed on the stack, even for x86-64.

* Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
  of the current guest CPUID.

* Fudge around a race with TSC refinement that results in KVM incorrectly
  thinking a guest needs TSC scaling when running on a CPU with a
  constant TSC, but no hardware-enumerated TSC frequency.

* Advertise (on AMD) that the SMM_CTL MSR is not supported

* Remove unnecessary exports

Selftests:

* Fix an inverted check in the access tracking perf test, and restore
  support for asserting that there aren't too many idle pages when
  running on bare metal.

* Fix an ordering issue in the AMX test introduced by recent conversions
  to use kvm_cpu_has(), and harden the code to guard against similar bugs
  in the future.  Anything that tiggers caching of KVM's supported CPUID,
  kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
  the caching occurs before the test opts in via prctl().

* Fix build errors that occur in certain setups (unsure exactly what is
  unique about the problematic setup) due to glibc overriding
  static_assert() to a variant that requires a custom message.

* Introduce actual atomics for clear/set_bit() in selftests

Documentation:

* Remove deleted ioctls from documentation

* Various fixes
2022-12-12 15:54:07 -05:00
Linus Torvalds 369013162f A set of changes for the x86 APIC code:
- Handle the case where x2APIC is enabled and locked by the BIOS on a
     kernel with CONFIG_X86_X2APIC=n gracefully. Instead of a panic which
     does not make it to the graphical console during very early boot,
     simply disable the local APIC completely and boot with the PIC and very
     limited functionality, which allows to diagnose the issue.
 
   - Convert x86 APIC device tree bindings to YAML
 
   - Extend x86 APIC device tree bindings to configure interrupt delivery
     mode and handle this in during init. This allows to boot with device
     tree on platforms which lack a legacy PIC.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUuYUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaTED/9D33bnJesbDVZs31HxLJc/jZED0/Do
 dli0wRHWmQx9jpUmTXlKRhhIcUOjPy3Cdz44yoOH14wdJ96qUCBUj8sS9vFO4F7M
 CS/eoO77GKG6oXpMvsNC5TcSaZnXAb4UYz5wCV21ZXL6P0izhOivKSqTR222jT6e
 afEzQhwWhHZmrkX44F1YvMuc+HP6+swfO635vNtZhKtlA7NeKdHRijGZhrXEhNO/
 Pue2xbYVMSLNaRTRtN0Mjm6UvShBLQhbmD/vXrVOCztfzhSfwq0LRC9xXcXmdWCY
 XjflM+osQxIUs2WbpL1lohq5VUzTlWVNsZe4YkH5b0xMEO9HkD7apF03p03SIO4n
 X37joMbrfPz9ZsmSdaN836YZd74IfQ5wnFFQTVL0BC0M4lZNeAnNcxVr3Mfio4yX
 GvYahmyvxHlbWag4SYqVsy15QiNV/xZZZD6uIvBvMCfxoFKw8tBF+9/2Iy+3R+zj
 n7q17Y9bLSXwh1Z/9xgwdTs+7SNCpIlZ/5nz8NpBhHaZF2BziICCv2TEKZUXmli3
 HHkWM7ikj67zgFMiWLLOZpiYz/vgJEFE9nhlmXEH1RNMIfqom/JG8FN8GE1C9kYV
 dmSjOE7x/CdZfJ83BRlTx5j2HfAs7RW4A7IMWPIxNdqEFmhxWnQIHasAfMrHcoIU
 pAQ8u/qoduJA4A==
 =dpZx
 -----END PGP SIGNATURE-----

Merge tag 'x86-apic-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 apic update from Thomas Gleixner:
 "A set of changes for the x86 APIC code:

   - Handle the case where x2APIC is enabled and locked by the BIOS on a
     kernel with CONFIG_X86_X2APIC=n gracefully.

     Instead of a panic which does not make it to the graphical console
     during very early boot, simply disable the local APIC completely
     and boot with the PIC and very limited functionality, which allows
     to diagnose the issue

   - Convert x86 APIC device tree bindings to YAML

   - Extend x86 APIC device tree bindings to configure interrupt
     delivery mode and handle this in during init. This allows to boot
     with device tree on platforms which lack a legacy PIC"

* tag 'x86-apic-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/of: Add support for boot time interrupt delivery mode configuration
  x86/of: Replace printk(KERN_LVL) with pr_lvl()
  dt-bindings: x86: apic: Introduce new optional bool property for lapic
  dt-bindings: x86: apic: Convert Intel's APIC bindings to YAML schema
  x86/of: Remove unused early_init_dt_add_memory_arch()
  x86/apic: Handle no CONFIG_X86_X2APIC on systems with x2APIC enabled by BIOS
2022-12-12 12:30:31 -08:00
Linus Torvalds 9d33edb20f Updates for the interrupt core and driver subsystem:
- Core:
 
    The bulk is the rework of the MSI subsystem to support per device MSI
    interrupt domains. This solves conceptual problems of the current
    PCI/MSI design which are in the way of providing support for PCI/MSI[-X]
    and the upcoming PCI/IMS mechanism on the same device.
 
    IMS (Interrupt Message Store] is a new specification which allows device
    manufactures to provide implementation defined storage for MSI messages
    contrary to the uniform and specification defined storage mechanisms for
    PCI/MSI and PCI/MSI-X. IMS not only allows to overcome the size limitations
    of the MSI-X table, but also gives the device manufacturer the freedom to
    store the message in arbitrary places, even in host memory which is shared
    with the device.
 
    There have been several attempts to glue this into the current MSI code,
    but after lengthy discussions it turned out that there is a fundamental
    design problem in the current PCI/MSI-X implementation. This needs some
    historical background.
 
    When PCI/MSI[-X] support was added around 2003, interrupt management was
    completely different from what we have today in the actively developed
    architectures. Interrupt management was completely architecture specific
    and while there were attempts to create common infrastructure the
    commonalities were rudimentary and just providing shared data structures and
    interfaces so that drivers could be written in an architecture agnostic
    way.
 
    The initial PCI/MSI[-X] support obviously plugged into this model which
    resulted in some basic shared infrastructure in the PCI core code for
    setting up MSI descriptors, which are a pure software construct for holding
    data relevant for a particular MSI interrupt, but the actual association to
    Linux interrupts was completely architecture specific. This model is still
    supported today to keep museum architectures and notorious stranglers
    alive.
 
    In 2013 Intel tried to add support for hot-pluggable IO/APICs to the kernel,
    which was creating yet another architecture specific mechanism and resulted
    in an unholy mess on top of the existing horrors of x86 interrupt handling.
    The x86 interrupt management code was already an incomprehensible maze of
    indirections between the CPU vector management, interrupt remapping and the
    actual IO/APIC and PCI/MSI[-X] implementation.
 
    At roughly the same time ARM struggled with the ever growing SoC specific
    extensions which were glued on top of the architected GIC interrupt
    controller.
 
    This resulted in a fundamental redesign of interrupt management and
    provided the today prevailing concept of hierarchical interrupt
    domains. This allowed to disentangle the interactions between x86 vector
    domain and interrupt remapping and also allowed ARM to handle the zoo of
    SoC specific interrupt components in a sane way.
 
    The concept of hierarchical interrupt domains aims to encapsulate the
    functionality of particular IP blocks which are involved in interrupt
    delivery so that they become extensible and pluggable. The X86
    encapsulation looks like this:
 
                                             |--- device 1
      [Vector]---[Remapping]---[PCI/MSI]--|...
                                             |--- device N
 
    where the remapping domain is an optional component and in case that it is
    not available the PCI/MSI[-X] domains have the vector domain as their
    parent. This reduced the required interaction between the domains pretty
    much to the initialization phase where it is obviously required to
    establish the proper parent relation ship in the components of the
    hierarchy.
 
    While in most cases the model is strictly representing the chain of IP
    blocks and abstracting them so they can be plugged together to form a
    hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the hardware
    it's clear that the actual PCI/MSI[-X] interrupt controller is not a global
    entity, but strict a per PCI device entity.
 
    Here we took a short cut on the hierarchical model and went for the easy
    solution of providing "global" PCI/MSI domains which was possible because
    the PCI/MSI[-X] handling is uniform across the devices. This also allowed
    to keep the existing PCI/MSI[-X] infrastructure mostly unchanged which in
    turn made it simple to keep the existing architecture specific management
    alive.
 
    A similar problem was created in the ARM world with support for IP block
    specific message storage. Instead of going all the way to stack a IP block
    specific domain on top of the generic MSI domain this ended in a construct
    which provides a "global" platform MSI domain which allows overriding the
    irq_write_msi_msg() callback per allocation.
 
    In course of the lengthy discussions we identified other abuse of the MSI
    infrastructure in wireless drivers, NTB etc. where support for
    implementation specific message storage was just mindlessly glued into the
    existing infrastructure. Some of this just works by chance on particular
    platforms but will fail in hard to diagnose ways when the driver is used
    on platforms where the underlying MSI interrupt management code does not
    expect the creative abuse.
 
    Another shortcoming of today's PCI/MSI-X support is the inability to
    allocate or free individual vectors after the initial enablement of
    MSI-X. This results in an works by chance implementation of VFIO (PCI
    pass-through) where interrupts on the host side are not set up upfront to
    avoid resource exhaustion. They are expanded at run-time when the guest
    actually tries to use them. The way how this is implemented is that the
    host disables MSI-X and then re-enables it with a larger number of
    vectors again. That works by chance because most device drivers set up
    all interrupts before the device actually will utilize them. But that's
    not universally true because some drivers allocate a large enough number
    of vectors but do not utilize them until it's actually required,
    e.g. for acceleration support. But at that point other interrupts of the
    device might be in active use and the MSI-X disable/enable dance can
    just result in losing interrupts and therefore hard to diagnose subtle
    problems.
 
    Last but not least the "global" PCI/MSI-X domain approach prevents to
    utilize PCI/MSI[-X] and PCI/IMS on the same device due to the fact that IMS
    is not longer providing a uniform storage and configuration model.
 
    The solution to this is to implement the missing step and switch from
    global PCI/MSI domains to per device PCI/MSI domains. The resulting
    hierarchy then looks like this:
 
                               |--- [PCI/MSI] device 1
      [Vector]---[Remapping]---|...
                               |--- [PCI/MSI] device N
 
    which in turn allows to provide support for multiple domains per device:
 
                               |--- [PCI/MSI] device 1
                               |--- [PCI/IMS] device 1
      [Vector]---[Remapping]---|...
                               |--- [PCI/MSI] device N
                               |--- [PCI/IMS] device N
 
    This work converts the MSI and PCI/MSI core and the x86 interrupt
    domains to the new model, provides new interfaces for post-enable
    allocation/free of MSI-X interrupts and the base framework for PCI/IMS.
    PCI/IMS has been verified with the work in progress IDXD driver.
 
    There is work in progress to convert ARM over which will replace the
    platform MSI train-wreck. The cleanup of VFIO, NTB and other creative
    "solutions" are in the works as well.
 
  - Drivers:
 
    - Updates for the LoongArch interrupt chip drivers
 
    - Support for MTK CIRQv2
 
    - The usual small fixes and updates all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUsygTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYXiD/40tXKzCzf0qFIqUlZLia1N3RRrwrNC
 DVTixuLtR9MrjwE+jWLQILa85SHInV8syXHSd35SzhsGDxkURFGi+HBgVWmysODf
 br9VSh3Gi+kt7iXtIwAg8WNWviGNmS3kPksxCko54F0YnJhMY5r5bhQVUBQkwFG2
 wES1C9Uzd4pdV2bl24Z+WKL85cSmZ+pHunyKw1n401lBABXnTF9c4f13zC14jd+y
 wDxNrmOxeL3mEH4Pg6VyrDuTOURSf3TjJjeEq3EYqvUo0FyLt9I/cKX0AELcZQX7
 fkRjrQQAvXNj39RJfeSkojDfllEPUHp7XSluhdBu5aIovSamdYGCDnuEoZ+l4MJ+
 CojIErp3Dwj/uSaf5c7C3OaDAqH2CpOFWIcrUebShJE60hVKLEpUwd6W8juplaoT
 gxyXRb1Y+BeJvO8VhMN4i7f3232+sj8wuj+HTRTTbqMhkElnin94tAx8rgwR1sgR
 BiOGMJi4K2Y8s9Rqqp0Dvs01CW4guIYvSR4YY+WDbbi1xgiev89OYs6zZTJCJe4Y
 NUwwpqYSyP1brmtdDdBOZLqegjQm+TwUb6oOaasFem4vT1swgawgLcDnPOx45bk5
 /FWt3EmnZxMz99x9jdDn1+BCqAZsKyEbEY1avvhPVMTwoVIuSX2ceTBMLseGq+jM
 03JfvdxnueM3gw==
 =9erA
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Updates for the interrupt core and driver subsystem:

  The bulk is the rework of the MSI subsystem to support per device MSI
  interrupt domains. This solves conceptual problems of the current
  PCI/MSI design which are in the way of providing support for
  PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device.

  IMS (Interrupt Message Store] is a new specification which allows
  device manufactures to provide implementation defined storage for MSI
  messages (as opposed to PCI/MSI and PCI/MSI-X that has a specified
  message store which is uniform accross all devices). The PCI/MSI[-X]
  uniformity allowed us to get away with "global" PCI/MSI domains.

  IMS not only allows to overcome the size limitations of the MSI-X
  table, but also gives the device manufacturer the freedom to store the
  message in arbitrary places, even in host memory which is shared with
  the device.

  There have been several attempts to glue this into the current MSI
  code, but after lengthy discussions it turned out that there is a
  fundamental design problem in the current PCI/MSI-X implementation.
  This needs some historical background.

  When PCI/MSI[-X] support was added around 2003, interrupt management
  was completely different from what we have today in the actively
  developed architectures. Interrupt management was completely
  architecture specific and while there were attempts to create common
  infrastructure the commonalities were rudimentary and just providing
  shared data structures and interfaces so that drivers could be written
  in an architecture agnostic way.

  The initial PCI/MSI[-X] support obviously plugged into this model
  which resulted in some basic shared infrastructure in the PCI core
  code for setting up MSI descriptors, which are a pure software
  construct for holding data relevant for a particular MSI interrupt,
  but the actual association to Linux interrupts was completely
  architecture specific. This model is still supported today to keep
  museum architectures and notorious stragglers alive.

  In 2013 Intel tried to add support for hot-pluggable IO/APICs to the
  kernel, which was creating yet another architecture specific mechanism
  and resulted in an unholy mess on top of the existing horrors of x86
  interrupt handling. The x86 interrupt management code was already an
  incomprehensible maze of indirections between the CPU vector
  management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X]
  implementation.

  At roughly the same time ARM struggled with the ever growing SoC
  specific extensions which were glued on top of the architected GIC
  interrupt controller.

  This resulted in a fundamental redesign of interrupt management and
  provided the today prevailing concept of hierarchical interrupt
  domains. This allowed to disentangle the interactions between x86
  vector domain and interrupt remapping and also allowed ARM to handle
  the zoo of SoC specific interrupt components in a sane way.

  The concept of hierarchical interrupt domains aims to encapsulate the
  functionality of particular IP blocks which are involved in interrupt
  delivery so that they become extensible and pluggable. The X86
  encapsulation looks like this:

                                            |--- device 1
     [Vector]---[Remapping]---[PCI/MSI]--|...
                                            |--- device N

  where the remapping domain is an optional component and in case that
  it is not available the PCI/MSI[-X] domains have the vector domain as
  their parent. This reduced the required interaction between the
  domains pretty much to the initialization phase where it is obviously
  required to establish the proper parent relation ship in the
  components of the hierarchy.

  While in most cases the model is strictly representing the chain of IP
  blocks and abstracting them so they can be plugged together to form a
  hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the
  hardware it's clear that the actual PCI/MSI[-X] interrupt controller
  is not a global entity, but strict a per PCI device entity.

  Here we took a short cut on the hierarchical model and went for the
  easy solution of providing "global" PCI/MSI domains which was possible
  because the PCI/MSI[-X] handling is uniform across the devices. This
  also allowed to keep the existing PCI/MSI[-X] infrastructure mostly
  unchanged which in turn made it simple to keep the existing
  architecture specific management alive.

  A similar problem was created in the ARM world with support for IP
  block specific message storage. Instead of going all the way to stack
  a IP block specific domain on top of the generic MSI domain this ended
  in a construct which provides a "global" platform MSI domain which
  allows overriding the irq_write_msi_msg() callback per allocation.

  In course of the lengthy discussions we identified other abuse of the
  MSI infrastructure in wireless drivers, NTB etc. where support for
  implementation specific message storage was just mindlessly glued into
  the existing infrastructure. Some of this just works by chance on
  particular platforms but will fail in hard to diagnose ways when the
  driver is used on platforms where the underlying MSI interrupt
  management code does not expect the creative abuse.

  Another shortcoming of today's PCI/MSI-X support is the inability to
  allocate or free individual vectors after the initial enablement of
  MSI-X. This results in an works by chance implementation of VFIO (PCI
  pass-through) where interrupts on the host side are not set up upfront
  to avoid resource exhaustion. They are expanded at run-time when the
  guest actually tries to use them. The way how this is implemented is
  that the host disables MSI-X and then re-enables it with a larger
  number of vectors again. That works by chance because most device
  drivers set up all interrupts before the device actually will utilize
  them. But that's not universally true because some drivers allocate a
  large enough number of vectors but do not utilize them until it's
  actually required, e.g. for acceleration support. But at that point
  other interrupts of the device might be in active use and the MSI-X
  disable/enable dance can just result in losing interrupts and
  therefore hard to diagnose subtle problems.

  Last but not least the "global" PCI/MSI-X domain approach prevents to
  utilize PCI/MSI[-X] and PCI/IMS on the same device due to the fact
  that IMS is not longer providing a uniform storage and configuration
  model.

  The solution to this is to implement the missing step and switch from
  global PCI/MSI domains to per device PCI/MSI domains. The resulting
  hierarchy then looks like this:

                              |--- [PCI/MSI] device 1
     [Vector]---[Remapping]---|...
                              |--- [PCI/MSI] device N

  which in turn allows to provide support for multiple domains per
  device:

                              |--- [PCI/MSI] device 1
                              |--- [PCI/IMS] device 1
     [Vector]---[Remapping]---|...
                              |--- [PCI/MSI] device N
                              |--- [PCI/IMS] device N

  This work converts the MSI and PCI/MSI core and the x86 interrupt
  domains to the new model, provides new interfaces for post-enable
  allocation/free of MSI-X interrupts and the base framework for
  PCI/IMS. PCI/IMS has been verified with the work in progress IDXD
  driver.

  There is work in progress to convert ARM over which will replace the
  platform MSI train-wreck. The cleanup of VFIO, NTB and other creative
  "solutions" are in the works as well.

  Drivers:

   - Updates for the LoongArch interrupt chip drivers

   - Support for MTK CIRQv2

   - The usual small fixes and updates all over the place"

* tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (134 commits)
  irqchip/ti-sci-inta: Fix kernel doc
  irqchip/gic-v2m: Mark a few functions __init
  irqchip/gic-v2m: Include arm-gic-common.h
  irqchip/irq-mvebu-icu: Fix works by chance pointer assignment
  iommu/amd: Enable PCI/IMS
  iommu/vt-d: Enable PCI/IMS
  x86/apic/msi: Enable PCI/IMS
  PCI/MSI: Provide pci_ims_alloc/free_irq()
  PCI/MSI: Provide IMS (Interrupt Message Store) support
  genirq/msi: Provide constants for PCI/IMS support
  x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN
  PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X
  PCI/MSI: Provide prepare_desc() MSI domain op
  PCI/MSI: Split MSI-X descriptor setup
  genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN
  genirq/msi: Provide msi_domain_alloc_irq_at()
  genirq/msi: Provide msi_domain_ops:: Prepare_desc()
  genirq/msi: Provide msi_desc:: Msi_data
  genirq/msi: Provide struct msi_map
  x86/apic/msi: Remove arch_create_remap_msi_irq_domain()
  ...
2022-12-12 11:21:29 -08:00
Linus Torvalds 06cff4a58e arm64 updates for 6.2
ACPI:
 	* Enable FPDT support for boot-time profiling
 	* Fix CPU PMU probing to work better with PREEMPT_RT
 	* Update SMMUv3 MSI DeviceID parsing to latest IORT spec
 	* APMT support for probing Arm CoreSight PMU devices
 
 CPU features:
 	* Advertise new SVE instructions (v2.1)
 	* Advertise range prefetch instruction
 	* Advertise CSSC ("Common Short Sequence Compression") scalar
 	  instructions, adding things like min, max, abs, popcount
 	* Enable DIT (Data Independent Timing) when running in the kernel
 	* More conversion of system register fields over to the generated
 	  header
 
 CPU misfeatures:
 	* Workaround for Cortex-A715 erratum #2645198
 
 Dynamic SCS:
 	* Support for dynamic shadow call stacks to allow switching at
 	  runtime between Clang's SCS implementation and the CPU's
 	  pointer authentication feature when it is supported (complete
 	  with scary DWARF parser!)
 
 Tracing and debug:
 	* Remove static ftrace in favour of, err, dynamic ftrace!
 	* Seperate 'struct ftrace_regs' from 'struct pt_regs' in core
 	  ftrace and existing arch code
 	* Introduce and implement FTRACE_WITH_ARGS on arm64 to replace
 	  the old FTRACE_WITH_REGS
 	* Extend 'crashkernel=' parameter with default value and fallback
 	  to placement above 4G physical if initial (low) allocation
 	  fails
 
 SVE:
 	* Optimisation to avoid disabling SVE unconditionally on syscall
 	  entry and just zeroing the non-shared state on return instead
 
 Exceptions:
 	* Rework of undefined instruction handling to avoid serialisation
 	  on global lock (this includes emulation of user accesses to the
 	  ID registers)
 
 Perf and PMU:
 	* Support for TLP filters in Hisilicon's PCIe PMU device
 	* Support for the DDR PMU present in Amlogic Meson G12 SoCs
 	* Support for the terribly-named "CoreSight PMU" architecture
 	  from Arm (and Nvidia's implementation of said architecture)
 
 Misc:
 	* Tighten up our boot protocol for systems with memory above
           52 bits physical
 	* Const-ify static keys to satisty jump label asm constraints
 	* Trivial FFA driver cleanups in preparation for v1.1 support
 	* Export the kernel_neon_* APIs as GPL symbols
 	* Harden our instruction generation routines against
 	  instrumentation
 	* A bunch of robustness improvements to our arch-specific selftests
 	* Minor cleanups and fixes all over (kbuild, kprobes, kfence, PMU, ...)
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmOPLFAQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNPRcCACLyDTvkimiqfoPxzzgdkx/6QOvw9s3/mXg
 UcTORSZBR1VnYkiMYEKVz/tTfG99dnWtD8/0k/rz48NbhBfsF2sN4ukyBBXVf0zR
 fjnaVyVC11LUgBgZKPo6maV+jf/JWf9hJtpPl06KTiPb2Hw2JX4DXg+PeF8t2hGx
 NLH4ekQOrlDM8mlsN5mc0YsHbiuO7Xe/NRuet8TsgU4bEvLAwO6bzOLVUMqDQZNq
 bQe2ENcGVAzAf7iRJb38lj9qB/5hrQTHRXqLXMSnJyyVjQEwYca0PeJMa7x30bXF
 ZZ+xQ8Wq0mxiffZraf6SE34yD4gaYS4Fziw7rqvydC15vYhzJBH1
 =hV+2
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "The highlights this time are support for dynamically enabling and
  disabling Clang's Shadow Call Stack at boot and a long-awaited
  optimisation to the way in which we handle the SVE register state on
  system call entry to avoid taking unnecessary traps from userspace.

  Summary:

  ACPI:
   - Enable FPDT support for boot-time profiling
   - Fix CPU PMU probing to work better with PREEMPT_RT
   - Update SMMUv3 MSI DeviceID parsing to latest IORT spec
   - APMT support for probing Arm CoreSight PMU devices

  CPU features:
   - Advertise new SVE instructions (v2.1)
   - Advertise range prefetch instruction
   - Advertise CSSC ("Common Short Sequence Compression") scalar
     instructions, adding things like min, max, abs, popcount
   - Enable DIT (Data Independent Timing) when running in the kernel
   - More conversion of system register fields over to the generated
     header

  CPU misfeatures:
   - Workaround for Cortex-A715 erratum #2645198

  Dynamic SCS:
   - Support for dynamic shadow call stacks to allow switching at
     runtime between Clang's SCS implementation and the CPU's pointer
     authentication feature when it is supported (complete with scary
     DWARF parser!)

  Tracing and debug:
   - Remove static ftrace in favour of, err, dynamic ftrace!
   - Seperate 'struct ftrace_regs' from 'struct pt_regs' in core ftrace
     and existing arch code
   - Introduce and implement FTRACE_WITH_ARGS on arm64 to replace the
     old FTRACE_WITH_REGS
   - Extend 'crashkernel=' parameter with default value and fallback to
     placement above 4G physical if initial (low) allocation fails

  SVE:
   - Optimisation to avoid disabling SVE unconditionally on syscall
     entry and just zeroing the non-shared state on return instead

  Exceptions:
   - Rework of undefined instruction handling to avoid serialisation on
     global lock (this includes emulation of user accesses to the ID
     registers)

  Perf and PMU:
   - Support for TLP filters in Hisilicon's PCIe PMU device
   - Support for the DDR PMU present in Amlogic Meson G12 SoCs
   - Support for the terribly-named "CoreSight PMU" architecture from
     Arm (and Nvidia's implementation of said architecture)

  Misc:
   - Tighten up our boot protocol for systems with memory above 52 bits
     physical
   - Const-ify static keys to satisty jump label asm constraints
   - Trivial FFA driver cleanups in preparation for v1.1 support
   - Export the kernel_neon_* APIs as GPL symbols
   - Harden our instruction generation routines against instrumentation
   - A bunch of robustness improvements to our arch-specific selftests
   - Minor cleanups and fixes all over (kbuild, kprobes, kfence, PMU, ...)"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (151 commits)
  arm64: kprobes: Return DBG_HOOK_ERROR if kprobes can not handle a BRK
  arm64: kprobes: Let arch do_page_fault() fix up page fault in user handler
  arm64: Prohibit instrumentation on arch_stack_walk()
  arm64:uprobe fix the uprobe SWBP_INSN in big-endian
  arm64: alternatives: add __init/__initconst to some functions/variables
  arm_pmu: Drop redundant armpmu->map_event() in armpmu_event_init()
  kselftest/arm64: Allow epoll_wait() to return more than one result
  kselftest/arm64: Don't drain output while spawning children
  kselftest/arm64: Hold fp-stress children until they're all spawned
  arm64/sysreg: Remove duplicate definitions from asm/sysreg.h
  arm64/sysreg: Convert ID_DFR1_EL1 to automatic generation
  arm64/sysreg: Convert ID_DFR0_EL1 to automatic generation
  arm64/sysreg: Convert ID_AFR0_EL1 to automatic generation
  arm64/sysreg: Convert ID_MMFR5_EL1 to automatic generation
  arm64/sysreg: Convert MVFR2_EL1 to automatic generation
  arm64/sysreg: Convert MVFR1_EL1 to automatic generation
  arm64/sysreg: Convert MVFR0_EL1 to automatic generation
  arm64/sysreg: Convert ID_PFR2_EL1 to automatic generation
  arm64/sysreg: Convert ID_PFR1_EL1 to automatic generation
  arm64/sysreg: Convert ID_PFR0_EL1 to automatic generation
  ...
2022-12-12 09:50:05 -08:00
Linus Torvalds 7d62159919 hyperv-next for v6.2
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmORzR4THHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXkqCCACFwHz04iepLE7R8ZZ6BVUhD6uzfzDo
 s1j7ozOUGUe3vI6q0DElHWVQZgzIzLypVsfWkZToe6jeOU6R48b0tZSFyJCUNwGM
 ogmS7N8fBdHfY9SBFoUPoziBifXpf3kq4hhX/w+1Lge9CN5Ywc4KjuJb91EAInbs
 lm47O4KQY8w8A7BbPBHYBueUVWLvgwPRPOS032zqxN1787m2tCxpqkfnImK39kh6
 IsBBIZfYsok0H5wldhZXnsARpEOeFF6BoFBXpFPlmnbv2VcK2AfZgTYdA3ESyEgd
 NyOFDfh6BO07gTR1xCH6gvOpkHwx6xKAkjE36RymdhXS6fhRCRsfahVB
 =m78g
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20221208' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - Drop unregister syscore from hyperv_cleanup to avoid hang (Gaurav
   Kohli)

 - Clean up panic path for Hyper-V framebuffer (Guilherme G. Piccoli)

 - Allow IRQ remapping to work without x2apic (Nuno Das Neves)

 - Fix comments (Olaf Hering)

 - Expand hv_vp_assist_page definition (Saurabh Sengar)

 - Improvement to page reporting (Shradha Gupta)

 - Make sure TSC clocksource works when Linux runs as the root partition
   (Stanislav Kinsburskiy)

* tag 'hyperv-next-signed-20221208' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: Remove unregister syscore call from Hyper-V cleanup
  iommu/hyper-v: Allow hyperv irq remapping without x2apic
  clocksource: hyper-v: Add TSC page support for root partition
  clocksource: hyper-v: Use TSC PFN getter to map vvar page
  clocksource: hyper-v: Introduce TSC PFN getter
  clocksource: hyper-v: Introduce a pointer to TSC page
  x86/hyperv: Expand definition of struct hv_vp_assist_page
  PCI: hv: update comment in x86 specific hv_arch_irq_unmask
  hv: fix comment typo in vmbus_channel/low_latency
  drivers: hv, hyperv_fb: Untangle and refactor Hyper-V panic notifiers
  video: hyperv_fb: Avoid taking busy spinlock on panic path
  hv_balloon: Add support for configurable order free page reporting
  mm/page_reporting: Add checks for page_reporting_order param
2022-12-12 09:34:16 -08:00
Paolo Bonzini eb5618911a KVM/arm64 updates for 6.2
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
   option to keep the good old dirty log around for pages that are
   dirtied by something other than a vcpu.
 
 - Switch to the relaxed parallel fault handling, using RCU to delay
   page table reclaim and giving better performance under load.
 
 - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
   option, which multi-process VMMs such as crosvm rely on.
 
 - Merge the pKVM shadow vcpu state tracking that allows the hypervisor
   to have its own view of a vcpu, keeping that state private.
 
 - Add support for the PMUv3p5 architecture revision, bringing support
   for 64bit counters on systems that support it, and fix the
   no-quite-compliant CHAIN-ed counter support for the machines that
   actually exist out there.
 
 - Fix a handful of minor issues around 52bit VA/PA support (64kB pages
   only) as a prefix of the oncoming support for 4kB and 16kB pages.
 
 - Add/Enable/Fix a bunch of selftests covering memslots, breakpoints,
   stage-2 faults and access tracking. You name it, we got it, we
   probably broke it.
 
 - Pick a small set of documentation and spelling fixes, because no
   good merge window would be complete without those.
 
 As a side effect, this tag also drags:
 
 - The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring
   series
 
 - A shared branch with the arm64 tree that repaints all the system
   registers to match the ARM ARM's naming, and resulting in
   interesting conflicts
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmOODb0PHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDztsQAInRnsgLl57/SpqhZzExNCllN6AT/bdeB3uz
 rnw3ScJOV174uNKp8lnPWoTvu2YUGiVtBp6tFHhDI8le7zHX438ZT8KE5mcs8p5i
 KfFKnb8SHV2DDpqkcy24c0Xl/6vsg1qkKrdfJb49yl5ZakRITDpynW/7tn6dXsxX
 wASeGFdCYeW4g2xMQzsCbtx6LgeQ8uomBmzRfPrOtZHYYxAn6+4Mj4595EC1sWxM
 AQnbp8tW3Vw46saEZAQvUEOGOW9q0Nls7G21YqQ52IA+ZVDK1LmAF2b1XY3edjkk
 pX8EsXOURfqdasBxfSfF3SgnUazoz9GHpSzp1cTVTktrPp40rrT7Ldtml0ktq69d
 1malPj47KVMDsIq0kNJGnMxciXFgAHw+VaCQX+k4zhIatNwviMbSop2fEoxj22jc
 4YGgGOxaGrnvmAJhreCIbr4CkZk5CJ8Zvmtfg+QM6npIp8BY8896nvORx/d4i6tT
 H4caadd8AAR56ANUyd3+KqF3x0WrkaU0PLHJLy1tKwOXJUUTjcpvIfahBAAeUlSR
 qEFrtb+EEMPgAwLfNOICcNkPZR/yyuYvM+FiUQNVy5cNiwFkpztpIctfOFaHySGF
 K07O2/a1F6xKL0OKRUg7hGKknF9ecmux4vHhiUMuIk9VOgNTWobHozBDorLKXMzC
 aWa6oGVC
 =iIPT
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.2

- Enable the per-vcpu dirty-ring tracking mechanism, together with an
  option to keep the good old dirty log around for pages that are
  dirtied by something other than a vcpu.

- Switch to the relaxed parallel fault handling, using RCU to delay
  page table reclaim and giving better performance under load.

- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
  option, which multi-process VMMs such as crosvm rely on.

- Merge the pKVM shadow vcpu state tracking that allows the hypervisor
  to have its own view of a vcpu, keeping that state private.

- Add support for the PMUv3p5 architecture revision, bringing support
  for 64bit counters on systems that support it, and fix the
  no-quite-compliant CHAIN-ed counter support for the machines that
  actually exist out there.

- Fix a handful of minor issues around 52bit VA/PA support (64kB pages
  only) as a prefix of the oncoming support for 4kB and 16kB pages.

- Add/Enable/Fix a bunch of selftests covering memslots, breakpoints,
  stage-2 faults and access tracking. You name it, we got it, we
  probably broke it.

- Pick a small set of documentation and spelling fixes, because no
  good merge window would be complete without those.

As a side effect, this tag also drags:

- The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring
  series

- A shared branch with the arm64 tree that repaints all the system
  registers to match the ARM ARM's naming, and resulting in
  interesting conflicts
2022-12-09 09:12:12 +01:00
Ard Biesheuvel d9f26ae731 Linux 6.1-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmONI6weHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG9xgH/jqXGuMoO1ikfmGb
 7oY0W/f69G9V/e0DxFLvnIjhFgCUzdnNsmD4jQJA4x6QsxwLWuvpI282Ez+bHV5T
 U4RPsxJZIIMsXE2lKM9BRgeLzDdCt0aK4Pj+3x2x7NZC5cWFSQ8PyQJkCwg+0PQo
 u8Ly+GO8c4RUMf4/rrAZQq16qZUqGDaGm1EJhtSoa+KiR81LmUUmbDIK9Mr53rmQ
 wou+95XhibwMWr17WgXA28bTgYqn9UGr67V3qvTH2LC7GW8BCoKvn+3wh6TVhlWj
 dsWplXgcOP0/OHvSC5Sb1Uibk5Gx3DlIzYa6OfNZQuZ5xmQqm9kXjW8lmYpWFHy/
 38/5HWc=
 =EuoA
 -----END PGP SIGNATURE-----

Merge tag 'v6.1-rc8' into efi/next

Linux 6.1-rc8
2022-12-07 19:08:57 +01:00
Thomas Gleixner 486254ad96 x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN
x86 MSI irqdomains can handle MSI-X allocation post MSI-X enable just out
of the box - on the vector domain and on the remapping domains,

Add the feature flag to the supported feature list

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.787373104@linutronix.de
2022-12-05 22:22:34 +01:00
Thomas Gleixner 4d5a4ccc51 x86/apic/msi: Remove arch_create_remap_msi_irq_domain()
and related code which is not longer required now that the interrupt remap
code has been converted to MSI parent domains.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.267353814@linutronix.de
2022-12-05 22:22:33 +01:00
Thomas Gleixner b6d5fc3a52 x86/apic/vector: Provide MSI parent domain
Enable MSI parent domain support in the x86 vector domain and fixup the
checks in the iommu implementations to check whether device::msi::domain is
the default MSI parent domain. That keeps the existing logic to protect
e.g. devices behind VMD working.

The interrupt remap PCI/MSI code still works because the underlying vector
domain still provides the same functionality.

None of the other x86 PCI/MSI, e.g. XEN and HyperV, implementations are
affected either. They still work the same way both at the low level and the
PCI/MSI implementations they provide.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.034672592@linutronix.de
2022-12-05 22:22:33 +01:00
Marc Zyngier a937f37d85 Merge branch kvm-arm64/dirty-ring into kvmarm-master/next
* kvm-arm64/dirty-ring:
  : .
  : Add support for the "per-vcpu dirty-ring tracking with a bitmap
  : and sprinkles on top", courtesy of Gavin Shan.
  :
  : This branch drags the kvmarm-fixes-6.1-3 tag which was already
  : merged in 6.1-rc4 so that the branch is in a working state.
  : .
  KVM: Push dirty information unconditionally to backup bitmap
  KVM: selftests: Automate choosing dirty ring size in dirty_log_test
  KVM: selftests: Clear dirty ring states between two modes in dirty_log_test
  KVM: selftests: Use host page size to map ring buffer in dirty_log_test
  KVM: arm64: Enable ring-based dirty memory tracking
  KVM: Support dirty ring in conjunction with bitmap
  KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.h
  KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL

Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-12-05 14:19:50 +00:00
Pawan Gupta 6606515742 x86/bugs: Make sure MSR_SPEC_CTRL is updated properly upon resume from S3
The "force" argument to write_spec_ctrl_current() is currently ambiguous
as it does not guarantee the MSR write. This is due to the optimization
that writes to the MSR happen only when the new value differs from the
cached value.

This is fine in most cases, but breaks for S3 resume when the cached MSR
value gets out of sync with the hardware MSR value due to S3 resetting
it.

When x86_spec_ctrl_current is same as x86_spec_ctrl_base, the MSR write
is skipped. Which results in SPEC_CTRL mitigations not getting restored.

Move the MSR write from write_spec_ctrl_current() to a new function that
unconditionally writes to the MSR. Update the callers accordingly and
rename functions.

  [ bp: Rework a bit. ]

Fixes: caa0ff24d5 ("x86/bugs: Keep a per-CPU IA32_SPEC_CTRL value")
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/806d39b0bfec2fe8f50dc5446dff20f5bb24a959.1669821572.git.pawan.kumar.gupta@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-02 15:45:33 -08:00
Linus Torvalds bdaa78c6aa 15 hotfixes. 11 marked cc:stable. Only three or four of the latter
address post-6.0 issues, which is hopefully a sign that things are
 converging.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY4pQpQAKCRDdBJ7gKXxA
 jquxAP9Lqif7CGDgdq8uWY2hHS/Ujc3k7Ohgyzs37olnCuU8KwEA6/J7SpjsBgtY
 OfzvnwxpCTh8Kfzu/oNckIHo/EEiIA8=
 =o6qT
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc hotfixes from Andrew Morton:
 "15 hotfixes,  11 marked cc:stable.

  Only three or four of the latter address post-6.0 issues, which is
  hopefully a sign that things are converging"

* tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible"
  Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled
  drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame
  mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths
  mm/khugepaged: fix GUP-fast interaction by sending IPI
  mm/khugepaged: take the right locks for page table retraction
  mm: migrate: fix THP's mapcount on isolation
  mm: introduce arch_has_hw_nonleaf_pmd_young()
  mm: add dummy pmd_young() for architectures not having it
  mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes()
  tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep"
  nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()
  hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  madvise: use zap_page_range_single for madvise dontneed
  mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
2022-12-02 13:39:38 -08:00
Javier Martinez Canillas 66a9221d73 KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
The documentation says that the ioctl has been deprecated, but it has been
actually removed and the remaining references are just left overs.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Message-Id: <20221202105011.185147-3-javierm@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-02 12:54:40 -05:00
Mateusz Jończyk e3998434da x86/apic: Handle no CONFIG_X86_X2APIC on systems with x2APIC enabled by BIOS
A kernel that was compiled without CONFIG_X86_X2APIC was unable to boot on
platforms that have x2APIC already enabled in the BIOS before starting the
kernel.

The kernel was supposed to panic with an approprite error message in
validate_x2apic() due to the missing X2APIC support.

However, validate_x2apic() was run too late in the boot cycle, and the
kernel tried to initialize the APIC nonetheless. This resulted in an
earlier panic in setup_local_APIC() because the APIC was not registered.

In my experiments, a panic message in setup_local_APIC() was not visible
in the graphical console, which resulted in a hang with no indication
what has gone wrong.

Instead of calling panic(), disable the APIC, which results in a somewhat
working system with the PIC only (and no SMP). This way the user is able to
diagnose the problem more easily.

Disabling X2APIC mode is not an option because it's impossible on systems
with locked x2APIC.

The proper place to disable the APIC in this case is in check_x2apic(),
which is called early from setup_arch(). Doing this in
__apic_intr_mode_select() is too late.

Make check_x2apic() unconditionally available and remove the empty stub.

Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reported-by: Robert Elliott (Servers) <elliott@hpe.com>
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/d573ba1c-0dc4-3016-712a-cc23a8a33d42@molgen.mpg.de
Link: https://lore.kernel.org/lkml/20220911084711.13694-3-mat.jonczyk@o2.pl
Link: https://lore.kernel.org/all/20221129215008.7247-1-mat.jonczyk@o2.pl
2022-12-02 14:28:52 +01:00
Peter Xu c2da319c2e mm/uffd: sanity check write bit for uffd-wp protected ptes
Let's add one sanity check for CONFIG_DEBUG_VM on the write bit in
whatever chance we have when walking through the pgtables.  It can bring
the error earlier even before the app notices the data was corrupted on
the snapshot.  Also it helps us to identify this is a wrong pgtable setup,
so hopefully a great information to have for debugging too.

Link: https://lkml.kernel.org/r/20221114000447.1681003-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 15:58:55 -08:00
Andrew Morton a38358c934 Merge branch 'mm-hotfixes-stable' into mm-stable 2022-11-30 14:58:42 -08:00
Juergen Gross 4aaf269c76 mm: introduce arch_has_hw_nonleaf_pmd_young()
When running as a Xen PV guests commit eed9a328aa ("mm: x86: add
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in
pmdp_test_and_clear_young():

 BUG: unable to handle page fault for address: ffff8880083374d0
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0003) - permissions violation
 PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065
 Oops: 0003 [#1] PREEMPT SMP NOPTI
 CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1
 RIP: e030:pmdp_test_and_clear_young+0x25/0x40

This happens because the Xen hypervisor can't emulate direct writes to
page table entries other than PTEs.

This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young()
similar to arch_has_hw_pte_young() and test that instead of
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG.

Link: https://lkml.kernel.org/r/20221123064510.16225-1-jgross@suse.com
Fixes: eed9a328aa ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Acked-by: Yu Zhao <yuzhao@google.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Acked-by: David Hildenbrand <david@redhat.com>	[core changes]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 14:49:41 -08:00
Juergen Gross 6617da8fb5 mm: add dummy pmd_young() for architectures not having it
In order to avoid #ifdeffery add a dummy pmd_young() implementation as a
fallback.  This is required for the later patch "mm: introduce
arch_has_hw_nonleaf_pmd_young()".

Link: https://lkml.kernel.org/r/fd3ac3cd-7349-6bbd-890a-71a9454ca0b3@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 14:49:41 -08:00
David Woodhouse d8ba8ba4c8 KVM: x86/xen: Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
Closer inspection of the Xen code shows that we aren't supposed to be
using the XEN_RUNSTATE_UPDATE flag unconditionally. It should be
explicitly enabled by guests through the HYPERVISOR_vm_assist hypercall.
If we randomly set the top bit of ->state_entry_time for a guest that
hasn't asked for it and doesn't expect it, that could make the runtimes
fail to add up and confuse the guest. Without the flag it's perfectly
safe for a vCPU to read its own vcpu_runstate_info; just not for one
vCPU to read *another's*.

I briefly pondered adding a word for the whole set of VMASST_TYPE_*
flags but the only one we care about for HVM guests is this, so it
seemed a bit pointless.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221127122210.248427-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 10:59:37 -05:00
David Woodhouse 5ec3289b31 KVM: x86/xen: Compatibility fixes for shared runstate area
The guest runstate area can be arbitrarily byte-aligned. In fact, even
when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
fields in the structure end up being unaligned due to the fact that the
32-bit ABI only aligns them to 32 bits.

So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
is buggy, because if it's unaligned then we can't update the whole field
atomically; the low bytes might be observable before the _UPDATE bit is.
Xen actually updates the *byte* containing that top bit, on its own. KVM
should do the same.

In addition, we cannot assume that the runstate area fits within a single
page. One option might be to make the gfn_to_pfn cache cope with regions
that cross a page — but getting a contiguous virtual kernel mapping of a
discontiguous set of IOMEM pages is a distinctly non-trivial exercise,
and it seems this is the *only* current use case for the GPC which would
benefit from it.

An earlier version of the runstate code did use a gfn_to_hva cache for
this purpose, but it still had the single-page restriction because it
used the uhva directly — because it needs to be able to do so atomically
when the vCPU is being scheduled out, so it used pagefault_disable()
around the accesses and didn't just use kvm_write_guest_cached() which
has a fallback path.

So... use a pair of GPCs for the first and potential second page covering
the runstate area. We can get away with locking both at once because
nothing else takes more than one GPC lock at a time so we can invent
a trivial ordering rule.

The common case where it's all in the same page is kept as a fast path,
but in both cases, the actual guest structure (compat or not) is built
up from the fields in @vx, following preset pointers to the state and
times fields. The only difference is whether those pointers point to
the kernel stack (in the split case) or to guest memory directly via
the GPC.  The fast path is also fixed to use a byte access for the
XEN_RUNSTATE_UPDATE bit, then the only real difference is the dual
memcpy.

Finally, Xen also does write the runstate area immediately when it's
configured. Flip the kvm_xen_update_runstate() and …_guest() functions
and call the latter directly when the runstate area is set. This means
that other ioctls which modify the runstate also write it immediately
to the guest when they do so, which is also intended.

Update the xen_shinfo_test to exercise the pathological case where the
XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is
actually in a different page to the rest of the 64-bit word.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30 10:56:08 -05:00