OpenCloudOS-Kernel/arch/x86/coco/tdx/tdx.c

789 lines
20 KiB
C
Raw Normal View History

// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2021-2022 Intel Corporation */
#undef pr_fmt
#define pr_fmt(fmt) "tdx: " fmt
#include <linux/cpufeature.h>
#include <asm/coco.h>
#include <asm/tdx.h>
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
#include <asm/vmx.h>
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/pgtable.h>
/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
x86/traps: Add #VE support for TDX guest Virtualization Exceptions (#VE) are delivered to TDX guests due to specific guest actions which may happen in either user space or the kernel: * Specific instructions (WBINVD, for example) * Specific MSR accesses * Specific CPUID leaf accesses * Access to specific guest physical addresses Syscall entry code has a critical window where the kernel stack is not yet set up. Any exception in this window leads to hard to debug issues and can be exploited for privilege escalation. Exceptions in the NMI entry code also cause issues. Returning from the exception handler with IRET will re-enable NMIs and nested NMI will corrupt the NMI stack. For these reasons, the kernel avoids #VEs during the syscall gap and the NMI entry code. Entry code paths do not access TD-shared memory, MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves that might generate #VE. VMM can remove memory from TD at any point, but access to unaccepted (or missing) private memory leads to VM termination, not to #VE. Similarly to page faults and breakpoints, #VEs are allowed in NMI handlers once the kernel is ready to deal with nested NMIs. During #VE delivery, all interrupts, including NMIs, are blocked until TDGETVEINFO is called. It prevents #VE nesting until the kernel reads the VE info. TDGETVEINFO retrieves the #VE info from the TDX module, which also clears the "#VE valid" flag. This must be done before anything else as any #VE that occurs while the valid flag is set escalates to #DF by TDX module. It will result in an oops. Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be delivered until TDGETVEINFO is called. For now, convert unhandled #VE's (everything, until later in this series) so that they appear just like a #GP by calling the ve_raise_fault() directly. The ve_raise_fault() function is similar to #GP handler and is responsible for sending SIGSEGV to userspace and CPU die and notifying debuggers and other die chain users. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 07:29:16 +08:00
#define TDX_GET_VEINFO 3
#define TDX_ACCEPT_PAGE 6
/* TDX hypercall Leaf IDs */
#define TDVMCALL_MAP_GPA 0x10001
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
/* MMIO direction */
#define EPT_READ 0
#define EPT_WRITE 1
x86/tdx: Port I/O: Add runtime hypercalls TDX hypervisors cannot emulate instructions directly. This includes port I/O which is normally emulated in the hypervisor. All port I/O instructions inside TDX trigger the #VE exception in the guest and would be normally emulated there. Use a hypercall to emulate port I/O. Extend the tdx_handle_virt_exception() and add support to handle the #VE due to port I/O instructions. String I/O operations are not supported in TDX. Unroll them by declaring CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute. == Userspace Implications == The ioperm() facility allows userspace access to I/O instructions like inb/outb. Among other things, this allows writing userspace device drivers. This series has no special handling for ioperm(). Users will be able to successfully request I/O permissions but will induce a #VE on their first I/O instruction which leads SIGSEGV. If this is undesirable users can enable kernel lockdown feature with 'lockdown=integrity' kernel command line option. It makes ioperm() fail. More robust handling of this situation (denying ioperm() in all TDX guests) will be addressed in follow-on work. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-06 07:29:26 +08:00
/* Port I/O direction */
#define PORT_READ 0
#define PORT_WRITE 1
/* See Exit Qualification for I/O Instructions in VMX documentation */
#define VE_IS_IO_IN(e) ((e) & BIT(3))
#define VE_GET_IO_SIZE(e) (((e) & GENMASK(2, 0)) + 1)
#define VE_GET_PORT_NUM(e) ((e) >> 16)
#define VE_IS_IO_STRING(e) ((e) & BIT(4))
x86/tdx: Panic on bad configs that #VE on "private" memory access All normal kernel memory is "TDX private memory". This includes everything from kernel stacks to kernel text. Handling exceptions on arbitrary accesses to kernel memory is essentially impossible because they can happen in horribly nasty places like kernel entry/exit. But, TDX hardware can theoretically _deliver_ a virtualization exception (#VE) on any access to private memory. But, it's not as bad as it sounds. TDX can be configured to never deliver these exceptions on private memory with a "TD attribute" called ATTR_SEPT_VE_DISABLE. The guest has no way to *set* this attribute, but it can check it. Ensure ATTR_SEPT_VE_DISABLE is set in early boot. panic() if it is unset. There is no sane way for Linux to run with this attribute clear so a panic() is appropriate. There's small window during boot before the check where kernel has an early #VE handler. But the handler is only for port I/O and will also panic() as soon as it sees any other #VE, such as a one generated by a private memory access. [ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo(). Add Kirill's tested-by because I made changes since he wrote this. ] Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest") Reported-by: ruogui.ygr@alibaba-inc.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221028141220.29217-3-kirill.shutemov%40linux.intel.com
2022-10-28 22:12:20 +08:00
#define ATTR_SEPT_VE_DISABLE BIT(28)
x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions Guests communicate with VMMs with hypercalls. Historically, these are implemented using instructions that are known to cause VMEXITs like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer expose the guest state to the host. This prevents the old hypercall mechanisms from working. So, to communicate with VMM, TDX specification defines a new instruction called TDCALL. In a TDX based VM, since the VMM is an untrusted entity, an intermediary layer -- TDX module -- facilitates secure communication between the host and the guest. TDX module is loaded like a firmware into a special CPU mode called SEAM. TDX guests communicate with the TDX module using the TDCALL instruction. A guest uses TDCALL to communicate with both the TDX module and VMM. The value of the RAX register when executing the TDCALL instruction is used to determine the TDCALL type. A leaf of TDCALL used to communicate with the VMM is called TDVMCALL. Add generic interfaces to communicate with the TDX module and VMM (using the TDCALL instruction). __tdx_module_call() - Used to communicate with the TDX module (via TDCALL instruction). __tdx_hypercall() - Used by the guest to request services from the VMM (via TDVMCALL leaf of TDCALL). Also define an additional wrapper _tdx_hypercall(), which adds error handling support for the TDCALL failure. The __tdx_module_call() and __tdx_hypercall() helper functions are implemented in assembly in a .S file. The TDCALL ABI requires shuffling arguments in and out of registers, which proved to be awkward with inline assembly. Just like syscalls, not all TDVMCALL use cases need to use the same number of argument registers. The implementation here picks the current worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer than 4 arguments, there will end up being a few superfluous (cheap) instructions. But, this approach maximizes code reuse. For registers used by the TDCALL instruction, please check TDX GHCI specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL Interface". Based on previous patch by Sean Christopherson. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20220405232939.73860-4-kirill.shutemov@linux.intel.com
2022-04-06 07:29:12 +08:00
/*
* Wrapper for standard use of __tdx_hypercall with no output aside from
* return code.
*/
static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = fn,
.r12 = r12,
.r13 = r13,
.r14 = r14,
.r15 = r15,
};
return __tdx_hypercall(&args, 0);
}
/* Called from __tdx_hypercall() for unrecoverable failure */
void __tdx_hypercall_failed(void)
{
panic("TDVMCALL failed. TDX module bug?");
}
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
/*
* The TDG.VP.VMCALL-Instruction-execution sub-functions are defined
* independently from but are currently matched 1:1 with VMX EXIT_REASONs.
* Reusing the KVM EXIT_REASON macros makes it easier to connect the host and
* guest sides of these calls.
*/
static u64 hcall_func(u64 exit_reason)
{
return exit_reason;
}
#ifdef CONFIG_KVM_GUEST
long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
unsigned long p3, unsigned long p4)
{
struct tdx_hypercall_args args = {
.r10 = nr,
.r11 = p1,
.r12 = p2,
.r13 = p3,
.r14 = p4,
};
return __tdx_hypercall(&args, 0);
}
EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
#endif
/*
* Used for TDX guests to make calls directly to the TD module. This
* should only be used for calls that have no legitimate reason to fail
* or where the kernel can not survive the call failing.
*/
static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
struct tdx_module_output *out)
{
if (__tdx_module_call(fn, rcx, rdx, r8, r9, out))
panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
}
static void tdx_parse_tdinfo(u64 *cc_mask)
{
struct tdx_module_output out;
unsigned int gpa_width;
x86/tdx: Panic on bad configs that #VE on "private" memory access All normal kernel memory is "TDX private memory". This includes everything from kernel stacks to kernel text. Handling exceptions on arbitrary accesses to kernel memory is essentially impossible because they can happen in horribly nasty places like kernel entry/exit. But, TDX hardware can theoretically _deliver_ a virtualization exception (#VE) on any access to private memory. But, it's not as bad as it sounds. TDX can be configured to never deliver these exceptions on private memory with a "TD attribute" called ATTR_SEPT_VE_DISABLE. The guest has no way to *set* this attribute, but it can check it. Ensure ATTR_SEPT_VE_DISABLE is set in early boot. panic() if it is unset. There is no sane way for Linux to run with this attribute clear so a panic() is appropriate. There's small window during boot before the check where kernel has an early #VE handler. But the handler is only for port I/O and will also panic() as soon as it sees any other #VE, such as a one generated by a private memory access. [ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo(). Add Kirill's tested-by because I made changes since he wrote this. ] Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest") Reported-by: ruogui.ygr@alibaba-inc.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221028141220.29217-3-kirill.shutemov%40linux.intel.com
2022-10-28 22:12:20 +08:00
u64 td_attr;
/*
* TDINFO TDX module call is used to get the TD execution environment
* information like GPA width, number of available vcpus, debug mode
* information, etc. More details about the ABI can be found in TDX
* Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
* [TDG.VP.INFO].
*/
tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out);
/*
* The highest bit of a guest physical address is the "sharing" bit.
* Set it for shared pages and clear it for private pages.
x86/tdx: Panic on bad configs that #VE on "private" memory access All normal kernel memory is "TDX private memory". This includes everything from kernel stacks to kernel text. Handling exceptions on arbitrary accesses to kernel memory is essentially impossible because they can happen in horribly nasty places like kernel entry/exit. But, TDX hardware can theoretically _deliver_ a virtualization exception (#VE) on any access to private memory. But, it's not as bad as it sounds. TDX can be configured to never deliver these exceptions on private memory with a "TD attribute" called ATTR_SEPT_VE_DISABLE. The guest has no way to *set* this attribute, but it can check it. Ensure ATTR_SEPT_VE_DISABLE is set in early boot. panic() if it is unset. There is no sane way for Linux to run with this attribute clear so a panic() is appropriate. There's small window during boot before the check where kernel has an early #VE handler. But the handler is only for port I/O and will also panic() as soon as it sees any other #VE, such as a one generated by a private memory access. [ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo(). Add Kirill's tested-by because I made changes since he wrote this. ] Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest") Reported-by: ruogui.ygr@alibaba-inc.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221028141220.29217-3-kirill.shutemov%40linux.intel.com
2022-10-28 22:12:20 +08:00
*
* The GPA width that comes out of this call is critical. TDX guests
* can not meaningfully run without it.
*/
x86/tdx: Panic on bad configs that #VE on "private" memory access All normal kernel memory is "TDX private memory". This includes everything from kernel stacks to kernel text. Handling exceptions on arbitrary accesses to kernel memory is essentially impossible because they can happen in horribly nasty places like kernel entry/exit. But, TDX hardware can theoretically _deliver_ a virtualization exception (#VE) on any access to private memory. But, it's not as bad as it sounds. TDX can be configured to never deliver these exceptions on private memory with a "TD attribute" called ATTR_SEPT_VE_DISABLE. The guest has no way to *set* this attribute, but it can check it. Ensure ATTR_SEPT_VE_DISABLE is set in early boot. panic() if it is unset. There is no sane way for Linux to run with this attribute clear so a panic() is appropriate. There's small window during boot before the check where kernel has an early #VE handler. But the handler is only for port I/O and will also panic() as soon as it sees any other #VE, such as a one generated by a private memory access. [ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo(). Add Kirill's tested-by because I made changes since he wrote this. ] Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest") Reported-by: ruogui.ygr@alibaba-inc.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221028141220.29217-3-kirill.shutemov%40linux.intel.com
2022-10-28 22:12:20 +08:00
gpa_width = out.rcx & GENMASK(5, 0);
*cc_mask = BIT_ULL(gpa_width - 1);
x86/tdx: Panic on bad configs that #VE on "private" memory access All normal kernel memory is "TDX private memory". This includes everything from kernel stacks to kernel text. Handling exceptions on arbitrary accesses to kernel memory is essentially impossible because they can happen in horribly nasty places like kernel entry/exit. But, TDX hardware can theoretically _deliver_ a virtualization exception (#VE) on any access to private memory. But, it's not as bad as it sounds. TDX can be configured to never deliver these exceptions on private memory with a "TD attribute" called ATTR_SEPT_VE_DISABLE. The guest has no way to *set* this attribute, but it can check it. Ensure ATTR_SEPT_VE_DISABLE is set in early boot. panic() if it is unset. There is no sane way for Linux to run with this attribute clear so a panic() is appropriate. There's small window during boot before the check where kernel has an early #VE handler. But the handler is only for port I/O and will also panic() as soon as it sees any other #VE, such as a one generated by a private memory access. [ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo(). Add Kirill's tested-by because I made changes since he wrote this. ] Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest") Reported-by: ruogui.ygr@alibaba-inc.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221028141220.29217-3-kirill.shutemov%40linux.intel.com
2022-10-28 22:12:20 +08:00
/*
* The kernel can not handle #VE's when accessing normal kernel
* memory. Ensure that no #VE will be delivered for accesses to
* TD-private memory. Only VMM-shared memory (MMIO) will #VE.
*/
td_attr = out.rdx;
if (!(td_attr & ATTR_SEPT_VE_DISABLE))
panic("TD misconfiguration: SEPT_VE_DISABLE attibute must be set.\n");
}
/*
* The TDX module spec states that #VE may be injected for a limited set of
* reasons:
*
* - Emulation of the architectural #VE injection on EPT violation;
*
* - As a result of guest TD execution of a disallowed instruction,
* a disallowed MSR access, or CPUID virtualization;
*
* - A notification to the guest TD about anomalous behavior;
*
* The last one is opt-in and is not used by the kernel.
*
* The Intel Software Developer's Manual describes cases when instruction
* length field can be used in section "Information for VM Exits Due to
* Instruction Execution".
*
* For TDX, it ultimately means GET_VEINFO provides reliable instruction length
* information if #VE occurred due to instruction execution, but not for EPT
* violations.
*/
static int ve_instr_len(struct ve_info *ve)
{
switch (ve->exit_reason) {
case EXIT_REASON_HLT:
case EXIT_REASON_MSR_READ:
case EXIT_REASON_MSR_WRITE:
case EXIT_REASON_CPUID:
case EXIT_REASON_IO_INSTRUCTION:
/* It is safe to use ve->instr_len for #VE due instructions */
return ve->instr_len;
case EXIT_REASON_EPT_VIOLATION:
/*
* For EPT violations, ve->insn_len is not defined. For those,
* the kernel must decode instructions manually and should not
* be using this function.
*/
WARN_ONCE(1, "ve->instr_len is not defined for EPT violations");
return 0;
default:
WARN_ONCE(1, "Unexpected #VE-type: %lld\n", ve->exit_reason);
return ve->instr_len;
}
}
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_HLT),
.r12 = irq_disabled,
};
/*
* Emulate HLT operation via hypercall. More info about ABI
* can be found in TDX Guest-Host-Communication Interface
* (GHCI), section 3.8 TDG.VP.VMCALL<Instruction.HLT>.
*
* The VMM uses the "IRQ disabled" param to understand IRQ
* enabled status (RFLAGS.IF) of the TD guest and to determine
* whether or not it should schedule the halted vCPU if an
* IRQ becomes pending. E.g. if IRQs are disabled, the VMM
* can keep the vCPU in virtual HLT, even if an IRQ is
* pending, without hanging/breaking the guest.
*/
return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
}
static int handle_halt(struct ve_info *ve)
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
{
/*
* Since non safe halt is mainly used in CPU offlining
* and the guest will always stay in the halt state, don't
* call the STI instruction (set do_sti as false).
*/
const bool irq_disabled = irqs_disabled();
const bool do_sti = false;
if (__halt(irq_disabled, do_sti))
return -EIO;
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
return ve_instr_len(ve);
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
}
void __cpuidle tdx_safe_halt(void)
{
/*
* For do_sti=true case, __tdx_hypercall() function enables
* interrupts using the STI instruction before the TDCALL. So
* set irq_disabled as false.
*/
const bool irq_disabled = false;
const bool do_sti = true;
/*
* Use WARN_ONCE() to report the failure.
*/
if (__halt(irq_disabled, do_sti))
WARN_ONCE(1, "HLT instruction emulation failed\n");
}
static int read_msr(struct pt_regs *regs, struct ve_info *ve)
x86/tdx: Add MSR support for TDX guests Use hypercall to emulate MSR read/write for the TDX platform. There are two viable approaches for doing MSRs in a TD guest: 1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal do. Some will succeed, others will cause a #VE. All of those that cause a #VE will be handled with a TDCALL. 2. Use paravirt infrastructure. The paravirt hook has to keep a list of which MSRs would cause a #VE and use a TDCALL. All other MSRs execute RDMSR/WRMSR instructions directly. The second option can be ruled out because the list of MSRs was challenging to maintain. That leaves option #1 as the only viable solution for the minimal TDX support. Kernel relies on the exception fixup machinery to handle MSR access errors. #VE handler uses the same exception fixup code as #GP. It covers MSR accesses along with other types of fixups. For performance-critical MSR writes (like TSC_DEADLINE), future patches will replace the WRMSR/#VE sequence with the direct TDCALL. RDMSR and WRMSR specification details can be found in Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec titled "TDG.VP. VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>". Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-10-kirill.shutemov@linux.intel.com
2022-04-06 07:29:18 +08:00
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_MSR_READ),
.r12 = regs->cx,
};
/*
* Emulate the MSR read via hypercall. More info about ABI
* can be found in TDX Guest-Host-Communication Interface
* (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>".
*/
if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
return -EIO;
x86/tdx: Add MSR support for TDX guests Use hypercall to emulate MSR read/write for the TDX platform. There are two viable approaches for doing MSRs in a TD guest: 1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal do. Some will succeed, others will cause a #VE. All of those that cause a #VE will be handled with a TDCALL. 2. Use paravirt infrastructure. The paravirt hook has to keep a list of which MSRs would cause a #VE and use a TDCALL. All other MSRs execute RDMSR/WRMSR instructions directly. The second option can be ruled out because the list of MSRs was challenging to maintain. That leaves option #1 as the only viable solution for the minimal TDX support. Kernel relies on the exception fixup machinery to handle MSR access errors. #VE handler uses the same exception fixup code as #GP. It covers MSR accesses along with other types of fixups. For performance-critical MSR writes (like TSC_DEADLINE), future patches will replace the WRMSR/#VE sequence with the direct TDCALL. RDMSR and WRMSR specification details can be found in Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec titled "TDG.VP. VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>". Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-10-kirill.shutemov@linux.intel.com
2022-04-06 07:29:18 +08:00
regs->ax = lower_32_bits(args.r11);
regs->dx = upper_32_bits(args.r11);
return ve_instr_len(ve);
x86/tdx: Add MSR support for TDX guests Use hypercall to emulate MSR read/write for the TDX platform. There are two viable approaches for doing MSRs in a TD guest: 1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal do. Some will succeed, others will cause a #VE. All of those that cause a #VE will be handled with a TDCALL. 2. Use paravirt infrastructure. The paravirt hook has to keep a list of which MSRs would cause a #VE and use a TDCALL. All other MSRs execute RDMSR/WRMSR instructions directly. The second option can be ruled out because the list of MSRs was challenging to maintain. That leaves option #1 as the only viable solution for the minimal TDX support. Kernel relies on the exception fixup machinery to handle MSR access errors. #VE handler uses the same exception fixup code as #GP. It covers MSR accesses along with other types of fixups. For performance-critical MSR writes (like TSC_DEADLINE), future patches will replace the WRMSR/#VE sequence with the direct TDCALL. RDMSR and WRMSR specification details can be found in Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec titled "TDG.VP. VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>". Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-10-kirill.shutemov@linux.intel.com
2022-04-06 07:29:18 +08:00
}
static int write_msr(struct pt_regs *regs, struct ve_info *ve)
x86/tdx: Add MSR support for TDX guests Use hypercall to emulate MSR read/write for the TDX platform. There are two viable approaches for doing MSRs in a TD guest: 1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal do. Some will succeed, others will cause a #VE. All of those that cause a #VE will be handled with a TDCALL. 2. Use paravirt infrastructure. The paravirt hook has to keep a list of which MSRs would cause a #VE and use a TDCALL. All other MSRs execute RDMSR/WRMSR instructions directly. The second option can be ruled out because the list of MSRs was challenging to maintain. That leaves option #1 as the only viable solution for the minimal TDX support. Kernel relies on the exception fixup machinery to handle MSR access errors. #VE handler uses the same exception fixup code as #GP. It covers MSR accesses along with other types of fixups. For performance-critical MSR writes (like TSC_DEADLINE), future patches will replace the WRMSR/#VE sequence with the direct TDCALL. RDMSR and WRMSR specification details can be found in Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec titled "TDG.VP. VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>". Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-10-kirill.shutemov@linux.intel.com
2022-04-06 07:29:18 +08:00
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_MSR_WRITE),
.r12 = regs->cx,
.r13 = (u64)regs->dx << 32 | regs->ax,
};
/*
* Emulate the MSR write via hypercall. More info about ABI
* can be found in TDX Guest-Host-Communication Interface
* (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>".
*/
if (__tdx_hypercall(&args, 0))
return -EIO;
return ve_instr_len(ve);
x86/tdx: Add MSR support for TDX guests Use hypercall to emulate MSR read/write for the TDX platform. There are two viable approaches for doing MSRs in a TD guest: 1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal do. Some will succeed, others will cause a #VE. All of those that cause a #VE will be handled with a TDCALL. 2. Use paravirt infrastructure. The paravirt hook has to keep a list of which MSRs would cause a #VE and use a TDCALL. All other MSRs execute RDMSR/WRMSR instructions directly. The second option can be ruled out because the list of MSRs was challenging to maintain. That leaves option #1 as the only viable solution for the minimal TDX support. Kernel relies on the exception fixup machinery to handle MSR access errors. #VE handler uses the same exception fixup code as #GP. It covers MSR accesses along with other types of fixups. For performance-critical MSR writes (like TSC_DEADLINE), future patches will replace the WRMSR/#VE sequence with the direct TDCALL. RDMSR and WRMSR specification details can be found in Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec titled "TDG.VP. VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>". Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-10-kirill.shutemov@linux.intel.com
2022-04-06 07:29:18 +08:00
}
static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_CPUID),
.r12 = regs->ax,
.r13 = regs->cx,
};
/*
* Only allow VMM to control range reserved for hypervisor
* communication.
*
* Return all-zeros for any CPUID outside the range. It matches CPU
* behaviour for non-supported leaf.
*/
if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
regs->ax = regs->bx = regs->cx = regs->dx = 0;
return ve_instr_len(ve);
}
/*
* Emulate the CPUID instruction via a hypercall. More info about
* ABI can be found in TDX Guest-Host-Communication Interface
* (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
*/
if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
return -EIO;
/*
* As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
* EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
* So copy the register contents back to pt_regs.
*/
regs->ax = args.r12;
regs->bx = args.r13;
regs->cx = args.r14;
regs->dx = args.r15;
return ve_instr_len(ve);
}
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
static bool mmio_read(int size, unsigned long addr, unsigned long *val)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
.r12 = size,
.r13 = EPT_READ,
.r14 = addr,
.r15 = *val,
};
if (__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT))
return false;
*val = args.r11;
return true;
}
static bool mmio_write(int size, unsigned long addr, unsigned long val)
{
return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size,
EPT_WRITE, addr, val);
}
static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
{
unsigned long *reg, val, vaddr;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
char buffer[MAX_INSN_SIZE];
struct insn insn = {};
enum mmio_type mmio;
int size, extend_size;
u8 extend_val = 0;
/* Only in-kernel MMIO is supported */
if (WARN_ON_ONCE(user_mode(regs)))
return -EFAULT;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
return -EFAULT;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
return -EINVAL;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
mmio = insn_decode_mmio(&insn, &size);
if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED))
return -EINVAL;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) {
reg = insn_get_modrm_reg_ptr(&insn, regs);
if (!reg)
return -EINVAL;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
}
/*
* Reject EPT violation #VEs that split pages.
*
* MMIO accesses are supposed to be naturally aligned and therefore
* never cross page boundaries. Seeing split page accesses indicates
* a bug or a load_unaligned_zeropad() that stepped into an MMIO page.
*
* load_unaligned_zeropad() will recover using exception fixups.
*/
vaddr = (unsigned long)insn_get_addr_ref(&insn, regs);
if (vaddr / PAGE_SIZE != (vaddr + size - 1) / PAGE_SIZE)
return -EFAULT;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
/* Handle writes first */
switch (mmio) {
case MMIO_WRITE:
memcpy(&val, reg, size);
if (!mmio_write(size, ve->gpa, val))
return -EIO;
return insn.length;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
case MMIO_WRITE_IMM:
val = insn.immediate.value;
if (!mmio_write(size, ve->gpa, val))
return -EIO;
return insn.length;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
case MMIO_READ:
case MMIO_READ_ZERO_EXTEND:
case MMIO_READ_SIGN_EXTEND:
/* Reads are handled below */
break;
case MMIO_MOVS:
case MMIO_DECODE_FAILED:
/*
* MMIO was accessed with an instruction that could not be
* decoded or handled properly. It was likely not using io.h
* helpers or accessed MMIO accidentally.
*/
return -EINVAL;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
default:
WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
return -EINVAL;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
}
/* Handle reads */
if (!mmio_read(size, ve->gpa, &val))
return -EIO;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
switch (mmio) {
case MMIO_READ:
/* Zero-extend for 32-bit operation */
extend_size = size == 4 ? sizeof(*reg) : 0;
break;
case MMIO_READ_ZERO_EXTEND:
/* Zero extend based on operand size */
extend_size = insn.opnd_bytes;
break;
case MMIO_READ_SIGN_EXTEND:
/* Sign extend based on operand size */
extend_size = insn.opnd_bytes;
if (size == 1 && val & BIT(7))
extend_val = 0xFF;
else if (size > 1 && val & BIT(15))
extend_val = 0xFF;
break;
default:
/* All other cases has to be covered with the first switch() */
WARN_ON_ONCE(1);
return -EINVAL;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
}
if (extend_size)
memset(reg, extend_val, extend_size);
memcpy(reg, &val, size);
return insn.length;
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
}
x86/tdx: Port I/O: Add runtime hypercalls TDX hypervisors cannot emulate instructions directly. This includes port I/O which is normally emulated in the hypervisor. All port I/O instructions inside TDX trigger the #VE exception in the guest and would be normally emulated there. Use a hypercall to emulate port I/O. Extend the tdx_handle_virt_exception() and add support to handle the #VE due to port I/O instructions. String I/O operations are not supported in TDX. Unroll them by declaring CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute. == Userspace Implications == The ioperm() facility allows userspace access to I/O instructions like inb/outb. Among other things, this allows writing userspace device drivers. This series has no special handling for ioperm(). Users will be able to successfully request I/O permissions but will induce a #VE on their first I/O instruction which leads SIGSEGV. If this is undesirable users can enable kernel lockdown feature with 'lockdown=integrity' kernel command line option. It makes ioperm() fail. More robust handling of this situation (denying ioperm() in all TDX guests) will be addressed in follow-on work. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-06 07:29:26 +08:00
static bool handle_in(struct pt_regs *regs, int size, int port)
{
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
.r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
.r12 = size,
.r13 = PORT_READ,
.r14 = port,
};
u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
bool success;
/*
* Emulate the I/O read via hypercall. More info about ABI can be found
* in TDX Guest-Host-Communication Interface (GHCI) section titled
* "TDG.VP.VMCALL<Instruction.IO>".
*/
success = !__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT);
/* Update part of the register affected by the emulated instruction */
regs->ax &= ~mask;
if (success)
regs->ax |= args.r11 & mask;
return success;
}
static bool handle_out(struct pt_regs *regs, int size, int port)
{
u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
/*
* Emulate the I/O write via hypercall. More info about ABI can be found
* in TDX Guest-Host-Communication Interface (GHCI) section titled
* "TDG.VP.VMCALL<Instruction.IO>".
*/
return !_tdx_hypercall(hcall_func(EXIT_REASON_IO_INSTRUCTION), size,
PORT_WRITE, port, regs->ax & mask);
}
/*
* Emulate I/O using hypercall.
*
* Assumes the IO instruction was using ax, which is enforced
* by the standard io.h macros.
*
* Return True on success or False on failure.
*/
static int handle_io(struct pt_regs *regs, struct ve_info *ve)
x86/tdx: Port I/O: Add runtime hypercalls TDX hypervisors cannot emulate instructions directly. This includes port I/O which is normally emulated in the hypervisor. All port I/O instructions inside TDX trigger the #VE exception in the guest and would be normally emulated there. Use a hypercall to emulate port I/O. Extend the tdx_handle_virt_exception() and add support to handle the #VE due to port I/O instructions. String I/O operations are not supported in TDX. Unroll them by declaring CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute. == Userspace Implications == The ioperm() facility allows userspace access to I/O instructions like inb/outb. Among other things, this allows writing userspace device drivers. This series has no special handling for ioperm(). Users will be able to successfully request I/O permissions but will induce a #VE on their first I/O instruction which leads SIGSEGV. If this is undesirable users can enable kernel lockdown feature with 'lockdown=integrity' kernel command line option. It makes ioperm() fail. More robust handling of this situation (denying ioperm() in all TDX guests) will be addressed in follow-on work. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-06 07:29:26 +08:00
{
u32 exit_qual = ve->exit_qual;
x86/tdx: Port I/O: Add runtime hypercalls TDX hypervisors cannot emulate instructions directly. This includes port I/O which is normally emulated in the hypervisor. All port I/O instructions inside TDX trigger the #VE exception in the guest and would be normally emulated there. Use a hypercall to emulate port I/O. Extend the tdx_handle_virt_exception() and add support to handle the #VE due to port I/O instructions. String I/O operations are not supported in TDX. Unroll them by declaring CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute. == Userspace Implications == The ioperm() facility allows userspace access to I/O instructions like inb/outb. Among other things, this allows writing userspace device drivers. This series has no special handling for ioperm(). Users will be able to successfully request I/O permissions but will induce a #VE on their first I/O instruction which leads SIGSEGV. If this is undesirable users can enable kernel lockdown feature with 'lockdown=integrity' kernel command line option. It makes ioperm() fail. More robust handling of this situation (denying ioperm() in all TDX guests) will be addressed in follow-on work. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-06 07:29:26 +08:00
int size, port;
bool in, ret;
x86/tdx: Port I/O: Add runtime hypercalls TDX hypervisors cannot emulate instructions directly. This includes port I/O which is normally emulated in the hypervisor. All port I/O instructions inside TDX trigger the #VE exception in the guest and would be normally emulated there. Use a hypercall to emulate port I/O. Extend the tdx_handle_virt_exception() and add support to handle the #VE due to port I/O instructions. String I/O operations are not supported in TDX. Unroll them by declaring CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute. == Userspace Implications == The ioperm() facility allows userspace access to I/O instructions like inb/outb. Among other things, this allows writing userspace device drivers. This series has no special handling for ioperm(). Users will be able to successfully request I/O permissions but will induce a #VE on their first I/O instruction which leads SIGSEGV. If this is undesirable users can enable kernel lockdown feature with 'lockdown=integrity' kernel command line option. It makes ioperm() fail. More robust handling of this situation (denying ioperm() in all TDX guests) will be addressed in follow-on work. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-06 07:29:26 +08:00
if (VE_IS_IO_STRING(exit_qual))
return -EIO;
x86/tdx: Port I/O: Add runtime hypercalls TDX hypervisors cannot emulate instructions directly. This includes port I/O which is normally emulated in the hypervisor. All port I/O instructions inside TDX trigger the #VE exception in the guest and would be normally emulated there. Use a hypercall to emulate port I/O. Extend the tdx_handle_virt_exception() and add support to handle the #VE due to port I/O instructions. String I/O operations are not supported in TDX. Unroll them by declaring CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute. == Userspace Implications == The ioperm() facility allows userspace access to I/O instructions like inb/outb. Among other things, this allows writing userspace device drivers. This series has no special handling for ioperm(). Users will be able to successfully request I/O permissions but will induce a #VE on their first I/O instruction which leads SIGSEGV. If this is undesirable users can enable kernel lockdown feature with 'lockdown=integrity' kernel command line option. It makes ioperm() fail. More robust handling of this situation (denying ioperm() in all TDX guests) will be addressed in follow-on work. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-06 07:29:26 +08:00
in = VE_IS_IO_IN(exit_qual);
size = VE_GET_IO_SIZE(exit_qual);
port = VE_GET_PORT_NUM(exit_qual);
if (in)
ret = handle_in(regs, size, port);
x86/tdx: Port I/O: Add runtime hypercalls TDX hypervisors cannot emulate instructions directly. This includes port I/O which is normally emulated in the hypervisor. All port I/O instructions inside TDX trigger the #VE exception in the guest and would be normally emulated there. Use a hypercall to emulate port I/O. Extend the tdx_handle_virt_exception() and add support to handle the #VE due to port I/O instructions. String I/O operations are not supported in TDX. Unroll them by declaring CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute. == Userspace Implications == The ioperm() facility allows userspace access to I/O instructions like inb/outb. Among other things, this allows writing userspace device drivers. This series has no special handling for ioperm(). Users will be able to successfully request I/O permissions but will induce a #VE on their first I/O instruction which leads SIGSEGV. If this is undesirable users can enable kernel lockdown feature with 'lockdown=integrity' kernel command line option. It makes ioperm() fail. More robust handling of this situation (denying ioperm() in all TDX guests) will be addressed in follow-on work. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-06 07:29:26 +08:00
else
ret = handle_out(regs, size, port);
if (!ret)
return -EIO;
return ve_instr_len(ve);
x86/tdx: Port I/O: Add runtime hypercalls TDX hypervisors cannot emulate instructions directly. This includes port I/O which is normally emulated in the hypervisor. All port I/O instructions inside TDX trigger the #VE exception in the guest and would be normally emulated there. Use a hypercall to emulate port I/O. Extend the tdx_handle_virt_exception() and add support to handle the #VE due to port I/O instructions. String I/O operations are not supported in TDX. Unroll them by declaring CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute. == Userspace Implications == The ioperm() facility allows userspace access to I/O instructions like inb/outb. Among other things, this allows writing userspace device drivers. This series has no special handling for ioperm(). Users will be able to successfully request I/O permissions but will induce a #VE on their first I/O instruction which leads SIGSEGV. If this is undesirable users can enable kernel lockdown feature with 'lockdown=integrity' kernel command line option. It makes ioperm() fail. More robust handling of this situation (denying ioperm() in all TDX guests) will be addressed in follow-on work. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-06 07:29:26 +08:00
}
/*
* Early #VE exception handler. Only handles a subset of port I/O.
* Intended only for earlyprintk. If failed, return false.
*/
__init bool tdx_early_handle_ve(struct pt_regs *regs)
{
struct ve_info ve;
int insn_len;
tdx_get_ve_info(&ve);
if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION)
return false;
insn_len = handle_io(regs, &ve);
if (insn_len < 0)
return false;
regs->ip += insn_len;
return true;
}
x86/traps: Add #VE support for TDX guest Virtualization Exceptions (#VE) are delivered to TDX guests due to specific guest actions which may happen in either user space or the kernel: * Specific instructions (WBINVD, for example) * Specific MSR accesses * Specific CPUID leaf accesses * Access to specific guest physical addresses Syscall entry code has a critical window where the kernel stack is not yet set up. Any exception in this window leads to hard to debug issues and can be exploited for privilege escalation. Exceptions in the NMI entry code also cause issues. Returning from the exception handler with IRET will re-enable NMIs and nested NMI will corrupt the NMI stack. For these reasons, the kernel avoids #VEs during the syscall gap and the NMI entry code. Entry code paths do not access TD-shared memory, MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves that might generate #VE. VMM can remove memory from TD at any point, but access to unaccepted (or missing) private memory leads to VM termination, not to #VE. Similarly to page faults and breakpoints, #VEs are allowed in NMI handlers once the kernel is ready to deal with nested NMIs. During #VE delivery, all interrupts, including NMIs, are blocked until TDGETVEINFO is called. It prevents #VE nesting until the kernel reads the VE info. TDGETVEINFO retrieves the #VE info from the TDX module, which also clears the "#VE valid" flag. This must be done before anything else as any #VE that occurs while the valid flag is set escalates to #DF by TDX module. It will result in an oops. Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be delivered until TDGETVEINFO is called. For now, convert unhandled #VE's (everything, until later in this series) so that they appear just like a #GP by calling the ve_raise_fault() directly. The ve_raise_fault() function is similar to #GP handler and is responsible for sending SIGSEGV to userspace and CPU die and notifying debuggers and other die chain users. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 07:29:16 +08:00
void tdx_get_ve_info(struct ve_info *ve)
{
struct tdx_module_output out;
/*
* Called during #VE handling to retrieve the #VE info from the
* TDX module.
*
* This has to be called early in #VE handling. A "nested" #VE which
* occurs before this will raise a #DF and is not recoverable.
*
* The call retrieves the #VE info from the TDX module, which also
* clears the "#VE valid" flag. This must be done before anything else
* because any #VE that occurs while the valid flag is set will lead to
* #DF.
*
* Note, the TDX module treats virtual NMIs as inhibited if the #VE
* valid flag is set. It means that NMI=>#VE will not result in a #DF.
*/
tdx_module_call(TDX_GET_VEINFO, 0, 0, 0, 0, &out);
/* Transfer the output parameters */
ve->exit_reason = out.rcx;
ve->exit_qual = out.rdx;
ve->gla = out.r8;
ve->gpa = out.r9;
ve->instr_len = lower_32_bits(out.r10);
ve->instr_info = upper_32_bits(out.r10);
}
/*
* Handle the user initiated #VE.
*
* On success, returns the number of bytes RIP should be incremented (>=0)
* or -errno on error.
*/
static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
{
switch (ve->exit_reason) {
case EXIT_REASON_CPUID:
return handle_cpuid(regs, ve);
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EIO;
}
}
/*
* Handle the kernel #VE.
*
* On success, returns the number of bytes RIP should be incremented (>=0)
* or -errno on error.
*/
static int virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
{
switch (ve->exit_reason) {
case EXIT_REASON_HLT:
return handle_halt(ve);
x86/tdx: Add MSR support for TDX guests Use hypercall to emulate MSR read/write for the TDX platform. There are two viable approaches for doing MSRs in a TD guest: 1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal do. Some will succeed, others will cause a #VE. All of those that cause a #VE will be handled with a TDCALL. 2. Use paravirt infrastructure. The paravirt hook has to keep a list of which MSRs would cause a #VE and use a TDCALL. All other MSRs execute RDMSR/WRMSR instructions directly. The second option can be ruled out because the list of MSRs was challenging to maintain. That leaves option #1 as the only viable solution for the minimal TDX support. Kernel relies on the exception fixup machinery to handle MSR access errors. #VE handler uses the same exception fixup code as #GP. It covers MSR accesses along with other types of fixups. For performance-critical MSR writes (like TSC_DEADLINE), future patches will replace the WRMSR/#VE sequence with the direct TDCALL. RDMSR and WRMSR specification details can be found in Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec titled "TDG.VP. VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>". Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-10-kirill.shutemov@linux.intel.com
2022-04-06 07:29:18 +08:00
case EXIT_REASON_MSR_READ:
return read_msr(regs, ve);
x86/tdx: Add MSR support for TDX guests Use hypercall to emulate MSR read/write for the TDX platform. There are two viable approaches for doing MSRs in a TD guest: 1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal do. Some will succeed, others will cause a #VE. All of those that cause a #VE will be handled with a TDCALL. 2. Use paravirt infrastructure. The paravirt hook has to keep a list of which MSRs would cause a #VE and use a TDCALL. All other MSRs execute RDMSR/WRMSR instructions directly. The second option can be ruled out because the list of MSRs was challenging to maintain. That leaves option #1 as the only viable solution for the minimal TDX support. Kernel relies on the exception fixup machinery to handle MSR access errors. #VE handler uses the same exception fixup code as #GP. It covers MSR accesses along with other types of fixups. For performance-critical MSR writes (like TSC_DEADLINE), future patches will replace the WRMSR/#VE sequence with the direct TDCALL. RDMSR and WRMSR specification details can be found in Guest-Host-Communication Interface (GHCI) for Intel Trust Domain Extensions (Intel TDX) specification, sec titled "TDG.VP. VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>". Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-10-kirill.shutemov@linux.intel.com
2022-04-06 07:29:18 +08:00
case EXIT_REASON_MSR_WRITE:
return write_msr(regs, ve);
case EXIT_REASON_CPUID:
return handle_cpuid(regs, ve);
x86/tdx: Handle in-kernel MMIO In non-TDX VMs, MMIO is implemented by providing the guest a mapping which will cause a VMEXIT on access and then the VMM emulating the instruction that caused the VMEXIT. That's not possible for TDX VM. To emulate an instruction an emulator needs two things: - R/W access to the register file to read/modify instruction arguments and see RIP of the faulted instruction. - Read access to memory where instruction is placed to see what to emulate. In this case it is guest kernel text. Both of them are not available to VMM in TDX environment: - Register file is never exposed to VMM. When a TD exits to the module, it saves registers into the state-save area allocated for that TD. The module then scrubs these registers before returning execution control to the VMM, to help prevent leakage of TD state. - TDX does not allow guests to execute from shared memory. All executed instructions are in TD-private memory. Being private to the TD, VMMs have no way to access TD-private memory and no way to read the instruction to decode and emulate it. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. Add #VE handling that emulates the MMIO instruction inside the guest and converts it into a controlled hypercall to the host. This approach is bad for performance. But, it has (virtually) no impact on the size of the kernel image and will work for a wide variety of drivers. This allows TDX deployments to use arbitrary devices and device drivers, including virtio. TDX customers have asked for the capability to use random devices in their deployments. In other words, even if all of the work was done to paravirtualize all x86 MMIO users and virtio, this approach would still be needed. There is essentially no way to get rid of this code. This approach is functional for all in-kernel MMIO users current and future and does so with a minimal amount of code and kernel image bloat. MMIO addresses can be used with any CPU instruction that accesses memory. Address only MMIO accesses done via io.h helpers, such as 'readl()' or 'writeq()'. Any CPU instruction that accesses memory can also be used to access MMIO. However, by convention, MMIO access are typically performed via io.h helpers such as 'readl()' or 'writeq()'. The io.h helpers intentionally use a limited set of instructions when accessing MMIO. This known, limited set of instructions makes MMIO instruction decoding and emulation feasible in KVM hosts and SEV guests today. MMIO accesses performed without the io.h helpers are at the mercy of the compiler. Compilers can and will generate a much more broad set of instructions which can not practically be decoded and emulated. TDX guests will oops if they encounter one of these decoding failures. This means that TDX guests *must* use the io.h helpers to access MMIO. This requirement is not new. Both KVM hosts and AMD SEV guests have the same limitations on MMIO access. === Potential alternative approaches === == Paravirtualizing all MMIO == An alternative to letting MMIO induce a #VE exception is to avoid the #VE in the first place. Similar to the port I/O case, it is theoretically possible to paravirtualize MMIO accesses. Like the exception-based approach offered here, a fully paravirtualized approach would be limited to MMIO users that leverage common infrastructure like the io.h macros. However, any paravirtual approach would be patching approximately 120k call sites. Any paravirtual approach would need to replace a bare memory access instruction with (at least) a function call. With a conservative overhead estimation of 5 bytes per call site (CALL instruction), it leads to bloating code by 600k. Many drivers will never be used in the TDX environment and the bloat cannot be justified. == Patching TDX drivers == Rather than touching the entire kernel, it might also be possible to just go after drivers that use MMIO in TDX guests *and* are performance critical to justify the effrort. Right now, that's limited only to virtio. All virtio MMIO appears to be done through a single function, which makes virtio eminently easy to patch. This approach will be adopted in the future, removing the bulk of MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 07:29:20 +08:00
case EXIT_REASON_EPT_VIOLATION:
return handle_mmio(regs, ve);
x86/tdx: Port I/O: Add runtime hypercalls TDX hypervisors cannot emulate instructions directly. This includes port I/O which is normally emulated in the hypervisor. All port I/O instructions inside TDX trigger the #VE exception in the guest and would be normally emulated there. Use a hypercall to emulate port I/O. Extend the tdx_handle_virt_exception() and add support to handle the #VE due to port I/O instructions. String I/O operations are not supported in TDX. Unroll them by declaring CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute. == Userspace Implications == The ioperm() facility allows userspace access to I/O instructions like inb/outb. Among other things, this allows writing userspace device drivers. This series has no special handling for ioperm(). Users will be able to successfully request I/O permissions but will induce a #VE on their first I/O instruction which leads SIGSEGV. If this is undesirable users can enable kernel lockdown feature with 'lockdown=integrity' kernel command line option. It makes ioperm() fail. More robust handling of this situation (denying ioperm() in all TDX guests) will be addressed in follow-on work. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-06 07:29:26 +08:00
case EXIT_REASON_IO_INSTRUCTION:
return handle_io(regs, ve);
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
default:
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
return -EIO;
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
}
}
x86/traps: Add #VE support for TDX guest Virtualization Exceptions (#VE) are delivered to TDX guests due to specific guest actions which may happen in either user space or the kernel: * Specific instructions (WBINVD, for example) * Specific MSR accesses * Specific CPUID leaf accesses * Access to specific guest physical addresses Syscall entry code has a critical window where the kernel stack is not yet set up. Any exception in this window leads to hard to debug issues and can be exploited for privilege escalation. Exceptions in the NMI entry code also cause issues. Returning from the exception handler with IRET will re-enable NMIs and nested NMI will corrupt the NMI stack. For these reasons, the kernel avoids #VEs during the syscall gap and the NMI entry code. Entry code paths do not access TD-shared memory, MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves that might generate #VE. VMM can remove memory from TD at any point, but access to unaccepted (or missing) private memory leads to VM termination, not to #VE. Similarly to page faults and breakpoints, #VEs are allowed in NMI handlers once the kernel is ready to deal with nested NMIs. During #VE delivery, all interrupts, including NMIs, are blocked until TDGETVEINFO is called. It prevents #VE nesting until the kernel reads the VE info. TDGETVEINFO retrieves the #VE info from the TDX module, which also clears the "#VE valid" flag. This must be done before anything else as any #VE that occurs while the valid flag is set escalates to #DF by TDX module. It will result in an oops. Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be delivered until TDGETVEINFO is called. For now, convert unhandled #VE's (everything, until later in this series) so that they appear just like a #GP by calling the ve_raise_fault() directly. The ve_raise_fault() function is similar to #GP handler and is responsible for sending SIGSEGV to userspace and CPU die and notifying debuggers and other die chain users. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 07:29:16 +08:00
bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
{
int insn_len;
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
if (user_mode(regs))
insn_len = virt_exception_user(regs, ve);
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
else
insn_len = virt_exception_kernel(regs, ve);
if (insn_len < 0)
return false;
x86/tdx: Add HLT support for TDX guests The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 07:29:17 +08:00
/* After successful #VE handling, move the IP */
regs->ip += insn_len;
x86/traps: Add #VE support for TDX guest Virtualization Exceptions (#VE) are delivered to TDX guests due to specific guest actions which may happen in either user space or the kernel: * Specific instructions (WBINVD, for example) * Specific MSR accesses * Specific CPUID leaf accesses * Access to specific guest physical addresses Syscall entry code has a critical window where the kernel stack is not yet set up. Any exception in this window leads to hard to debug issues and can be exploited for privilege escalation. Exceptions in the NMI entry code also cause issues. Returning from the exception handler with IRET will re-enable NMIs and nested NMI will corrupt the NMI stack. For these reasons, the kernel avoids #VEs during the syscall gap and the NMI entry code. Entry code paths do not access TD-shared memory, MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves that might generate #VE. VMM can remove memory from TD at any point, but access to unaccepted (or missing) private memory leads to VM termination, not to #VE. Similarly to page faults and breakpoints, #VEs are allowed in NMI handlers once the kernel is ready to deal with nested NMIs. During #VE delivery, all interrupts, including NMIs, are blocked until TDGETVEINFO is called. It prevents #VE nesting until the kernel reads the VE info. TDGETVEINFO retrieves the #VE info from the TDX module, which also clears the "#VE valid" flag. This must be done before anything else as any #VE that occurs while the valid flag is set escalates to #DF by TDX module. It will result in an oops. Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be delivered until TDGETVEINFO is called. For now, convert unhandled #VE's (everything, until later in this series) so that they appear just like a #GP by calling the ve_raise_fault() directly. The ve_raise_fault() function is similar to #GP handler and is responsible for sending SIGSEGV to userspace and CPU die and notifying debuggers and other die chain users. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 07:29:16 +08:00
return true;
x86/traps: Add #VE support for TDX guest Virtualization Exceptions (#VE) are delivered to TDX guests due to specific guest actions which may happen in either user space or the kernel: * Specific instructions (WBINVD, for example) * Specific MSR accesses * Specific CPUID leaf accesses * Access to specific guest physical addresses Syscall entry code has a critical window where the kernel stack is not yet set up. Any exception in this window leads to hard to debug issues and can be exploited for privilege escalation. Exceptions in the NMI entry code also cause issues. Returning from the exception handler with IRET will re-enable NMIs and nested NMI will corrupt the NMI stack. For these reasons, the kernel avoids #VEs during the syscall gap and the NMI entry code. Entry code paths do not access TD-shared memory, MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves that might generate #VE. VMM can remove memory from TD at any point, but access to unaccepted (or missing) private memory leads to VM termination, not to #VE. Similarly to page faults and breakpoints, #VEs are allowed in NMI handlers once the kernel is ready to deal with nested NMIs. During #VE delivery, all interrupts, including NMIs, are blocked until TDGETVEINFO is called. It prevents #VE nesting until the kernel reads the VE info. TDGETVEINFO retrieves the #VE info from the TDX module, which also clears the "#VE valid" flag. This must be done before anything else as any #VE that occurs while the valid flag is set escalates to #DF by TDX module. It will result in an oops. Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be delivered until TDGETVEINFO is called. For now, convert unhandled #VE's (everything, until later in this series) so that they appear just like a #GP by calling the ve_raise_fault() directly. The ve_raise_fault() function is similar to #GP handler and is responsible for sending SIGSEGV to userspace and CPU die and notifying debuggers and other die chain users. Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 07:29:16 +08:00
}
static bool tdx_tlb_flush_required(bool private)
{
/*
* TDX guest is responsible for flushing TLB on private->shared
* transition. VMM is responsible for flushing on shared->private.
*
* The VMM _can't_ flush private addresses as it can't generate PAs
* with the guest's HKID. Shared memory isn't subject to integrity
* checking, i.e. the VMM doesn't need to flush for its own protection.
*
* There's no need to flush when converting from shared to private,
* as flushing is the VMM's responsibility in this case, e.g. it must
* flush to avoid integrity failures in the face of a buggy or
* malicious guest.
*/
return !private;
}
static bool tdx_cache_flush_required(void)
{
/*
* AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
* TDX doesn't have such capability.
*
* Flush cache unconditionally.
*/
return true;
}
static bool try_accept_one(phys_addr_t *start, unsigned long len,
enum pg_level pg_level)
{
unsigned long accept_size = page_level_size(pg_level);
u64 tdcall_rcx;
u8 page_size;
if (!IS_ALIGNED(*start, accept_size))
return false;
if (len < accept_size)
return false;
/*
* Pass the page physical address to the TDX module to accept the
* pending, private page.
*
* Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
*/
switch (pg_level) {
case PG_LEVEL_4K:
page_size = 0;
break;
case PG_LEVEL_2M:
page_size = 1;
break;
case PG_LEVEL_1G:
page_size = 2;
break;
default:
return false;
}
tdcall_rcx = *start | page_size;
if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
return false;
*start += accept_size;
return true;
}
/*
* Inform the VMM of the guest's intent for this physical page: shared with
* the VMM or private to the guest. The VMM is expected to change its mapping
* of the page in response.
*/
static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
{
phys_addr_t start = __pa(vaddr);
phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
if (!enc) {
/* Set the shared (decrypted) bits: */
start |= cc_mkdec(0);
end |= cc_mkdec(0);
}
/*
* Notify the VMM about page mapping conversion. More info about ABI
* can be found in TDX Guest-Host-Communication Interface (GHCI),
* section "TDG.VP.VMCALL<MapGPA>"
*/
if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
return false;
/* private->shared conversion requires only MapGPA call */
if (!enc)
return true;
/*
* For shared->private conversion, accept the page using
* TDX_ACCEPT_PAGE TDX module call.
*/
while (start < end) {
unsigned long len = end - start;
/*
* Try larger accepts first. It gives chance to VMM to keep
* 1G/2M SEPT entries where possible and speeds up process by
* cutting number of hypercalls (if successful).
*/
if (try_accept_one(&start, len, PG_LEVEL_1G))
continue;
if (try_accept_one(&start, len, PG_LEVEL_2M))
continue;
if (!try_accept_one(&start, len, PG_LEVEL_4K))
return false;
}
return true;
}
void __init tdx_early_init(void)
{
u64 cc_mask;
u32 eax, sig[3];
cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
if (memcmp(TDX_IDENT, sig, sizeof(sig)))
return;
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
cc_set_vendor(CC_VENDOR_INTEL);
tdx_parse_tdinfo(&cc_mask);
cc_set_mask(cc_mask);
/*
* All bits above GPA width are reserved and kernel treats shared bit
* as flag, not as part of physical address.
*
* Adjust physical mask to only cover valid GPA bits.
*/
physical_mask &= cc_mask - 1;
x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
pr_info("Guest detected\n");
}