linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Sean Christopherson	8df9f1af2e	KVM: x86/mmu: Skip !MMU-present SPTEs when removing SP in exclusive mode If mmu_lock is held for write, don't bother setting !PRESENT SPTEs to REMOVED_SPTE when recursively zapping SPTEs as part of shadow page removal. The concurrent write protections provided by REMOVED_SPTE are not needed, there are no backing page side effects to record, and MMIO SPTEs can be left as is since they are protected by the memslot generation, not by ensuring that the MMIO SPTE is unreachable (which is racy with respect to lockless walks regardless of zapping behavior). Skipping !PRESENT drastically reduces the number of updates needed to tear down sparsely populated MMUs, e.g. when tearing down a 6gb VM that didn't touch much memory, 6929/7168 (~96.6%) of SPTEs were '0' and could be skipped. Avoiding the write itself is likely close to a wash, but avoiding __handle_changed_spte() is a clear-cut win as that involves saving and restoring all non-volatile GPRs (it's a subtly big function), as well as several conditional branches before bailing out. Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210310003029.1250571-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-03-12 13:18:52 -05:00
Wanpeng Li	d7eb79c629	KVM: kvmclock: Fix vCPUs > 64 can't be online/hotpluged # lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 88 On-line CPU(s) list: 0-63 Off-line CPU(s) list: 64-87 # cat /proc/cmdline BOOT_IMAGE=/vmlinuz-5.10.0-rc3-tlinux2-0050+ root=/dev/mapper/cl-root ro rd.lvm.lv=cl/root rhgb quiet console=ttyS0 LANG=en_US.UTF-8 no-kvmclock-vsyscall # echo 1 > /sys/devices/system/cpu/cpu76/online -bash: echo: write error: Cannot allocate memory The per-cpu vsyscall pvclock data pointer is assigned either an element of the static array hv_clock_boot (#vCPU <= 64) or dynamically allocated memory hvclock_mem (vCPU > 64). The dynamic memory is not allocated if kvmclock vsyscall is disabled, so CPU hotplug can fail in kvmclock_setup_percpu(), which returns -ENOMEM. It's broken for no-vsyscall, and sometimes you end up with vsyscall disabled if the host does something strange. This patch fixes it by allocating the dynamic memory unconditionally, even if vsyscall is disabled. Fixes: `6a1cac56f4` ("x86/kvm: Use __bss_decrypted attribute in shared variables") Reported-by: Zelin Deng <zelin.deng@linux.alibaba.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: stable@vger.kernel.org#v4.19-rc5+ Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1614130683-24137-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-03-12 13:18:16 -05:00
Muhammad Usama Anjum	6fcd9cbc6a	kvm: x86: annotate RCU pointers This patch adds the annotation to fix the following sparse errors: arch/x86/kvm//x86.c:8147:15: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//x86.c:8147:15: struct kvm_apic_map [noderef] __rcu * arch/x86/kvm//x86.c:8147:15: struct kvm_apic_map * arch/x86/kvm//x86.c:10628:16: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//x86.c:10628:16: struct kvm_apic_map [noderef] __rcu * arch/x86/kvm//x86.c:10628:16: struct kvm_apic_map * arch/x86/kvm//x86.c:10629:15: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//x86.c:10629:15: struct kvm_pmu_event_filter [noderef] __rcu * arch/x86/kvm//x86.c:10629:15: struct kvm_pmu_event_filter * arch/x86/kvm//lapic.c:267:15: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//lapic.c:267:15: struct kvm_apic_map [noderef] __rcu * arch/x86/kvm//lapic.c:267:15: struct kvm_apic_map * arch/x86/kvm//lapic.c:269:9: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//lapic.c:269:9: struct kvm_apic_map [noderef] __rcu * arch/x86/kvm//lapic.c:269:9: struct kvm_apic_map * arch/x86/kvm//lapic.c:637:15: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//lapic.c:637:15: struct kvm_apic_map [noderef] __rcu * arch/x86/kvm//lapic.c:637:15: struct kvm_apic_map * arch/x86/kvm//lapic.c:994:15: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//lapic.c:994:15: struct kvm_apic_map [noderef] __rcu * arch/x86/kvm//lapic.c:994:15: struct kvm_apic_map * arch/x86/kvm//lapic.c:1036:15: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//lapic.c:1036:15: struct kvm_apic_map [noderef] __rcu * arch/x86/kvm//lapic.c:1036:15: struct kvm_apic_map * arch/x86/kvm//lapic.c:1173:15: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//lapic.c:1173:15: struct kvm_apic_map [noderef] __rcu * arch/x86/kvm//lapic.c:1173:15: struct kvm_apic_map * arch/x86/kvm//pmu.c:190:18: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//pmu.c:190:18: struct kvm_pmu_event_filter [noderef] __rcu * arch/x86/kvm//pmu.c:190:18: struct kvm_pmu_event_filter * arch/x86/kvm//pmu.c:251:18: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//pmu.c:251:18: struct kvm_pmu_event_filter [noderef] __rcu * arch/x86/kvm//pmu.c:251:18: struct kvm_pmu_event_filter * arch/x86/kvm//pmu.c:522:18: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//pmu.c:522:18: struct kvm_pmu_event_filter [noderef] __rcu * arch/x86/kvm//pmu.c:522:18: struct kvm_pmu_event_filter * arch/x86/kvm//pmu.c:522:18: error: incompatible types in comparison expression (different address spaces): arch/x86/kvm//pmu.c:522:18: struct kvm_pmu_event_filter [noderef] __rcu * arch/x86/kvm//pmu.c:522:18: struct kvm_pmu_event_filter * Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com> Message-Id: <20210305191123.GA497469@LEGION> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-03-12 13:17:41 -05:00
Muhammad Usama Anjum	4691453406	kvm: x86: use NULL instead of using plain integer as pointer Sparse warnings removed: warning: Using plain integer as NULL pointer Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com> Message-Id: <20210305180816.GA488770@LEGION> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-03-06 04:18:39 -05:00
Sean Christopherson	99840a7545	KVM: SVM: Connect 'npt' module param to KVM's internal 'npt_enabled' Directly connect the 'npt' param to the 'npt_enabled' variable so that runtime adjustments to npt_enabled are reflected in sysfs. Move the !PAE restriction to a runtime check to ensure NPT is forced off if the host is using 2-level paging, and add a comment explicitly stating why NPT requires a 64-bit kernel or a kernel with PAE enabled. Opportunistically switch the param to octal permissions. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210305021637.3768573-1-seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-03-05 08:33:15 -05:00
Sean Christopherson	beda430177	KVM: x86: Ensure deadline timer has truly expired before posting its IRQ When posting a deadline timer interrupt, open code the checks guarding __kvm_wait_lapic_expire() in order to skip the lapic_timer_int_injected() check in kvm_wait_lapic_expire(). The injection check will always fail since the interrupt has not yet been injected. Moving the call after injection would also be wrong, as that wouldn't actually delay delivery of the IRQ if it is indeed sent via posted interrupt. Fixes: `010fd37fdd` ("KVM: LAPIC: Reduce world switch latency caused by timer_advance_ns") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210305021808.3769732-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-03-05 08:30:21 -05:00
Babu Moger	9e46f6c6c9	KVM: SVM: Clear the CR4 register on reset This problem was reported on an SVM guest while executing kexec. Kexec fails to load the new kernel when the PCID feature is enabled. When kexec starts loading the new kernel, it starts the process by resetting the vCPUs and then bringing each vCPU online one by one. The vCPU reset is supposed to reset all the register states before the vCPUs are brought online. However, the CR4 register is not reset during this process. If this register was already set up during the last boot, all the flags can remain intact. The X86_CR4_PCIDE bit can only be enabled in long mode, so it must be enabled much later in SMP initialization. Having the X86_CR4_PCIDE bit set during SMP boot can cause boot failures. Fix the issue by resetting the CR4 register in init_vmcb(). Signed-off-by: Babu Moger <babu.moger@amd.com> Message-Id: <161471109108.30811.6392805173629704166.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-03-02 14:39:11 -05:00
David Woodhouse	30b5c851af	KVM: x86/xen: Add support for vCPU runstate information This is how Xen guests do steal time accounting. The hypervisor records the amount of time spent in each of running/runnable/blocked/offline states. In the Xen accounting, a vCPU is still in state RUNSTATE_running while in Xen for a hypercall or I/O trap, etc. Only if Xen explicitly schedules does the state become RUNSTATE_blocked. In KVM this means that even when the vCPU exits the kvm_run loop, the state remains RUNSTATE_running. The VMM can explicitly set the vCPU to RUNSTATE_blocked by using the KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT attribute, and can also use KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST to retrospectively add a given amount of time to the blocked state and subtract it from the running state. The state_entry_time corresponds to get_kvmclock_ns() at the time the vCPU entered the current state, and the total times of all four states should always add up to state_entry_time. Co-developed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20210301125309.874953-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-03-02 14:30:54 -05:00
David Woodhouse	7d7c5f76e5	KVM: x86/xen: Fix return code when clearing vcpu_info and vcpu_time_info When clearing the per-vCPU shared regions, set the return value to zero to indicate success. This was causing spurious errors to be returned to userspace on soft reset. Also add a paranoid BUILD_BUG_ON() for compat structure compatibility. Fixes: `0c165b3c01` ("KVM: x86/xen: Allow reset of Xen attributes") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20210301125309.874953-1-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-03-02 14:30:54 -05:00
Paolo Bonzini	b59b153d10	KVM: x86: allow compiling out the Xen hypercall interface The Xen hypercall interface adds to the attack surface of the hypervisor and will be used quite rarely. Allow compiling it out. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-03-02 14:30:45 -05:00
Paolo Bonzini	c462f859f8	KVM: xen: flush deferred static key before checking it A missing flush would cause the static branch to trigger incorrectly. Cc: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-26 05:13:02 -05:00
Sean Christopherson	44ac5958a6	KVM: x86/mmu: Set SPTE_AD_WRPROT_ONLY_MASK if and only if PML is enabled Check that PML is actually enabled before setting the mask to force a SPTE to be write-protected. The bits used for the !AD_ENABLED case are in the upper half of the SPTE. With 64-bit paging and EPT, these bits are ignored, but with 32-bit PAE paging they are reserved. Setting them for L2 SPTEs without checking PML breaks NPT on 32-bit KVM. Fixes: `1f4e5fc83a` ("KVM: x86: fix nested guest live migration with PML") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-26 04:43:29 -05:00
Wanpeng Li	919f4ebc59	KVM: x86: hyper-v: Fix Hyper-V context null-ptr-deref Reported by syzkaller: KASAN: null-ptr-deref in range [0x0000000000000140-0x0000000000000147] CPU: 1 PID: 8370 Comm: syz-executor859 Not tainted 5.11.0-syzkaller #0 RIP: 0010:synic_get arch/x86/kvm/hyperv.c:165 [inline] RIP: 0010:kvm_hv_set_sint_gsi arch/x86/kvm/hyperv.c:475 [inline] RIP: 0010:kvm_hv_irq_routing_update+0x230/0x460 arch/x86/kvm/hyperv.c:498 Call Trace: kvm_set_irq_routing+0x69b/0x940 arch/x86/kvm/../../../virt/kvm/irqchip.c:223 kvm_vm_ioctl+0x12d0/0x2800 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3959 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae The Hyper-V context is allocated lazily: it is not allocated until Hyper-V specific MSRs are accessed or SynIC is enabled. However, the syzkaller testcase sets the IRQ routing table directly without enabling SynIC. This results in a null-ptr-deref when accessing the SynIC Hyper-V context. This patch fixes it. syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=163342ccd00000 Reported-by: syzbot+6987f3b2dbd9eda95f12@syzkaller.appspotmail.com Fixes: `8f014550df` ("KVM: x86: hyper-v: Make Hyper-V emulation enablement conditional") Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1614326399-5762-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-26 03:16:50 -05:00
Dongli Zhang	ffe76c24c5	KVM: x86: remove misplaced comment on active_mmu_pages The 'mmu_page_hash' is used as hash table while 'active_mmu_pages' is a list. Remove the misplaced comment as it's mostly stating the obvious anyways. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210226061945.1222-1-dongli.zhang@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-26 03:03:29 -05:00
Sean Christopherson	2df8d3807c	KVM: SVM: Fix nested VM-Exit on #GP interception handling Fix the interpretation of nested_svm_vmexit()'s return value when synthesizing a nested VM-Exit after intercepting an SVM instruction while L2 was running. The helper returns '0' on success, whereas a return value of '0' in the exit handler path means "exit to userspace". The incorrect return value causes KVM to exit to userspace without filling the run state, e.g. QEMU logs "KVM: unknown exit, hardware reason 0". Fixes: `14c2bf81fc` ("KVM: SVM: Fix #GP handling for doubly-nested virtualization") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210224005627.657028-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-25 05:13:05 -05:00
Like Xu	67b45af946	KVM: vmx/pmu: Fix dummy check if lbr_desc->event is created If lbr_desc->event is successfully created, intel_pmu_create_guest_lbr_event() will return 0; otherwise it will return -ENOENT and then jump to the LBR MSRs dummy handling. Fixes: `1b5ac3226a` ("KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE") Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20210223013958.1280444-1-like.xu@linux.intel.com> [Add "< 0" and PTR_ERR to make the code clearer. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-23 12:07:35 -05:00
David Stevens	4a42d848db	KVM: x86/mmu: Consider the hva in mmu_notifier retry Track the range being invalidated by mmu_notifier and skip page fault retries if the fault address is not affected by the in-progress invalidation. Handle concurrent invalidations by finding the minimal range which includes all ranges being invalidated. Although the combined range may include unrelated addresses and cannot be shrunk as individual invalidation operations complete, it is unlikely the marginal gains of proper range tracking are worth the additional complexity. The primary benefit of this change is the reduction in the likelihood of extreme latency when handling a page fault due to another thread having been preempted while modifying host virtual addresses. Signed-off-by: David Stevens <stevensd@chromium.org> Message-Id: <20210222024522.1751719-3-stevensd@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-22 13:16:53 -05:00
Sean Christopherson	5f8a7cf25a	KVM: x86/mmu: Skip mmu_notifier check when handling MMIO page fault Don't retry a page fault due to an mmu_notifier invalidation when handling a page fault for a GPA that did not resolve to a memslot, i.e. an MMIO page fault. Invalidations from the mmu_notifier signal a change in a host virtual address (HVA) mapping; without a memslot, there is no HVA and thus no possibility that the invalidation is relevant to the page fault being handled. Note, the MMIO vs. memslot generation checks handle the case where a pending memslot will create a memslot overlapping the faulting GPA. The mmu_notifier checks are orthogonal to memslot updates. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210222024522.1751719-2-stevensd@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-22 13:13:30 -05:00
Paolo Bonzini	d2df592fd8	KVM: nSVM: prepare guest save area while is_guest_mode is true Right now, enter_svm_guest_mode is calling nested_prepare_vmcb_save and nested_prepare_vmcb_control. This results in is_guest_mode being false until the end of nested_prepare_vmcb_control. This is a problem because nested_prepare_vmcb_save can in turn cause changes to the intercepts, and these have to be applied to the "host VMCB" (stored in svm->nested.hsave) and then merged with the VMCB12 intercepts into svm->vmcb. In particular, without this change we forget to set the CR0 read and CR0 write intercepts when running a real mode L2 guest with NPT disabled. The guest is therefore able to see the CR0.PG bit that KVM sets to enable "paged real mode". This patch fixes the svm.flat mode_switch test case with npt=0. There are no other problematic calls in nested_prepare_vmcb_save. Setting is_guest_mode at the end has been done since commit `06fc777269` ("KVM: SVM: Activate nested state only when guest state is complete", 2010-04-25). However, back then KVM didn't grab a different VMCB when updating the intercepts; it had already copied/merged L1's stuff to L0's VMCB, and then updated L0's VMCB regardless of is_nested(). Later, recalc_intercepts was introduced in commit `384c636843` ("KVM: SVM: Add function to recalculate intercept masks", 2011-01-12). This introduced the bug, because recalc_intercepts now throws away the intercept manipulations that svm_set_cr0 had done in the meanwhile to svm->vmcb. [1] https://lore.kernel.org/kvm/1266493115-28386-1-git-send-email-joerg.roedel@amd.com/ Reviewed-by: Sean Christopherson <seanjc@google.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-22 13:11:56 -05:00
Sean Christopherson	96ad91ae4e	KVM: x86/mmu: Remove a variety of unnecessary exports Remove several exports from the MMU that are no longer necessary. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:35 -05:00
Sean Christopherson	a1419f8b5b	KVM: x86: Fold "write-protect large" use case into generic write-protect Drop kvm_mmu_slot_largepage_remove_write_access() and refactor its sole caller to use kvm_mmu_slot_remove_write_access(). Remove the now-unused slot_handle_large_level() and slot_handle_all_level() helpers. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:35 -05:00
Sean Christopherson	b6e16ae5d9	KVM: x86/mmu: Don't set dirty bits when disabling dirty logging w/ PML Stop setting dirty bits for MMU pages when dirty logging is disabled for a memslot, as PML is now completely disabled when there are no memslots with dirty logging enabled. This means that spurious PML entries will be created for memslots with dirty logging disabled if at least one other memslot has dirty logging enabled. However, spurious PML entries are already possible since dirty bits are set only when dirty logging is turned off, i.e. memslots that are never dirty logged will have dirty bits cleared. In the end, it's faster overall to eat a few spurious PML entries in the window where dirty logging is being disabled across all memslots. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:35 -05:00
Makarand Sonare	a85863c2ec	KVM: VMX: Dynamically enable/disable PML based on memslot dirty logging Currently, if enable_pml=1, PML remains enabled for the entire lifetime of the VM irrespective of whether dirty logging is enabled or disabled. When dirty logging is disabled, all the pages of the VM are manually marked dirty, so that PML is effectively non-operational. Setting the dirty bits is an expensive operation which can cause severe MMU lock contention in a performance-sensitive path when dirty logging is disabled after a failed or canceled live migration. Manually setting dirty bits also fails to prevent PML activity if some code path clears dirty bits, which can incur unnecessary VM-Exits. In order to avoid this extra overhead, dynamically enable/disable PML when dirty logging gets turned on/off for the first/last memslot. Signed-off-by: Makarand Sonare <makarandsonare@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:34 -05:00
Sean Christopherson	52f4607940	KVM: x86: Further clarify the logic and comments for toggling log dirty Add a sanity check in kvm_mmu_slot_apply_flags to assert that the LOG_DIRTY_PAGES flag is indeed being toggled, and explicitly rely on that holding true when zapping collapsible SPTEs. Manipulating the CPU dirty log (PML) and write-protection also relies on this assertion, but that's not obvious in the current code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:34 -05:00
Sean Christopherson	a018eba538	KVM: x86: Move MMU's PML logic to common code Drop the facade of KVM's PML logic being vendor specific and move the bits that aren't truly VMX specific into common x86 code. The MMU logic for dealing with PML is tightly coupled to the feature and to VMX's implementation, bouncing through kvm_x86_ops obfuscates the code without providing any meaningful separation of concerns or encapsulation. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:34 -05:00
Sean Christopherson	6dd03800b1	KVM: x86/mmu: Make dirty log size hook (PML) a value, not a function Store the vendor-specific dirty log size in a variable, there's no need to wrap it in a function since the value is constant after hardware_setup() runs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:33 -05:00
Sean Christopherson	2855f98265	KVM: x86/mmu: Expand on the comment in kvm_vcpu_ad_need_write_protect() Expand the comment about need to use write-protection for nested EPT when PML is enabled to clarify that the tagging is a nop when PML is _not_ enabled. Without the clarification, omitting the PML check looks wrong at first^Wfifth glance. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:33 -05:00
Sean Christopherson	c3bb9a2083	KVM: nVMX: Disable PML in hardware when running L2 Unconditionally disable PML in vmcs02, KVM emulates PML purely in the MMU, e.g. vmx_flush_pml_buffer() doesn't even try to copy the L2 GPAs from vmcs02's buffer to vmcs12. At best, enabling PML is a nop. At worst, it will cause vmx_flush_pml_buffer() to record bogus GFNs in the dirty logs. Initialize vmcs02.GUEST_PML_INDEX such that PML writes would trigger VM-Exit if PML was somehow enabled, skip flushing the buffer for guest mode since the index is bogus, and freak out if a PML full exit occurs when L2 is active. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:32 -05:00
Sean Christopherson	9eba50f8d7	KVM: x86/mmu: Consult max mapping level when zapping collapsible SPTEs When zapping SPTEs in order to rebuild them as huge pages, use the new helper that computes the max mapping level to detect whether or not a SPTE should be zapped. Doing so avoids zapping SPTEs that can't possibly be rebuilt as huge pages, e.g. due to hardware constraints, memslot alignment, etc... This also avoids zapping SPTEs that are still large, e.g. if migration was canceled before write-protected huge pages were shattered to enable dirty logging. Note, such pages are still write-protected at this time, i.e. a page fault VM-Exit will still occur. This will hopefully be addressed in a future patch. Sadly, TDP MMU loses its const on the memslot, but that's a pervasive problem that's been around for quite some time. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:28 -05:00
Sean Christopherson	0a234f5dd0	KVM: x86/mmu: Pass the memslot to the rmap callbacks Pass the memslot to the rmap callbacks, it will be used when zapping collapsible SPTEs to verify the memslot is compatible with hugepages before zapping its SPTEs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:12 -05:00
Sean Christopherson	1b6d9d9ed5	KVM: x86/mmu: Split out max mapping level calculation to helper Factor out the logic for determining the maximum mapping level given a memslot and a gpa. The helper will be used when zapping collapsible SPTEs when disabling dirty logging, e.g. to avoid zapping SPTEs that can't possibly be rebuilt as hugepages. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:08:11 -05:00
Sean Christopherson	c060c72ffe	KVM: x86/mmu: Expand collapsible SPTE zap for TDP MMU to ZONE_DEVICE and HugeTLB pages Zap SPTEs that are backed by ZONE_DEVICE pages when zapping SPTEs to rebuild them as huge pages in the TDP MMU. ZONE_DEVICE huge pages are managed differently than "regular" pages and are not compound pages. Likewise, PageTransCompoundMap() will not detect HugeTLB, so switch to PageCompound(). This matches the similar check in kvm_mmu_zap_collapsible_spte. Cc: Ben Gardon <bgardon@google.com> Fixes: `1488199856` ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210213005015.1651772-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-19 03:07:16 -05:00
Paolo Bonzini	78e550bad2	KVM: nVMX: no need to undo inject_page_fault change on nested vmexit This is not needed because the tweak was done on the guest_mmu, while nested_ept_uninit_mmu_context has just changed vcpu->arch.walk_mmu back to the root_mmu. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-18 07:33:31 -05:00
Paolo Bonzini	a04aead144	KVM: nSVM: fix running nested guests when npt=0 In case of npt=0 on host, nSVM needs the same .inject_page_fault tweak as VMX has, to make sure that shadow mmu faults are injected as vmexits. It is not clear why this is needed at all, but for now keep the same code as VMX and we'll fix it for both. Based on a patch by Maxim Levitsky <mlevitsk@redhat.com>. Fixes: `7c86663b68` ("KVM: nSVM: inject exceptions via svm_check_nested_events") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-18 07:33:30 -05:00
Maxim Levitsky	954f419ba8	KVM: nSVM: move nested vmrun tracepoint to enter_svm_guest_mode This way trace will capture all the nested mode entries (including entries after migration, and from smm) Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210217145718.1217358-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-18 07:33:30 -05:00
Maxim Levitsky	f5c59b575b	KVM: VMX: read idt_vectoring_info a bit earlier trace_kvm_exit prints this value (using vmx_get_exit_info) so it makes sense to read it before the trace point. Fixes: `dcf068da7e` ("KVM: VMX: Introduce generic fastpath handler") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210217145718.1217358-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-18 07:33:29 -05:00
Sean Christopherson	1aaca37e1e	KVM: VMX: Allow INVPCID in guest without PCID Remove the restriction that prevents VMX from exposing INVPCID to the guest without PCID also being exposed to the guest. The justification of the restriction is that INVPCID will #UD if it's disabled in the VMCS. While that is a true statement, it's also true that RDTSCP will #UD if it's disabled in the VMCS. Neither of those things has any dependency whatsoever on the guest being able to set CR4.PCIDE=1, which is what is effectively allowed by exposing PCID to the guest. Removing the bogus restriction aligns VMX with SVM, and also allows for an interesting configuration. INVPCID is the fastest way to do a global TLB flush, e.g. see native_flush_tlb_global(). Allowing INVPCID without PCID would let a guest use the expedited flush while also limiting the number of ASIDs consumed by the guest. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210212003411.1102677-4-seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-18 07:33:29 -05:00
Sean Christopherson	e420333422	KVM: x86: Advertise INVPCID by default Advertise INVPCID by default (if supported by the host kernel) instead of having both SVM and VMX opt in. INVPCID was opt in when it was a VMX only feature so that KVM wouldn't prematurely advertise support if/when it showed up in the kernel on AMD hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210212003411.1102677-3-seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-18 07:33:29 -05:00
Sean Christopherson	0a8ed2eaac	KVM: SVM: Intercept INVPCID when it's disabled to inject #UD Intercept INVPCID if it's disabled in the guest, even when using NPT, as KVM needs to inject #UD in this case. Fixes: `4407a797e9` ("KVM: SVM: Enable INVPCID feature on AMD") Cc: Babu Moger <babu.moger@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210212003411.1102677-2-seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-18 07:33:28 -05:00
Paolo Bonzini	8c6e67bec3	Merge tag 'kvmarm-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 5.12 - Make the nVHE EL2 object relocatable, resulting in much more maintainable code - Handle concurrent translation faults hitting the same page in a more elegant way - Support for the standard TRNG hypervisor call - A bunch of small PMU/Debug fixes - Allow the disabling of symbol export from assembly code - Simplification of the early init hypercall handling	2021-02-12 11:23:44 -05:00
Sean Christopherson	7137b7ae6f	KVM: x86/xen: Explicitly pad struct compat_vcpu_info to 64 bytes Add a 2 byte pad to struct compat_vcpu_info so that the sum size of its fields is actually 64 bytes. The effective size without the padding is also 64 bytes due to the compiler aligning evtchn_pending_sel to a 4-byte boundary, but depending on compiler alignment is subtle and unnecessary. Opportunistically replace spaces with tabs in the other fields. Cc: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210210182609.435200-6-seanjc@google.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-11 08:03:02 -05:00
Wei Yongjun	2e215216d6	KVM: SVM: Make symbol 'svm_gp_erratum_intercept' static The sparse tool complains as follows: arch/x86/kvm/svm/svm.c:204:6: warning: symbol 'svm_gp_erratum_intercept' was not declared. Should it be static? This symbol is not used outside of svm.c, so this commit marks it static. Fixes: `82a11e9c6f` ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Message-Id: <20210210075958.1096317-1-weiyongjun1@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-11 08:02:08 -05:00
David Woodhouse	0c165b3c01	KVM: x86/xen: Allow reset of Xen attributes In order to support Xen SHUTDOWN_soft_reset (for guest kexec, etc.) the VMM needs to be able to tear everything down and return the Xen features to a clean slate. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20210208232326.1830370-1-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-09 08:42:10 -05:00
Maciej S. Szmigiero	8f5c44f953	KVM: x86/mmu: Make HVA handler retpoline-friendly When retpolines are enabled they have high overhead in the inner loop inside kvm_handle_hva_range() that iterates over the provided memory area. Let's mark this function and its TDP MMU equivalent __always_inline so compiler will be able to change the call to the actual handler function inside each of them into a direct one. This significantly improves performance on the unmap test on the existing kernel memslot code (tested on a Xeon 8167M machine): 30 slots in use: Test Before After Improvement Unmap 0.0353s 0.0334s 5% Unmap 2M 0.00104s 0.000407s 61% 509 slots in use: Test Before After Improvement Unmap 0.0742s 0.0740s None Unmap 2M 0.00221s 0.00159s 28% Looks like having an indirect call in these functions (and, so, a retpoline) might have interfered with unrolling of the whole loop in the CPU. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <732d3fe9eb68aa08402a638ab0309199fa89ae56.1612810129.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-09 08:42:09 -05:00
Vitaly Kuznetsov	b9ce0f86d9	KVM: x86: hyper-v: Drop hv_vcpu_to_vcpu() helper hv_vcpu_to_vcpu() helper is only used by other helpers and is not very complex, we can drop it without much regret. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-16-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-09 08:42:09 -05:00
Vitaly Kuznetsov	fc08b628d7	KVM: x86: hyper-v: Allocate Hyper-V context lazily Hyper-V context is only needed for guests which use Hyper-V emulation in KVM (e.g. Windows/Hyper-V guests) so we don't actually need to allocate it in kvm_arch_vcpu_create(), we can postpone the action until Hyper-V specific MSRs are accessed or SynIC is enabled. Once allocated, let's keep the context alive for the lifetime of the vCPU as an attempt to free it would require additional synchronization with other vCPUs and normally it is not supposed to happen. Note, Hyper-V style hypercall enablement is done by writing to HV_X64_MSR_GUEST_OS_ID so we don't need to worry about allocating Hyper-V context from kvm_hv_hypercall(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-15-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-09 08:40:50 -05:00
Vitaly Kuznetsov	8f014550df	KVM: x86: hyper-v: Make Hyper-V emulation enablement conditional Hyper-V emulation is enabled in KVM unconditionally. This is bad at least from a security standpoint as it is an extra attack surface. Ideally, there should be a per-VM capability explicitly enabled by VMM but currently it is not the case and we can't mandate one without breaking backwards compatibility. We can, however, check guest visible CPUIDs and only enable Hyper-V emulation when "Hv#1" interface was exposed in HYPERV_CPUID_INTERFACE. Note, VMMs are free to act in any sequence they like, e.g. they can try to set MSRs first and CPUIDs later so we still need to allow the host to read/write Hyper-V specific MSRs unconditionally. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-14-vkuznets@redhat.com> [Add selftest vcpu_set_hv_cpuid API to avoid breaking xen_vmcall_test. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-09 08:39:56 -05:00
Vitaly Kuznetsov	4592b7eaa8	KVM: x86: hyper-v: Allocate 'struct kvm_vcpu_hv' dynamically Hyper-V context is only needed for guests which use Hyper-V emulation in KVM (e.g. Windows/Hyper-V guests). 'struct kvm_vcpu_hv' is, however, quite big, it accounts for more than 1/4 of the total 'struct kvm_vcpu_arch' which is also quite big already. This all looks like a waste. Allocate 'struct kvm_vcpu_hv' dynamically. This patch does not bring any (intentional) functional change as we still allocate the context unconditionally but it paves the way to doing that only when needed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-13-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-09 08:17:15 -05:00
Vitaly Kuznetsov	f2bc14b69c	KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context Currently, Hyper-V context is part of 'struct kvm_vcpu_arch' and is always available. As a preparation to allocating it dynamically, check that it is not NULL at call sites which can normally proceed without it i.e. the behavior is identical to the situation when Hyper-V emulation is not being used by the guest. When Hyper-V context for a particular vCPU is not allocated, we may still need to get 'vp_index' from there. E.g. in a hypothetical situation when Hyper-V emulation was enabled on one CPU and wasn't on another, Hyper-V style send-IPI hypercall may still be used. Luckily, vp_index is always initialized to kvm_vcpu_get_idx() and can only be changed when Hyper-V context is present. Introduce kvm_hv_get_vpindex() helper for simplification. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-12-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-09 08:17:14 -05:00
Vitaly Kuznetsov	9ff5e0304e	KVM: x86: hyper-v: Always use to_hv_vcpu() accessor to get to 'struct kvm_vcpu_hv' As a preparation to allocating Hyper-V context dynamically, make it clear who's the user of the said context. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-11-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-02-09 08:17:13 -05:00


37460 Commits