Commit Graph

915877 Commits

Author SHA1 Message Date
Sean Christopherson 6ce347af14 KVM: nVMX: Preserve exception priority irrespective of exiting behavior
Short circuit vmx_check_nested_events() if an exception is pending and
needs to be injected into L2, priority between coincident events is not
dependent on exiting behavior.  This fixes a bug where a single-step #DB
that is not intercepted by L1 is incorrectly dropped due to servicing a
VMX Preemption Timer VM-Exit.

Injected exceptions also need to be blocked if nested VM-Enter is
pending or an exception was already injected, otherwise injecting the
exception could overwrite an existing event injection from L1.
Technically, this scenario should be impossible, i.e. KVM shouldn't
inject its own exception during nested VM-Enter.  This will be addressed
in a future patch.

Note, event priority between SMI, NMI and INTR is incorrect for L2, e.g.
SMI should take priority over VM-Exit on NMI/INTR, and NMI that is
injected into L2 should take priority over VM-Exit INTR.  This will also
be addressed in a future patch.

Fixes: b6b8a1451f ("KVM: nVMX: Rework interception of IRQs and NMIs")
Reported-by: Jim Mattson <jmattson@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Peter Shier <pshier@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:25 -04:00
Cathy Avery 9c3d370a8e KVM: SVM: Implement check_nested_events for NMI
Migrate nested guest NMI intercept processing
to new check_nested_events.

Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20200414201107.22952-2-cavery@redhat.com>
[Reorder clauses as NMIs have higher priority than IRQs; inject
 immediate vmexit as is now done for IRQ vmexits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:24 -04:00
Paolo Bonzini 6e085cbfb0 KVM: SVM: immediately inject INTR vmexit
We can immediately leave SVM guest mode in svm_check_nested_events
now that we have the nested_run_pending mechanism.  This makes
things easier because we can run the rest of inject_pending_event
with GIF=0, and KVM will naturally end up requesting the next
interrupt window.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:23 -04:00
Paolo Bonzini 38c0b192bd KVM: SVM: leave halted state on vmexit
Similar to VMX, we need to leave the halted state when performing a vmexit.
Failure to do so will cause a hang after vmexit.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:22 -04:00
Paolo Bonzini f74f94140f KVM: SVM: introduce nested_run_pending
We want to inject vmexits immediately from svm_check_nested_events,
so that the interrupt/NMI window requests happen in inject_pending_event
right after it returns.

This however has the same issue as in vmx_check_nested_events, so
introduce a nested_run_pending flag with the exact same purpose
of delaying vmexit injection after the vmentry.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:21 -04:00
Paolo Bonzini 4aef2ec902 Merge branch 'kvm-amd-fixes' into HEAD 2020-05-13 12:14:05 -04:00
Babu Moger 37486135d3 KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c
Though rdpkru and wrpkru are contingent upon CR4.PKE, the PKRU
resource isn't. It can be read with XSAVE and written with XRSTOR.
So, if we don't set the guest PKRU value here(kvm_load_guest_xsave_state),
the guest can read the host value.

In case of kvm_load_host_xsave_state, guest with CR4.PKE clear could
potentially use XRSTOR to change the host PKRU value.

While at it, move pkru state save/restore to common code and the
host_pkru field to kvm_vcpu_arch.  This will let SVM support protection keys.

Cc: stable@vger.kernel.org
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <158932794619.44260.14508381096663848853.stgit@naples-babu.amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 11:27:41 -04:00
Suravee Suthikulpanit 7d611233b0 KVM: SVM: Disable AVIC before setting V_IRQ
The commit 64b5bd2704 ("KVM: nSVM: ignore L1 interrupt window
while running L2 with V_INTR_MASKING=1") introduced a WARN_ON,
which checks if AVIC is enabled when trying to set V_IRQ
in the VMCB for enabling irq window.

The following warning is triggered because the requesting vcpu
(to deactivate AVIC) does not get to process APICv update request
for itself until the next #vmexit.

WARNING: CPU: 0 PID: 118232 at arch/x86/kvm/svm/svm.c:1372 enable_irq_window+0x6a/0xa0 [kvm_amd]
 RIP: 0010:enable_irq_window+0x6a/0xa0 [kvm_amd]
 Call Trace:
  kvm_arch_vcpu_ioctl_run+0x6e3/0x1b50 [kvm]
  ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
  ? _copy_to_user+0x26/0x30
  ? kvm_vm_ioctl+0xb3e/0xd90 [kvm]
  ? set_next_entity+0x78/0xc0
  kvm_vcpu_ioctl+0x236/0x610 [kvm]
  ksys_ioctl+0x8a/0xc0
  __x64_sys_ioctl+0x1a/0x20
  do_syscall_64+0x58/0x210
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes by sending APICV update request to all other vcpus, and
immediately update APIC for itself.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lkml.org/lkml/2020/5/2/167
Fixes: 64b5bd2704 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1")
Message-Id: <1588818939-54264-1-git-send-email-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08 07:44:32 -04:00
Suravee Suthikulpanit 54163a346d KVM: Introduce kvm_make_all_cpus_request_except()
This allows making request to all other vcpus except the one
specified in the parameter.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <1588771076-73790-2-git-send-email-suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08 07:44:32 -04:00
Paolo Bonzini 45981dedf5 KVM: VMX: pass correct DR6 for GD userspace exit
When KVM_EXIT_DEBUG is raised for the disabled-breakpoints case (DR7.GD),
DR6 was incorrectly copied from the value in the VM.  Instead,
DR6.BD should be set in order to catch this case.

On AMD this does not need any special code because the processor triggers
a #DB exception that is intercepted.  However, the testcase would fail
without the previous patch because both DR6.BS and DR6.BD would be set.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08 07:44:31 -04:00
Paolo Bonzini d67668e9dd KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6
There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the
different handling of DR6 on intercepted #DB exceptions on Intel and AMD.

On Intel, #DB exceptions transmit the DR6 value via the exit qualification
field of the VMCS, and the exit qualification only contains the description
of the precise event that caused a vmexit.

On AMD, instead the DR6 field of the VMCB is filled in as if the #DB exception
was to be injected into the guest.  This has two effects when guest debugging
is in use:

* the guest DR6 is clobbered

* the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather
than just the last one that happened (the testcase in the next patch covers
this issue).

This patch fixes both issues by emulating, so to speak, the Intel behavior
on AMD processors.  The important observation is that (after the previous
patches) the VMCB value of DR6 is only ever observable from the guest is
KVM_DEBUGREG_WONT_EXIT is set.  Therefore we can actually set vmcb->save.dr6
to any value we want as long as KVM_DEBUGREG_WONT_EXIT is clear, which it
will be if guest debugging is enabled.

Therefore it is possible to enter the guest with an all-zero DR6,
reconstruct the #DB payload from the DR6 we get at exit time, and let
kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6.
Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT
is set, but this is harmless.

This may not be the most optimized way to deal with this, but it is
simple and, being confined within SVM code, it gets rid of the set_dr6
callback and kvm_update_dr6.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08 07:44:31 -04:00
Paolo Bonzini 5679b803e4 KVM: SVM: keep DR6 synchronized with vcpu->arch.dr6
kvm_x86_ops.set_dr6 is only ever called with vcpu->arch.dr6 as the
second argument.  Ensure that the VMCB value is synchronized to
vcpu->arch.dr6 on #DB (both "normal" and nested) and nested vmentry, so
that the current value of DR6 is always available in vcpu->arch.dr6.
The get_dr6 callback can just access vcpu->arch.dr6 and becomes redundant.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08 07:43:47 -04:00
Paolo Bonzini 2c19dba680 KVM: nSVM: trap #DB and #BP to userspace if guest debugging is on
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07 07:45:16 -04:00
Peter Xu 449aa906e6 KVM: selftests: Add KVM_SET_GUEST_DEBUG test
Covers fundamental tests for KVM_SET_GUEST_DEBUG. It is very close to the debug
test in kvm-unit-test, but doing it from outside the guest.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200505205000.188252-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07 06:13:42 -04:00
Peter Xu d5d260c5ff KVM: X86: Fix single-step with KVM_SET_GUEST_DEBUG
When single-step triggered with KVM_SET_GUEST_DEBUG, we should fill in the pc
value with current linear RIP rather than the cached singlestep address.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200505205000.188252-3-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07 06:13:41 -04:00
Peter Xu 13196638d5 KVM: X86: Set RTM for DB_VECTOR too for KVM_EXIT_DEBUG
RTM should always been set even with KVM_EXIT_DEBUG on #DB.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200505205000.188252-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07 06:13:41 -04:00
Paolo Bonzini 4d5523cfd5 KVM: x86: fix DR6 delivery for various cases of #DB injection
Go through kvm_queue_exception_p so that the payload is correctly delivered
through the exit qualification, and add a kvm_update_dr6 call to
kvm_deliver_exception_payload that is needed on AMD.

Reported-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07 06:13:41 -04:00
Peter Xu b9b2782cd5 KVM: X86: Declare KVM_CAP_SET_GUEST_DEBUG properly
KVM_CAP_SET_GUEST_DEBUG should be supported for x86 however it's not declared
as supported.  My wild guess is that userspaces like QEMU are using "#ifdef
KVM_CAP_SET_GUEST_DEBUG" to check for the capability instead, but that could be
wrong because the compilation host may not be the runtime host.

The userspace might still want to keep the old "#ifdef" though to not break the
guest debug on old kernels.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200505154750.126300-1-peterx@redhat.com>
[Do the same for PPC and s390. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07 06:13:40 -04:00
Peter Xu 8ffdaf9155 KVM: selftests: Fix build for evmcs.h
I got this error when building kvm selftests:

/usr/bin/ld: /home/xz/git/linux/tools/testing/selftests/kvm/libkvm.a(vmx.o):/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:222: multiple definition of `current_evmcs'; /tmp/cco1G48P.o:/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:222: first defined here
/usr/bin/ld: /home/xz/git/linux/tools/testing/selftests/kvm/libkvm.a(vmx.o):/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:223: multiple definition of `current_vp_assist'; /tmp/cco1G48P.o:/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:223: first defined here

I think it's because evmcs.h is included both in a test file and a lib file so
the structs have multiple declarations when linking.  After all it's not a good
habit to declare structs in the header files.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200504220607.99627-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-06 06:51:36 -04:00
Paolo Bonzini 139f7425fd kvm: x86: Use KVM CPU capabilities to determine CR4 reserved bits
Using CPUID data can be useful for the processor compatibility
check, but that's it.  Using it to compute guest-reserved bits
can have both false positives (such as LA57 and UMIP which we
are already handling) and false negatives: in particular, with
this patch we don't allow anymore a KVM guest to set CR4.PKE
when CR4.PKE is clear on the host.

Fixes: b9dd21e104 ("KVM: x86: simplify handling of PKRU")
Reported-by: Jim Mattson <jmattson@google.com>
Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-06 06:51:36 -04:00
Sean Christopherson c7cb2d650c KVM: VMX: Explicitly clear RFLAGS.CF and RFLAGS.ZF in VM-Exit RSB path
Clear CF and ZF in the VM-Exit path after doing __FILL_RETURN_BUFFER so
that KVM doesn't interpret clobbered RFLAGS as a VM-Fail.  Filling the
RSB has always clobbered RFLAGS, its current incarnation just happens
clear CF and ZF in the processs.  Relying on the macro to clear CF and
ZF is extremely fragile, e.g. commit 089dd8e531 ("x86/speculation:
Change FILL_RETURN_BUFFER to work with objtool") tweaks the loop such
that the ZF flag is always set.

Reported-by: Qian Cai <cai@lca.pw>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Fixes: f2fde6a5bc ("KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200506035355.2242-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-06 06:51:35 -04:00
Kashyap Chamarthy 27abe57770 docs/virt/kvm: Document configuring and running nested guests
This is a rewrite of this[1] Wiki page with further enhancements.  The
doc also includes a section on debugging problems in nested
environments, among other improvements.

[1] https://www.linux-kvm.org/page/Nested_Guests

Signed-off-by: Kashyap Chamarthy <kchamart@redhat.com>
Message-Id: <20200505112839.30534-1-kchamart@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-06 05:45:47 -04:00
Paolo Bonzini 8be8f932e3 kvm: ioapic: Restrict lazy EOI update to edge-triggered interrupts
Commit f458d039db ("kvm: ioapic: Lazy update IOAPIC EOI") introduces
the following infinite loop:

BUG: stack guard page was hit at 000000008f595917 \
(stack is 00000000bdefe5a4..00000000ae2b06f5)
kernel stack overflow (double-fault): 0000 [#1] SMP NOPTI
RIP: 0010:kvm_set_irq+0x51/0x160 [kvm]
Call Trace:
 irqfd_resampler_ack+0x32/0x90 [kvm]
 kvm_notify_acked_irq+0x62/0xd0 [kvm]
 kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm]
 ioapic_set_irq+0x20e/0x240 [kvm]
 kvm_ioapic_set_irq+0x5c/0x80 [kvm]
 kvm_set_irq+0xbb/0x160 [kvm]
 ? kvm_hv_set_sint+0x20/0x20 [kvm]
 irqfd_resampler_ack+0x32/0x90 [kvm]
 kvm_notify_acked_irq+0x62/0xd0 [kvm]
 kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm]
 ioapic_set_irq+0x20e/0x240 [kvm]
 kvm_ioapic_set_irq+0x5c/0x80 [kvm]
 kvm_set_irq+0xbb/0x160 [kvm]
 ? kvm_hv_set_sint+0x20/0x20 [kvm]
....

The re-entrancy happens because the irq state is the OR of
the interrupt state and the resamplefd state.  That is, we don't
want to show the state as 0 until we've had a chance to set the
resamplefd.  But if the interrupt has _not_ gone low then
ioapic_set_irq is invoked again, causing an infinite loop.

This can only happen for a level-triggered interrupt, otherwise
irqfd_inject would immediately set the KVM_USERSPACE_IRQ_SOURCE_ID high
and then low.  Fortunately, in the case of level-triggered interrupts the VMEXIT already happens because
TMR is set.  Thus, fix the bug by restricting the lazy invocation
of the ack notifier to edge-triggered interrupts, the only ones that
need it.

Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reported-by: borisvk@bstnet.org
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://www.spinics.net/lists/kvm/msg213512.html
Fixes: f458d039db ("kvm: ioapic: Lazy update IOAPIC EOI")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207489
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-04 12:29:05 -04:00
Suravee Suthikulpanit 637543a8d6 KVM: x86: Fixes posted interrupt check for IRQs delivery modes
Current logic incorrectly uses the enum ioapic_irq_destination_types
to check the posted interrupt destination types. However, the value was
set using APIC_DM_XXX macros, which are left-shifted by 8 bits.

Fixes by using the APIC_DM_FIXED and APIC_DM_LOWEST instead.

Fixes: (fdcf756213 'KVM: x86: Disable posted interrupts for non-standard IRQs delivery modes')
Cc: Alexander Graf <graf@amazon.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <1586239989-58305-1-git-send-email-suravee.suthikulpanit@amd.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-04 12:16:51 -04:00
Paolo Bonzini 7134fa0709 KVM/arm fixes for Linux 5.7, take #2
- Fix compilation with Clang
 - Correctly initialize GICv4.1 in the absence of a virtual ITS
 - Move SP_EL0 save/restore to the guest entry/exit code
 - Handle PC wrap around on 32bit guests, and narrow all 32bit
   registers on userspace access
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl6r7LMPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDOioQAI3MzfQ/sGaJN/83ZLOdKdqvSmRwJrwoHH/K
 qX0HgDky3/OMPD+uIlYpo5f1RLM2R/pDj6rhpg8IcWhfVXWEZZHU9Z8xqc3o8Hpo
 Hp4if0pe9+6iaUPuGyzP0Di5Dj+6eNglHoSsvyeeGsH1b7YzE812wN0VnGHB7+T5
 /lEMfCSDWmtMa63FvcX9oxqKCWr1pjpUJ46u0D2uszcbYpIPXm4AMZgX0ZxnlreT
 IPQ6uvG7bBeTjrkucScwqoH8L2/xBP2y6D2HoC7ANmvn4Wv8neJNYh0LQt0zgsTI
 DTNwy2E1R27lxtQtp9Y05itA1N1qkj6hRowgEWgtMtlLQyz0PUT+xFHl+T1iBQjz
 zcEoL49/A4x01fw6JVqDraItEBW6g8fjnJul/FZ7K6Psncxz9oRjSSz+sSVLgn/W
 wthA2ChVlGVzpQsfByVmARTFew65Ls/rm1h9TzZcMWZsEdQRLi5NtyFkLBq2aMMz
 D15//aFQf7jmiSv+uVALZcnU1dBxqqzGBY8pwSrNSv4LsZAcDOsKRpgoe3zFVj48
 rzbUOWXthEpXo4RipOoEeNavuFwetwcCKlyO5hnvUhlR5Yc0ofQiWKZE5vZ6yGm4
 cg2CUMBy7Mjcg+80vo5qnRS5E6S+xQHgBnzwau0DOTIZDerKjH69gsn8JxiRNRbo
 Ix9uMPY8
 =455e
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master

KVM/arm fixes for Linux 5.7, take #2

- Fix compilation with Clang
- Correctly initialize GICv4.1 in the absence of a virtual ITS
- Move SP_EL0 save/restore to the guest entry/exit code
- Handle PC wrap around on 32bit guests, and narrow all 32bit
  registers on userspace access
2020-05-04 12:01:37 -04:00
Paolo Bonzini 9e5e19f585 KVM/arm fixes for Linux 5.7, take #1
- Prevent the userspace API from interacting directly with the HW
   stage of the virtual GIC
 - Fix a couple of vGIC memory leaks
 - Tighten the rules around the use of the 32bit PSCI functions
   for 64bit guest, as well as the opposite situation (matches the
   specification)
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl6htMcPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDR2AP/jjTgHDfwEFhODZ6Wi2tF8r6tnkvn4U1TOoN
 5WXulhn5eCCGuoTLWocX3/6Qh2kVY97KaXLhB6VUFD6RZDwn6BQ9K7QR/sAVAv7i
 GRILq/sbXWIz3sb+JiNK/T/6nZ/BxBTmDEE+ZovOvq3RCPqjCoYkHaqZ4pgIO7Ty
 UDtfbLtczhgpZfGgZJkvU7k+fHmWITDN/XHB+tJoCYzinowO7AU315D+UOh/1df3
 9MfTw7IyrypMhm59g3TeTo4ca1H3dLU0lyNQbNbSV6ttDs7ZZgxXO0ZoNyHMEYaZ
 ZO7r4Tf6hCekrrhoG8gSg3GiBJ//UvvlireoEvAaCxQquoBAq+RseL+upVutwyPq
 I00ShagbL0jRV48MbwayJ7ctzhohecRGrCpPIRCJm7WEg7lH7QmWD9xEoFlUNka5
 Nc6hpob0zXPBIK+k92Czl2XNllInwhthPkvRb/nQs/0Z6dZE73eAltYv7eaggiuA
 Ud4+1aRAxU1oXvHQGnHWCO9pkkLpjeB5Xnh90tmZyvfx+ZdL/ZOHvlKzL9+8Iyx0
 HGmf/R72XhKyBQ1/CMpS1HAG4uDS2Cw6L9ullpzLaKD37FSzK/ZXZsTXdGG1IG5Q
 ywipbO8tPRp/D5oML9DwFcep5nVf/Y3EViMzzrznlNq1SwK2ekM2t717U5ODtTdt
 KGtpPXCi
 =fh9d
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master

KVM/arm fixes for Linux 5.7, take #1

- Prevent the userspace API from interacting directly with the HW
  stage of the virtual GIC
- Fix a couple of vGIC memory leaks
- Tighten the rules around the use of the 32bit PSCI functions
  for 64bit guest, as well as the opposite situation (matches the
  specification)
2020-05-04 12:01:12 -04:00
Paolo Bonzini dee919d15d KVM: SVM: fill in kvm_run->debug.arch.dr[67]
The corresponding code was added for VMX in commit 42dbaa5a05
("KVM: x86: Virtualize debug registers, 2008-12-15) but never for AMD.
Fix this.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-04 11:59:03 -04:00
Sean Christopherson f9336e3281 KVM: nVMX: Replace a BUG_ON(1) with BUG() to squash clang warning
Use BUG() in the impossible-to-hit default case when switching on the
scope of INVEPT to squash a warning with clang 11 due to clang treating
the BUG_ON() as conditional.

  >> arch/x86/kvm/vmx/nested.c:5246:3: warning: variable 'roots_to_free'
     is used uninitialized whenever 'if' condition is false
     [-Wsometimes-uninitialized]
                   BUG_ON(1);

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: ce8fe7b77b ("KVM: nVMX: Free only the affected contexts when emulating INVEPT")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200504153506.28898-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-04 11:58:55 -04:00
Marc Zyngier 0225fd5e0a KVM: arm64: Fix 32bit PC wrap-around
In the unlikely event that a 32bit vcpu traps into the hypervisor
on an instruction that is located right at the end of the 32bit
range, the emulation of that instruction is going to increment
PC past the 32bit range. This isn't great, as userspace can then
observe this value and get a bit confused.

Conversly, userspace can do things like (in the context of a 64bit
guest that is capable of 32bit EL0) setting PSTATE to AArch64-EL0,
set PC to a 64bit value, change PSTATE to AArch32-USR, and observe
that PC hasn't been truncated. More confusion.

Fix both by:
- truncating PC increments for 32bit guests
- sanitizing all 32bit regs every time a core reg is changed by
  userspace, and that PSTATE indicates a 32bit mode.

Cc: stable@vger.kernel.org
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-05-01 09:51:08 +01:00
Marc Zyngier 958e8e14fd KVM: arm64: vgic-v4: Initialize GICv4.1 even in the absence of a virtual ITS
KVM now expects to be able to use HW-accelerated delivery of vSGIs
as soon as the guest has enabled thm. Unfortunately, we only
initialize the GICv4 context if we have a virtual ITS exposed to
the guest.

Fix it by always initializing the GICv4.1 context if it is
available on the host.

Fixes: 2291ff2f2a ("KVM: arm64: GICv4.1: Plumb SGI implementation selection in the distributor")
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-04-30 12:50:23 +01:00
Marc Zyngier 6e977984f6 KVM: arm64: Save/restore sp_el0 as part of __guest_enter
We currently save/restore sp_el0 in C code. This is a bit unsafe,
as a lot of the C code expects 'current' to be accessible from
there (and the opportunity to run kernel code in HYP is specially
great with VHE).

Instead, let's move the save/restore of sp_el0 to the assembly
code (in __guest_enter), making sure that sp_el0 is correct
very early on when we exit the guest, and is preserved as long
as possible to its host value when we enter the guest.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-04-30 12:01:35 +01:00
Fangrui Song 6aea9e0503 KVM: arm64: Delete duplicated label in invalid_vector
SYM_CODE_START defines \label , so it is redundant to define \label again.
A redefinition at the same place is accepted by GNU as
(https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=159fbb6088f17a341bcaaac960623cab881b4981)
but rejected by the clang integrated assembler.

Fixes: 617a2f392c ("arm64: kvm: Annotate assembly using modern annoations")
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/988
Link: https://lore.kernel.org/r/20200413231016.250737-1-maskray@google.com
2020-04-30 11:17:00 +01:00
Paolo Bonzini 7c67f54661 KVM: SVM: do not allow VMRUN inside SMM
VMRUN is not supported inside the SMM handler and the behavior is undefined.
Just raise a #UD.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-24 12:53:18 -04:00
David Matlack acd05785e4 kvm: add capability for halt polling
KVM_CAP_HALT_POLL is a per-VM capability that lets userspace
control the halt-polling time, allowing halt-polling to be tuned or
disabled on particular VMs.

With dynamic halt-polling, a VM's VCPUs can poll from anywhere from
[0, halt_poll_ns] on each halt. KVM_CAP_HALT_POLL sets the
upper limit on the poll time.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Jon Cargille <jcargill@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200417221446.108733-1-jcargill@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-24 12:53:17 -04:00
Sean Christopherson 87796555d4 KVM: nVMX: Store vmcs.EXIT_QUALIFICATION as an unsigned long, not u32
Use an unsigned long for 'exit_qual' in nested_vmx_reflect_vmexit(), the
EXIT_QUALIFICATION field is naturally sized, not a 32-bit field.

The bug is most easily observed by doing VMXON (or any VMX instruction)
in L2 with a negative displacement, in which case dropping the upper
bits on nested VM-Exit results in L1 calculating the wrong virtual
address for the memory operand, e.g. "vmxon -0x8(%rbp)" yields:

  Unhandled cpu exception 14 #PF at ip 0000000000400553
  rbp=0000000000537000 cr2=0000000100536ff8

Fixes: fbdd502503 ("KVM: nVMX: Move VM-Fail check out of nested_vmx_exit_reflected()")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423001127.13490-1-sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-24 12:51:21 -04:00
Sean Christopherson 9bd4af240f KVM: nVMX: Drop a redundant call to vmx_get_intr_info()
Drop nested_vmx_l1_wants_exit()'s initialization of intr_info from
vmx_get_intr_info() that was inadvertantly introduced along with the
caching mechanism.  EXIT_REASON_EXCEPTION_NMI, the only consumer of
intr_info, populates the variable before using it.

Fixes: bb53120d67cd ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200421075328.14458-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23 18:24:28 -04:00
Marc Zyngier 446c0768f5 Merge branch 'kvm-arm64/vgic-fixes-5.7' into kvmarm-master/master 2020-04-23 16:27:33 +01:00
Marc Zyngier 66f63474da Merge branch 'kvm-arm64/psci-fixes-5.7' into kvmarm-master/master 2020-04-23 16:27:26 +01:00
Zenghui Yu 57bdb436ce KVM: arm64: vgic-its: Fix memory leak on the error path of vgic_add_lpi()
If we're going to fail out the vgic_add_lpi(), let's make sure the
allocated vgic_irq memory is also freed. Though it seems that both
cases are unlikely to fail.

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200414030349.625-3-yuzenghui@huawei.com
2020-04-23 16:26:56 +01:00
Zenghui Yu 969ce8b526 KVM: arm64: vgic-v3: Retire all pending LPIs on vcpu destroy
It's likely that the vcpu fails to handle all virtual interrupts if
userspace decides to destroy it, leaving the pending ones stay in the
ap_list. If the un-handled one is a LPI, its vgic_irq structure will
be eventually leaked because of an extra refcount increment in
vgic_queue_irq_unlock().

This was detected by kmemleak on almost every guest destroy, the
backtrace is as follows:

unreferenced object 0xffff80725aed5500 (size 128):
comm "CPU 5/KVM", pid 40711, jiffies 4298024754 (age 166366.512s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 08 01 a9 73 6d 80 ff ff ...........sm...
c8 61 ee a9 00 20 ff ff 28 1e 55 81 6c 80 ff ff .a... ..(.U.l...
backtrace:
[<000000004bcaa122>] kmem_cache_alloc_trace+0x2dc/0x418
[<0000000069c7dabb>] vgic_add_lpi+0x88/0x418
[<00000000bfefd5c5>] vgic_its_cmd_handle_mapi+0x4dc/0x588
[<00000000cf993975>] vgic_its_process_commands.part.5+0x484/0x1198
[<000000004bd3f8e3>] vgic_its_process_commands+0x50/0x80
[<00000000b9a65b2b>] vgic_mmio_write_its_cwriter+0xac/0x108
[<0000000009641ebb>] dispatch_mmio_write+0xd0/0x188
[<000000008f79d288>] __kvm_io_bus_write+0x134/0x240
[<00000000882f39ac>] kvm_io_bus_write+0xe0/0x150
[<0000000078197602>] io_mem_abort+0x484/0x7b8
[<0000000060954e3c>] kvm_handle_guest_abort+0x4cc/0xa58
[<00000000e0d0cd65>] handle_exit+0x24c/0x770
[<00000000b44a7fad>] kvm_arch_vcpu_ioctl_run+0x460/0x1988
[<0000000025fb897c>] kvm_vcpu_ioctl+0x4f8/0xee0
[<000000003271e317>] do_vfs_ioctl+0x160/0xcd8
[<00000000e7f39607>] ksys_ioctl+0x98/0xd8

Fix it by retiring all pending LPIs in the ap_list on the destroy path.

p.s. I can also reproduce it on a normal guest shutdown. It is because
userspace still send LPIs to vcpu (through KVM_SIGNAL_MSI ioctl) while
the guest is being shutdown and unable to handle it. A little strange
though and haven't dig further...

Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
[maz: moved the distributor deallocation down to avoid an UAF splat]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200414030349.625-2-yuzenghui@huawei.com
2020-04-23 16:26:56 +01:00
Marc Zyngier ba1ed9e17b KVM: arm: vgic-v2: Only use the virtual state when userspace accesses pending bits
There is no point in accessing the HW when writing to any of the
ISPENDR/ICPENDR registers from userspace, as only the guest should
be allowed to change the HW state.

Introduce new userspace-specific accessors that deal solely with
the virtual state. Note that the API differs from that of GICv3,
where userspace exclusively uses ISPENDR to set the state. Too
bad we can't reuse it.

Fixes: 82e40f558d ("KVM: arm/arm64: vgic-v2: Handle SGI bits in GICD_I{S,C}PENDR0 as WI")
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-04-23 16:26:31 +01:00
Paolo Bonzini 33b2217245 KVM: x86: move nested-related kvm_x86_ops to a separate struct
Clean up some of the patching of kvm_x86_ops, by moving kvm_x86_ops related to
nested virtualization into a separate struct.

As a result, these ops will always be non-NULL on VMX.  This is not a problem:

* check_nested_events is only called if is_guest_mode(vcpu) returns true

* get_nested_state treats VMXOFF state the same as nested being disabled

* set_nested_state fails if you attempt to set nested state while
  nesting is disabled

* nested_enable_evmcs could already be called on a CPU without VMX enabled
  in CPUID.

* nested_get_evmcs_version was fixed in the previous patch

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23 09:04:57 -04:00
Paolo Bonzini 25091990ef KVM: eVMCS: check if nesting is enabled
In the next patch nested_get_evmcs_version will be always set in kvm_x86_ops for
VMX, even if nesting is disabled.  Therefore, check whether VMX (aka nesting)
is available in the function, the caller will not do the check anymore.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23 09:04:56 -04:00
Paolo Bonzini 56083bdf67 KVM: x86: check_nested_events is never NULL
Both Intel and AMD now implement it, so there is no need to check if the
callback is implemented.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23 09:04:56 -04:00
Marc Zyngier 41ee52ecbc KVM: arm: vgic: Only use the virtual state when userspace accesses enable bits
There is no point in accessing the HW when writing to any of the
ISENABLER/ICENABLER registers from userspace, as only the guest
should be allowed to change the HW state.

Introduce new userspace-specific accessors that deal solely with
the virtual state.

Reported-by: James Morse <james.morse@arm.com>
Tested-by: James Morse <james.morse@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-04-22 17:13:30 +01:00
Marc Zyngier 9a50ebbffa KVM: arm: vgic: Synchronize the whole guest on GIC{D,R}_I{S,C}ACTIVER read
When a guest tries to read the active state of its interrupts,
we currently just return whatever state we have in memory. This
means that if such an interrupt lives in a List Register on another
CPU, we fail to obsertve the latest active state for this interrupt.

In order to remedy this, stop all the other vcpus so that they exit
and we can observe the most recent value for the state. This is
similar to what we are doing for the write side of the same
registers, and results in new MMIO handlers for userspace (which
do not need to stop the guest, as it is supposed to be stopped
already).

Reported-by: Julien Grall <julien@xen.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-04-22 17:13:16 +01:00
Paolo Bonzini 00a6a5ef39 PPC KVM fix for 5.7
- Fix a regression introduced in the last merge window, which results
   in guests in HPT mode dying randomly.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJeni/pAAoJEJ2a6ncsY3GfTRoIANAQjIZi96AfJcfnrYQ4yUF7
 scxawTiJ9VavvsEJLJ7vsozrJ4xxmvmA0fFWC84uw9+BwPqoLFFvZTjazbGEDVvF
 FGwNBR/k7nfFVMIHS3K9iy9KjvYL3xkL26AgFTDJFq8hmOO9pH0txuk4r7SXb+NX
 bGG0mScAD/Dg/HwAHAS6EP3jT35QtGTK62p8foqVTziTNcmBn9Ywtg0lEzAcq2iY
 Y1BUD4Ov3cggshMI9SqHE8Yyq0XA2Wi6ggcyz/gVzvcbdFQmtg57Tri8nN8661LX
 XKh+VTpYSIxNs5GgjwlNesJzJ9h6CSynJF556qrjQ0XsXcNqvn8fcZdNQ+hnRYw=
 =Y19W
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-fixes-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-master

PPC KVM fix for 5.7

- Fix a regression introduced in the last merge window, which results
  in guests in HPT mode dying randomly.
2020-04-21 09:39:55 -04:00
Paolo Bonzini 3bda03865f KVM: s390: Fix for 5.7 and maintainer update
- Silence false positive lockdep warning
 - add Claudio as reviewer
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJenY6AAAoJEBF7vIC1phx8bykQAK+QZyD+H/zGNuqeUVn0sh8e
 yKUVMR+kuE+l57q77nt2AYVxqpCD9xSKRR+SOSLzhVH/HJf625nm+Ny/WOWMebwJ
 EA/KK+v15T5rga8gFza+4cPg4v/pHwjHhSbjTb1JWg+8cJR1BTj6OxRuTtWr5+25
 GF4RhkJOit/VhNbCo1aIgs7/7F1pPALstdPAUsHYe1PeULdRMVqSVluXT2KTPhpi
 /kzDw8sKKcYgv/eaVdcNoHv+VX1AWIRDAKEttCywyocfbu0ESwadmR7C0qlm1446
 HqowP6F0xCF0Whi/65aN4ZOv7wjO/qrV08DZ7JLA3/oKlXtZ1ieyiE2q/P1frSo1
 gvmuHiH5/UI6t6a/BSCpJwqcilxKYArqAAYBKoGiJhTbsJStqw0wl41klWTKXlTq
 VrCvjoUxQ9JMjFCQ1GXOU+ODNyX2IwZYptJ5vF24HYzBJwUBe3HPG9/BA8YcodzG
 qGQ5IKv0Q1IFTwOqnt557H0MjcBtNIEx54aLJrPy3wldsiNSj39Ft0cuvnbR+Q4F
 QhKk88dHtd7NW1IirfgYmLGe0rB1ANKM7wUGEdM5w2y5Eg8wCs8/P4KeGh0YyFI9
 xPqZDfwof6KkDjOGFXr/CeD/thi+km0/FpePb7cL5Ow4a+JmrCvqQiXrf0TbnFpv
 t5ZlHnGzoSHsEaRgmJ+X
 =d46L
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-master-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master

KVM: s390: Fix for 5.7 and maintainer update

- Silence false positive lockdep warning
- add Claudio as reviewer
2020-04-21 09:37:13 -04:00
Paolo Bonzini 1d2c6c9bd4 selftests: kvm/set_memory_region_test: do not check RIP if the guest shuts down
On AMD, the state of the VMCB is undefined after a shutdown VMEXIT.  KVM
takes a very conservative approach to that and resets the guest altogether
when that happens.  This causes the set_memory_region_test to fail
because the RIP is 0xfff0 (the reset vector).  Restrict the RIP test
to KVM_EXIT_INTERNAL_ERROR in order to fix this.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:13 -04:00
Paolo Bonzini e72436bc3a KVM: SVM: avoid infinite loop on NPF from bad address
When a nested page fault is taken from an address that does not have
a memslot associated to it, kvm_mmu_do_page_fault returns RET_PF_EMULATE
(via mmu_set_spte) and kvm_mmu_page_fault then invokes svm_need_emulation_on_page_fault.

The default answer there is to return false, but in this case this just
causes the page fault to be retried ad libitum.  Since this is not a
fast path, and the only other case where it is taken is an erratum,
just stick a kvm_vcpu_gfn_to_memslot check in there to detect the
common case where the erratum is not happening.

This fixes an infinite loop in the new set_memory_region_test.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:13 -04:00