Commit Graph

1441 Commits

Author SHA1 Message Date
Linus Torvalds 15303ba5d1 KVM changes for 4.16
ARM:
 - Include icache invalidation optimizations, improving VM startup time
 
 - Support for forwarded level-triggered interrupts, improving
   performance for timers and passthrough platform devices
 
 - A small fix for power-management notifiers, and some cosmetic changes
 
 PPC:
 - Add MMIO emulation for vector loads and stores
 
 - Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
   requiring the complex thread synchronization of older CPU versions
 
 - Improve the handling of escalation interrupts with the XIVE interrupt
   controller
 
 - Support decrement register migration
 
 - Various cleanups and bugfixes.
 
 s390:
 - Cornelia Huck passed maintainership to Janosch Frank
 
 - Exitless interrupts for emulated devices
 
 - Cleanup of cpuflag handling
 
 - kvm_stat counter improvements
 
 - VSIE improvements
 
 - mm cleanup
 
 x86:
 - Hypervisor part of SEV
 
 - UMIP, RDPID, and MSR_SMI_COUNT emulation
 
 - Paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
 
 - Allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more AVX512
   features
 
 - Show vcpu id in its anonymous inode name
 
 - Many fixes and cleanups
 
 - Per-VCPU MSR bitmaps (already merged through x86/pti branch)
 
 - Stable KVM clock when nesting on Hyper-V (merged through x86/hyperv)
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJafvMtAAoJEED/6hsPKofo6YcH/Rzf2RmshrWaC3q82yfIV0Qz
 Z8N8yJHSaSdc3Jo6cmiVj0zelwAxdQcyjwlT7vxt5SL2yML+/Q0st9Hc3EgGGXPm
 Il99eJEl+2MYpZgYZqV8ff3mHS5s5Jms+7BITAeh6Rgt+DyNbykEAvzt+MCHK9cP
 xtsIZQlvRF7HIrpOlaRzOPp3sK2/MDZJ1RBE7wYItK3CUAmsHim/LVYKzZkRTij3
 /9b4LP1yMMbziG+Yxt1o682EwJB5YIat6fmDG9uFeEVI5rWWN7WFubqs8gCjYy/p
 FX+BjpOdgTRnX+1m9GIj0Jlc/HKMXryDfSZS07Zy4FbGEwSiI5SfKECub4mDhuE=
 =C/uD
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "ARM:

   - icache invalidation optimizations, improving VM startup time

   - support for forwarded level-triggered interrupts, improving
     performance for timers and passthrough platform devices

   - a small fix for power-management notifiers, and some cosmetic
     changes

  PPC:

   - add MMIO emulation for vector loads and stores

   - allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
     requiring the complex thread synchronization of older CPU versions

   - improve the handling of escalation interrupts with the XIVE
     interrupt controller

   - support decrement register migration

   - various cleanups and bugfixes.

  s390:

   - Cornelia Huck passed maintainership to Janosch Frank

   - exitless interrupts for emulated devices

   - cleanup of cpuflag handling

   - kvm_stat counter improvements

   - VSIE improvements

   - mm cleanup

  x86:

   - hypervisor part of SEV

   - UMIP, RDPID, and MSR_SMI_COUNT emulation

   - paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit

   - allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more
     AVX512 features

   - show vcpu id in its anonymous inode name

   - many fixes and cleanups

   - per-VCPU MSR bitmaps (already merged through x86/pti branch)

   - stable KVM clock when nesting on Hyper-V (merged through
     x86/hyperv)"

* tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits)
  KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
  KVM: PPC: Book3S HV: Branch inside feature section
  KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
  KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
  KVM: PPC: Book3S PR: Fix broken select due to misspelling
  KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
  KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
  KVM: PPC: Book3S HV: Drop locks before reading guest memory
  kvm: x86: remove efer_reload entry in kvm_vcpu_stat
  KVM: x86: AMD Processor Topology Information
  x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
  kvm: embed vcpu id to dentry of vcpu anon inode
  kvm: Map PFN-type memory regions as writable (if possible)
  x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
  KVM: arm/arm64: Fixup userspace irqchip static key optimization
  KVM: arm/arm64: Fix userspace_irqchip_in_use counting
  KVM: arm/arm64: Fix incorrect timer_is_pending logic
  MAINTAINERS: update KVM/s390 maintainers
  MAINTAINERS: add Halil as additional vfio-ccw maintainer
  MAINTAINERS: add David as a reviewer for KVM/s390
  ...
2018-02-10 13:16:35 -08:00
Linus Torvalds 35277995e1 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull spectre/meltdown updates from Thomas Gleixner:
 "The next round of updates related to melted spectrum:

   - The initial set of spectre V1 mitigations:

       - Array index speculation blocker and its usage for syscall,
         fdtable and the n180211 driver.

       - Speculation barrier and its usage in user access functions

   - Make indirect calls in KVM speculation safe

   - Blacklisting of known to be broken microcodes so IPBP/IBSR are not
     touched.

   - The initial IBPB support and its usage in context switch

   - The exposure of the new speculation MSRs to KVM guests.

   - A fix for a regression in x86/32 related to the cpu entry area

   - Proper whitelisting for known to be safe CPUs from the mitigations.

   - objtool fixes to deal proper with retpolines and alternatives

   - Exclude __init functions from retpolines which speeds up the boot
     process.

   - Removal of the syscall64 fast path and related cleanups and
     simplifications

   - Removal of the unpatched paravirt mode which is yet another source
     of indirect unproteced calls.

   - A new and undisputed version of the module mismatch warning

   - A couple of cleanup and correctness fixes all over the place

  Yet another step towards full mitigation. There are a few things still
  missing like the RBS underflow mitigation for Skylake and other small
  details, but that's being worked on.

  That said, I'm taking a belated christmas vacation for a week and hope
  that everything is magically solved when I'm back on Feb 12th"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
  KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
  KVM/x86: Add IBPB support
  KVM/x86: Update the reverse_cpuid list to include CPUID_7_EDX
  x86/speculation: Fix typo IBRS_ATT, which should be IBRS_ALL
  x86/pti: Mark constant arrays as __initconst
  x86/spectre: Simplify spectre_v2 command line parsing
  x86/retpoline: Avoid retpolines for built-in __init functions
  x86/kvm: Update spectre-v1 mitigation
  KVM: VMX: make MSR bitmaps per-VCPU
  x86/paravirt: Remove 'noreplace-paravirt' cmdline option
  x86/speculation: Use Indirect Branch Prediction Barrier in context switch
  x86/cpuid: Fix up "virtual" IBRS/IBPB/STIBP feature bits on Intel
  x86/spectre: Fix spelling mistake: "vunerable"-> "vulnerable"
  x86/spectre: Report get_user mitigation for spectre_v1
  nl80211: Sanitize array index in parse_txq_params
  vfs, fdtable: Prevent bounds-check bypass via speculative execution
  x86/syscall: Sanitize syscall table de-references under speculation
  x86/get_user: Use pointer masking to limit speculation
  ...
2018-02-04 11:45:55 -08:00
Linus Torvalds 617aebe6a9 Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
 available to be copied to/from userspace in the face of bugs. To further
 restrict what memory is available for copying, this creates a way to
 whitelist specific areas of a given slab cache object for copying to/from
 userspace, allowing much finer granularity of access control. Slab caches
 that are never exposed to userspace can declare no whitelist for their
 objects, thereby keeping them unavailable to userspace via dynamic copy
 operations. (Note, an implicit form of whitelisting is the use of constant
 sizes in usercopy operations and get_user()/put_user(); these bypass all
 hardened usercopy checks since these sizes cannot change at runtime.)
 
 This new check is WARN-by-default, so any mistakes can be found over the
 next several releases without breaking anyone's system.
 
 The series has roughly the following sections:
 - remove %p and improve reporting with offset
 - prepare infrastructure and whitelist kmalloc
 - update VFS subsystem with whitelists
 - update SCSI subsystem with whitelists
 - update network subsystem with whitelists
 - update process memory with whitelists
 - update per-architecture thread_struct with whitelists
 - update KVM with whitelists and fix ioctl bug
 - mark all other allocations as not whitelisted
 - update lkdtm for more sensible test overage
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJabvleAAoJEIly9N/cbcAmO1kQAJnjVPutnLSbnUteZxtsv7W4
 43Cggvokfxr6l08Yh3hUowNxZVKjhF9uwMVgRRg9Nl5WdYCN+vCQbHz+ZdzGJXKq
 cGqdKWgexMKX+aBdNDrK7BphUeD46sH7JWR+a/lDV/BgPxBCm9i5ZZCgXbPP89AZ
 NpLBji7gz49wMsnm/x135xtNlZ3dG0oKETzi7MiR+NtKtUGvoIszSKy5JdPZ4m8q
 9fnXmHqmwM6uQFuzDJPt1o+D1fusTuYnjI7EgyrJRRhQ+BB3qEFZApXnKNDRS9Dm
 uB7jtcwefJCjlZVCf2+PWTOEifH2WFZXLPFlC8f44jK6iRW2Nc+wVRisJ3vSNBG1
 gaRUe/FSge68eyfQj5OFiwM/2099MNkKdZ0fSOjEBeubQpiFChjgWgcOXa5Bhlrr
 C4CIhFV2qg/tOuHDAF+Q5S96oZkaTy5qcEEwhBSW15ySDUaRWFSrtboNt6ZVOhug
 d8JJvDCQWoNu1IQozcbv6xW/Rk7miy8c0INZ4q33YUvIZpH862+vgDWfTJ73Zy9H
 jR/8eG6t3kFHKS1vWdKZzOX1bEcnd02CGElFnFYUEewKoV7ZeeLsYX7zodyUAKyi
 Yp5CImsDbWWTsptBg6h9nt2TseXTxYCt2bbmpJcqzsqSCUwOQNQ4/YpuzLeG0ihc
 JgOmUnQNJWCTwUUw5AS1
 =tzmJ
 -----END PGP SIGNATURE-----

Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardened usercopy whitelisting from Kees Cook:
 "Currently, hardened usercopy performs dynamic bounds checking on slab
  cache objects. This is good, but still leaves a lot of kernel memory
  available to be copied to/from userspace in the face of bugs.

  To further restrict what memory is available for copying, this creates
  a way to whitelist specific areas of a given slab cache object for
  copying to/from userspace, allowing much finer granularity of access
  control.

  Slab caches that are never exposed to userspace can declare no
  whitelist for their objects, thereby keeping them unavailable to
  userspace via dynamic copy operations. (Note, an implicit form of
  whitelisting is the use of constant sizes in usercopy operations and
  get_user()/put_user(); these bypass all hardened usercopy checks since
  these sizes cannot change at runtime.)

  This new check is WARN-by-default, so any mistakes can be found over
  the next several releases without breaking anyone's system.

  The series has roughly the following sections:
   - remove %p and improve reporting with offset
   - prepare infrastructure and whitelist kmalloc
   - update VFS subsystem with whitelists
   - update SCSI subsystem with whitelists
   - update network subsystem with whitelists
   - update process memory with whitelists
   - update per-architecture thread_struct with whitelists
   - update KVM with whitelists and fix ioctl bug
   - mark all other allocations as not whitelisted
   - update lkdtm for more sensible test overage"

* tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
  lkdtm: Update usercopy tests for whitelisting
  usercopy: Restrict non-usercopy caches to size 0
  kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
  kvm: whitelist struct kvm_vcpu_arch
  arm: Implement thread_struct whitelist for hardened usercopy
  arm64: Implement thread_struct whitelist for hardened usercopy
  x86: Implement thread_struct whitelist for hardened usercopy
  fork: Provide usercopy whitelisting for task_struct
  fork: Define usercopy region in thread_stack slab caches
  fork: Define usercopy region in mm_struct slab caches
  net: Restrict unwhitelisted proto caches to size 0
  sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
  sctp: Define usercopy region in SCTP proto slab cache
  caif: Define usercopy region in caif proto slab cache
  ip: Define usercopy region in IP proto slab cache
  net: Define usercopy region in struct proto slab cache
  scsi: Define usercopy region in scsi_sense_cache slab cache
  cifs: Define usercopy region in cifs_request slab cache
  vxfs: Define usercopy region in vxfs_inode slab cache
  ufs: Define usercopy region in ufs_inode_cache slab cache
  ...
2018-02-03 16:25:42 -08:00
KarimAllah Ahmed d28b387fb7 KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
[ Based on a patch from Ashok Raj <ashok.raj@intel.com> ]

Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
be using a retpoline+IBPB based approach.

To avoid the overhead of saving and restoring the MSR_IA32_SPEC_CTRL for
guests that do not actually use the MSR, only start saving and restoring
when a non-zero is written to it.

No attempt is made to handle STIBP here, intentionally. Filtering STIBP
may be added in a future patch, which may require trapping all writes
if we don't want to pass it through directly to the guest.

[dwmw2: Clean up CPUID bits, save/restore manually, handle reset]

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: kvm@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/1517522386-18410-5-git-send-email-karahmed@amazon.de
2018-02-03 23:06:52 +01:00
KarimAllah Ahmed 28c1c9fabf KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
Intel processors use MSR_IA32_ARCH_CAPABILITIES MSR to indicate RDCL_NO
(bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default the
contents will come directly from the hardware, but user-space can still
override it.

[dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: kvm@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/1517522386-18410-4-git-send-email-karahmed@amazon.de
2018-02-03 23:06:52 +01:00
Eric Biggers 8dbfb2bf1b KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
Due to a bad merge resolution between commit f298103359 ("KVM/x86:
Check input paging mode when cs.l is set") and commit b4ef9d4e8c
("KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_sregs"),
there is a case in kvm_arch_vcpu_ioctl_set_sregs() where vcpu_put() is
not called after vcpu_get().  Fix it.

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-02-02 17:49:55 +01:00
Radim Krčmář 7bf14c28ee Merge branch 'x86/hyperv' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Topic branch for stable KVM clockource under Hyper-V.

Thanks to Christoffer Dall for resolving the ARM conflict.
2018-02-01 15:04:17 +01:00
Longpeng(Mike) 87cedc6be5 kvm: x86: remove efer_reload entry in kvm_vcpu_stat
The efer_reload is never used since
commit 26bb0981b3 ("KVM: VMX: Use shared msr infrastructure"),
so remove it.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-31 18:25:35 +01:00
Vitaly Kuznetsov d391f12070 x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
I was investigating an issue with seabios >= 1.10 which stopped working
for nested KVM on Hyper-V. The problem appears to be in
handle_ept_violation() function: when we do fast mmio we need to skip
the instruction so we do kvm_skip_emulated_instruction(). This, however,
depends on VM_EXIT_INSTRUCTION_LEN field being set correctly in VMCS.
However, this is not the case.

Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when
EPT MISCONFIG occurs. While on real hardware it was observed to be set,
some hypervisors follow the spec and don't set it; we end up advancing
IP with some random value.

I checked with Microsoft and they confirmed they don't fill
VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG.

Fix the issue by doing instruction skip through emulator when running
nested.

Fixes: 68c3b4d167
Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-31 18:25:34 +01:00
Thomas Gleixner 5fa4ec9cb2 x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
The reenlightment support for hyperv slapped a direct reference to
x86_hyper_type into the kvm code which results in the following build
failure when CONFIG_HYPERVISOR_GUEST=n:

arch/x86/kvm/x86.c:6259:6: error: ‘x86_hyper_type’ undeclared (first use in this function)
arch/x86/kvm/x86.c:6259:6: note: each undeclared identifier is reported only once for each function it appears in

Use the proper helper function to cure that.

The 32bit compile fails because of:

arch/x86/kvm/x86.c:5936:13: warning: ‘kvm_hyperv_tsc_notifier’ defined but not used [-Wunused-function]

which is a real trainwreck engineering artwork. The callsite is wrapped
into #ifdef CONFIG_X86_64, but the function itself has the #ifdef inside
the function body. Make the function itself wrapped into the ifdef to cure
that.

Qualiteee....

Fixes: 0092e4346f ("x86/kvm: Support Hyper-V reenlightenment")
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: devel@linuxdriverproject.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Mohammed Gamal <mmorsy@redhat.com>
2018-01-31 10:29:40 +01:00
Vitaly Kuznetsov 0092e4346f x86/kvm: Support Hyper-V reenlightenment
When running nested KVM on Hyper-V guests its required to update
masterclocks for all guests when L1 migrates to a host with different TSC
frequency.

Implement the procedure in the following way:
  - Pause all guests.
  - Tell the host (Hyper-V) to stop emulating TSC accesses.
  - Update the gtod copy, recompute clocks.
  - Unpause all guests.

This is somewhat similar to cpufreq but there are two important differences:
 - TSC emulation can only be disabled globally (on all CPUs)
 - The new TSC frequency is not known until emulation is turned off so
   there is no way to 'prepare' for the event upfront.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: devel@linuxdriverproject.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Mohammed Gamal <mmorsy@redhat.com>
Link: https://lkml.kernel.org/r/20180124132337.30138-8-vkuznets@redhat.com
2018-01-30 23:55:34 +01:00
Vitaly Kuznetsov b0c39dc68e x86/kvm: Pass stable clocksource to guests when running nested on Hyper-V
Currently, KVM is able to work in 'masterclock' mode passing
PVCLOCK_TSC_STABLE_BIT to guests when the clocksource which is used on the
host is TSC.

When running nested on Hyper-V the guest normally uses a different one: TSC
page which is resistant to TSC frequency changes on events like L1
migration. Add support for it in KVM.

The only non-trivial change is in vgettsc(): when updating the gtod copy
both the clock readout and tsc value have to be updated now.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: devel@linuxdriverproject.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Mohammed Gamal <mmorsy@redhat.com>
Link: https://lkml.kernel.org/r/20180124132337.30138-7-vkuznets@redhat.com
2018-01-30 23:55:34 +01:00
Tianyu Lan 37b95951c5 KVM/x86: Fix wrong macro references of X86_CR0_PG_BIT and X86_CR4_PAE_BIT in kvm_valid_sregs()
kvm_valid_sregs() should use X86_CR0_PG and X86_CR4_PAE to check bit
status rather than X86_CR0_PG_BIT and X86_CR4_PAE_BIT. This patch is
to fix it.

Fixes: f29810335965a(KVM/x86: Check input paging mode when cs.l is set)
Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-17 15:01:11 +01:00
Liran Alon fa59cc0038 KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr()
sync_pir_to_irr() is only called if vcpu->arch.apicv_active()==true.
In case it is false, VMX code make sure to set sync_pir_to_irr
to NULL.

Therefore, having SVM stubs allows to remove check for if
sync_pir_to_irr != NULL from all calling sites.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
[Return highest IRR in the SVM case. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:40:09 +01:00
Paolo Bonzini 65e38583c3 Merge branch 'sev-v9-p2' of https://github.com/codomania/kvm
This part of Secure Encrypted Virtualization (SEV) patch series focuses on KVM
changes required to create and manage SEV guests.

SEV is an extension to the AMD-V architecture which supports running encrypted
virtual machine (VMs) under the control of a hypervisor. Encrypted VMs have their
pages (code and data) secured such that only the guest itself has access to
unencrypted version. Each encrypted VM is associated with a unique encryption key;
if its data is accessed to a different entity using a different key the encrypted
guest's data will be incorrectly decrypted, leading to unintelligible data.
This security model ensures that hypervisor will no longer able to inspect or
alter any guest code or data.

The key management of this feature is handled by a separate processor known as
the AMD Secure Processor (AMD-SP) which is present on AMD SOCs. The SEV Key
Management Specification (see below) provides a set of commands which can be
used by hypervisor to load virtual machine keys through the AMD-SP driver.

The patch series adds a new ioctl in KVM driver (KVM_MEMORY_ENCRYPT_OP). The
ioctl will be used by qemu to issue SEV guest-specific commands defined in Key
Management Specification.

The following links provide additional details:

AMD Memory Encryption white paper:
http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf

AMD64 Architecture Programmer's Manual:
    http://support.amd.com/TechDocs/24593.pdf
    SME is section 7.10
    SEV is section 15.34

SEV Key Management:
http://support.amd.com/TechDocs/55766_SEV-KM API_Specification.pdf

KVM Forum Presentation:
http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf

SEV Guest BIOS support:
  SEV support has been add to EDKII/OVMF BIOS
  https://github.com/tianocore/edk2

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-01-16 16:35:32 +01:00
Paolo Bonzini 476b7adaa3 KVM: x86: avoid unnecessary XSETBV on guest entry
xsetbv can be expensive when running on nested virtualization, try to
avoid it.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Quan Xu <quan.xu0@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:34:13 +01:00
Wanpeng Li efdab99281 KVM: x86: fix escape of guest dr6 to the host
syzkaller reported:

   WARNING: CPU: 0 PID: 12927 at arch/x86/kernel/traps.c:780 do_debug+0x222/0x250
   CPU: 0 PID: 12927 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #16
   RIP: 0010:do_debug+0x222/0x250
   Call Trace:
    <#DB>
    debug+0x3e/0x70
   RIP: 0010:copy_user_enhanced_fast_string+0x10/0x20
    </#DB>
    _copy_from_user+0x5b/0x90
    SyS_timer_create+0x33/0x80
    entry_SYSCALL_64_fastpath+0x23/0x9a

The testcase sets a watchpoint (with perf_event_open) on a buffer that is
passed to timer_create() as the struct sigevent argument.  In timer_create(),
copy_from_user()'s rep movsb triggers the BP.  The testcase also sets
the debug registers for the guest.

However, KVM only restores host debug registers when the host has active
watchpoints, which triggers a race condition when running the testcase with
multiple threads.  The guest's DR6.BS bit can escape to the host before
another thread invokes timer_create(), and do_debug() complains.

The fix is to respect do_debug()'s dr6 invariant when leaving KVM.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:34:13 +01:00
Wanpeng Li f38a7b7526 KVM: X86: support paravirtualized help for TLB shootdowns
When running on a virtual machine, IPIs are expensive when the target
CPU is sleeping.  Thus, it is nice to be able to avoid them for TLB
shootdowns.  KVM can just do the flush via INVVPID on the guest's behalf
the next time the CPU is scheduled.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Use "&" to test the bit instead of "==". - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:34:13 +01:00
Wanpeng Li c2ba05ccfd KVM: X86: introduce invalidate_gpa argument to tlb flush
Introduce a new bool invalidate_gpa argument to kvm_x86_ops->tlb_flush,
it will be used by later patches to just flush guest tlb.

For VMX, this will use INVVPID instead of INVEPT, which will invalidate
combined mappings while keeping guest-physical mappings.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:34:13 +01:00
Wanpeng Li fa55eedd63 KVM: X86: Add KVM_VCPU_PREEMPTED
The next patch will add another bit to the preempted field in
kvm_steal_time.  Define a constant for bit 0 (the only one that is
currently used).

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-01-16 16:34:13 +01:00
Paolo Bonzini 51776043af kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
This ioctl is obsolete (it was used by Xenner as far as I know) but
still let's not break it gratuitously...  Its handler is copying
directly into struct kvm.  Go through a bounce buffer instead, with
the added benefit that we can actually do something useful with the
flags argument---the previous code was exiting with -EINVAL but still
doing the copy.

This technically is a userspace ABI breakage, but since no one should be
using the ioctl, it's a good occasion to see if someone actually
complains.

Cc: kernel-hardening@lists.openwall.com
Cc: Kees Cook <keescook@chromium.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15 12:08:07 -08:00
Wanpeng Li e39d200fa5 KVM: Fix stack-out-of-bounds read in write_mmio
Reported by syzkaller:

  BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
  Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

  CPU: 6 PID: 32298 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #18
  Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
  Call Trace:
   dump_stack+0xab/0xe1
   print_address_description+0x6b/0x290
   kasan_report+0x28a/0x370
   write_mmio+0x11e/0x270 [kvm]
   emulator_read_write_onepage+0x311/0x600 [kvm]
   emulator_read_write+0xef/0x240 [kvm]
   emulator_fix_hypercall+0x105/0x150 [kvm]
   em_hypercall+0x2b/0x80 [kvm]
   x86_emulate_insn+0x2b1/0x1640 [kvm]
   x86_emulate_instruction+0x39a/0xb90 [kvm]
   handle_exception+0x1b4/0x4d0 [kvm_intel]
   vcpu_enter_guest+0x15a0/0x2640 [kvm]
   kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
   kvm_vcpu_ioctl+0x479/0x880 [kvm]
   do_vfs_ioctl+0x142/0x9a0
   SyS_ioctl+0x74/0x80
   entry_SYSCALL_64_fastpath+0x23/0x9a

The path of patched vmmcall will patch 3 bytes opcode 0F 01 C1(vmcall)
to the guest memory, however, write_mmio tracepoint always prints 8 bytes
through *(u64 *)val since kvm splits the mmio access into 8 bytes. This
leaks 5 bytes from the kernel stack (CVE-2017-17741).  This patch fixes
it by just accessing the bytes which we operate on.

Before patch:

syz-executor-5567  [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

After patch:

syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-18 12:57:01 +01:00
Lan Tianyu f298103359 KVM/x86: Check input paging mode when cs.l is set
Reported by syzkaller:
    WARNING: CPU: 0 PID: 27962 at arch/x86/kvm/emulate.c:5631 x86_emulate_insn+0x557/0x15f0 [kvm]
    Modules linked in: kvm_intel kvm [last unloaded: kvm]
    CPU: 0 PID: 27962 Comm: syz-executor Tainted: G    B   W        4.15.0-rc2-next-20171208+ #32
    Hardware name: Intel Corporation S1200SP/S1200SP, BIOS S1200SP.86B.01.03.0006.040720161253 04/07/2016
    RIP: 0010:x86_emulate_insn+0x557/0x15f0 [kvm]
    RSP: 0018:ffff8807234476d0 EFLAGS: 00010282
    RAX: 0000000000000000 RBX: ffff88072d0237a0 RCX: ffffffffa0065c4d
    RDX: 1ffff100e5a046f9 RSI: 0000000000000003 RDI: ffff88072d0237c8
    RBP: ffff880723447728 R08: ffff88072d020000 R09: ffffffffa008d240
    R10: 0000000000000002 R11: ffffed00e7d87db3 R12: ffff88072d0237c8
    R13: ffff88072d023870 R14: ffff88072d0238c2 R15: ffffffffa008d080
    FS:  00007f8a68666700(0000) GS:ffff880802200000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000002009506c CR3: 000000071fec4005 CR4: 00000000003626f0
    Call Trace:
     x86_emulate_instruction+0x3bc/0xb70 [kvm]
     ? reexecute_instruction.part.162+0x130/0x130 [kvm]
     vmx_handle_exit+0x46d/0x14f0 [kvm_intel]
     ? trace_event_raw_event_kvm_entry+0xe7/0x150 [kvm]
     ? handle_vmfunc+0x2f0/0x2f0 [kvm_intel]
     ? wait_lapic_expire+0x25/0x270 [kvm]
     vcpu_enter_guest+0x720/0x1ef0 [kvm]
     ...

When CS.L is set, vcpu should run in the 64 bit paging mode.
Current kvm set_sregs function doesn't have such check when
userspace inputs sreg values. This will lead unexpected behavior.
This patch is to add checks for CS.L, EFER.LME, EFER.LMA and
CR4.PAE when get SREG inputs from userspace in order to avoid
unexpected behavior.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Tianyu Lan <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-15 10:01:46 +01:00
Christoffer Dall 9b062471e5 KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl
Move the calls to vcpu_load() and vcpu_put() in to the architecture
specific implementations of kvm_arch_vcpu_ioctl() which dispatches
further architecture-specific ioctls on to other functions.

Some architectures support asynchronous vcpu ioctls which cannot call
vcpu_load() or take the vcpu->mutex, because that would prevent
concurrent execution with a running VCPU, which is the intended purpose
of these ioctls, for example because they inject interrupts.

We repeat the separate checks for these specifics in the architecture
code for MIPS, S390 and PPC, and avoid taking the vcpu->mutex and
calling vcpu_load for these ioctls.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:58 +01:00
Christoffer Dall 6a96bc7fa0 KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_fpu
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_fpu().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:57 +01:00
Christoffer Dall 1393123e1e KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_fpu
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_fpu().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:56 +01:00
Christoffer Dall 66b5656222 KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_guest_debug
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_guest_debug().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:56 +01:00
Christoffer Dall 1da5b61dac KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_translate
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_translate().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:55 +01:00
Christoffer Dall e83dff5edf KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_mpstate
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_mpstate().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:54 +01:00
Christoffer Dall fd2325612c KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_mpstate
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_mpstate().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:54 +01:00
Christoffer Dall b4ef9d4e8c KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_sregs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_sregs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:53 +01:00
Christoffer Dall bcdec41cef KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_sregs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_sregs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:52 +01:00
Christoffer Dall 875656fe0c KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_regs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_regs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:52 +01:00
Christoffer Dall 1fc9b76b3d KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_regs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_regs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:51 +01:00
Christoffer Dall accb757d79 KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_run
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_run().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390 parts
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
[Rebased. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:50 +01:00
Christoffer Dall ec7660ccdd KVM: Take vcpu->mutex outside vcpu_load
As we're about to call vcpu_load() from architecture-specific
implementations of the KVM vcpu ioctls, but yet we access data
structures protected by the vcpu->mutex in the generic code, factor
this logic out from vcpu_load().

x86 is the only architecture which calls vcpu_load() outside of the main
vcpu ioctl function, and these calls will no longer take the vcpu mutex
following this patch.  However, with the exception of
kvm_arch_vcpu_postcreate (see below), the callers are either in the
creation or destruction path of the VCPU, which means there cannot be
any concurrent access to the data structure, because the file descriptor
is not yet accessible, or is already gone.

kvm_arch_vcpu_postcreate makes the newly created vcpu potentially
accessible by other in-kernel threads through the kvm->vcpus array, and
we therefore take the vcpu mutex in this case directly.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:49 +01:00
Wanpeng Li 9c48d517ce KVM: X86: Reduce the overhead when lapic_timer_advance is disabled
When I run ebizzy in a 32 vCPUs guest on a 32 pCPUs Xeon box, I can observe
~8000 kvm_wait_lapic_expire CurAvg/s through kvm_stat tool even if the advance
tscdeadline hrtimer expiration is disabled. Each call to wait_lapic_expire()
will consume ~70 cycles when a timer fires since apic_timer_expire() will
set expired_tscdeadline and then wait_lapic_expire() will do some caculation
before bailing out. So total ~175us per second is lost on this 3.2Ghz machine.
This patch reduces the overhead by skipping the function wait_lapic_expire()
when lapic_timer_advance is disabled.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:48 +01:00
Liran Alon 52797bf9a8 KVM: x86: Add emulation of MSR_SMI_COUNT
This MSR returns the number of #SMIs that occurred on CPU since
boot.

It was seen to be used frequently by ESXi guest.

Patch adds a new vcpu-arch specific var called smi_count to
save the number of #SMIs which occurred on CPU since boot.
It is exposed as a read-only MSR to guest (causing #GP
on wrmsr) in RDMSR/WRMSR emulation code.
MSR_SMI_COUNT is also added to emulated_msrs[] to make sure
user-space can save/restore it for migration purposes.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-12-14 09:26:44 +01:00
Paolo Bonzini ae3e61e1c2 KVM: x86: add support for UMIP
Add the CPUID bits, make the CR4.UMIP bit not reserved anymore, and
add UMIP support for instructions that are already emulated by KVM.

Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:38 +01:00
Peter Xu 5663d8f9bb kvm: x86: fix WARN due to uninitialized guest FPU state
------------[ cut here ]------------
 Bad FPU state detected at kvm_put_guest_fpu+0xd8/0x2d0 [kvm], reinitializing FPU registers.
 WARNING: CPU: 1 PID: 4594 at arch/x86/mm/extable.c:103 ex_handler_fprestore+0x88/0x90
 CPU: 1 PID: 4594 Comm: qemu-system-x86 Tainted: G    B      OE    4.15.0-rc2+ #10
 RIP: 0010:ex_handler_fprestore+0x88/0x90
 Call Trace:
  fixup_exception+0x4e/0x60
  do_general_protection+0xff/0x270
  general_protection+0x22/0x30
 RIP: 0010:kvm_put_guest_fpu+0xd8/0x2d0 [kvm]
 RSP: 0018:ffff8803d5627810 EFLAGS: 00010246
  kvm_vcpu_reset+0x3b4/0x3c0 [kvm]
  kvm_apic_accept_events+0x1c0/0x240 [kvm]
  kvm_arch_vcpu_ioctl_run+0x1658/0x2fb0 [kvm]
  kvm_vcpu_ioctl+0x479/0x880 [kvm]
  do_vfs_ioctl+0x142/0x9a0
  SyS_ioctl+0x74/0x80
  do_syscall_64+0x15f/0x600

where kvm_put_guest_fpu is called without a prior kvm_load_guest_fpu.
To fix it, move kvm_load_guest_fpu to the very beginning of
kvm_arch_vcpu_ioctl_run.

Cc: stable@vger.kernel.org
Fixes: f775b13eed
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:24:35 +01:00
Wanpeng Li d73235d17b KVM: X86: Fix load RFLAGS w/o the fixed bit
*** Guest State ***
 CR0: actual=0x0000000000000030, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
 CR4: actual=0x0000000000002050, shadow=0x0000000000000000, gh_mask=ffffffffffffe871
 CR3 = 0x00000000fffbc000
 RSP = 0x0000000000000000  RIP = 0x0000000000000000
 RFLAGS=0x00000000         DR7 = 0x0000000000000400
        ^^^^^^^^^^

The failed vmentry is triggered by the following testcase when ept=Y:

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <linux/kvm.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>

    long r[5];
    int main()
    {
    	r[2] = open("/dev/kvm", O_RDONLY);
    	r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
    	r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
    	struct kvm_regs regs = {
    		.rflags = 0,
    	};
    	ioctl(r[4], KVM_SET_REGS, &regs);
    	ioctl(r[4], KVM_RUN, 0);
    }

X86 RFLAGS bit 1 is fixed set, userspace can simply clearing bit 1
of RFLAGS with KVM_SET_REGS ioctl which results in vmentry fails.
This patch fixes it by oring X86_EFLAGS_FIXED during ioctl.

Cc: stable@vger.kernel.org
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Quan Xu <quan.xu0@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:24:26 +01:00
Radim Krčmář b1394e745b KVM: x86: fix APIC page invalidation
Implementation of the unpinned APIC page didn't update the VMCS address
cache when invalidation was done through range mmu notifiers.
This became a problem when the page notifier was removed.

Re-introduce the arch-specific helper and call it from ...range_start.

Reported-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Fixes: 38b9917350 ("kvm: vmx: Implement set_apic_access_page_addr")
Fixes: 369ea8242c ("mm/rmap: update to new mmu_notifier semantic v2")
Cc: <stable@vger.kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Tested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-12-06 16:10:34 +01:00
Rik van Riel 6ab0b9feb8 x86,kvm: remove KVM emulator get_fpu / put_fpu
Now that get_fpu and put_fpu do nothing, because the scheduler will
automatically load and restore the guest FPU context for us while we
are in this code (deep inside the vcpu_run main loop), we can get rid
of the get_fpu and put_fpu hooks.

Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-05 21:20:24 +01:00
Rik van Riel f775b13eed x86,kvm: move qemu/guest FPU switching out to vcpu_run
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.

This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.

This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.

This can fix a bug where KVM calls get_user_pages while owning the
FPU, and the file system ends up requesting the FPU again:

    [258270.527947]  __warn+0xcb/0xf0
    [258270.527948]  warn_slowpath_null+0x1d/0x20
    [258270.527951]  kernel_fpu_disable+0x3f/0x50
    [258270.527953]  __kernel_fpu_begin+0x49/0x100
    [258270.527955]  kernel_fpu_begin+0xe/0x10
    [258270.527958]  crc32c_pcl_intel_update+0x84/0xb0
    [258270.527961]  crypto_shash_update+0x3f/0x110
    [258270.527968]  crc32c+0x63/0x8a [libcrc32c]
    [258270.527975]  dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
    [258270.527978]  node_prepare_for_write+0x44/0x70 [dm_persistent_data]
    [258270.527985]  dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
    [258270.527988]  submit_io+0x170/0x1b0 [dm_bufio]
    [258270.527992]  __write_dirty_buffer+0x89/0x90 [dm_bufio]
    [258270.527994]  __make_buffer_clean+0x4f/0x80 [dm_bufio]
    [258270.527996]  __try_evict_buffer+0x42/0x60 [dm_bufio]
    [258270.527998]  dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
    [258270.528002]  shrink_slab.part.40+0x1f5/0x420
    [258270.528004]  shrink_node+0x22c/0x320
    [258270.528006]  do_try_to_free_pages+0xf5/0x330
    [258270.528008]  try_to_free_pages+0xe9/0x190
    [258270.528009]  __alloc_pages_slowpath+0x40f/0xba0
    [258270.528011]  __alloc_pages_nodemask+0x209/0x260
    [258270.528014]  alloc_pages_vma+0x1f1/0x250
    [258270.528017]  do_huge_pmd_anonymous_page+0x123/0x660
    [258270.528021]  handle_mm_fault+0xfd3/0x1330
    [258270.528025]  __get_user_pages+0x113/0x640
    [258270.528027]  get_user_pages+0x4f/0x60
    [258270.528063]  __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
    [258270.528108]  try_async_pf+0x66/0x230 [kvm]
    [258270.528135]  tdp_page_fault+0x130/0x280 [kvm]
    [258270.528149]  kvm_mmu_page_fault+0x60/0x120 [kvm]
    [258270.528158]  handle_ept_violation+0x91/0x170 [kvm_intel]
    [258270.528162]  vmx_handle_exit+0x1ca/0x1400 [kvm_intel]

No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.

Cc: stable@vger.kernel.org
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Fixed a bug where reset_vcpu called put_fpu without preceding load_fpu,
 which happened inside from KVM_CREATE_VCPU ioctl. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-12-05 21:16:43 +01:00
Brijesh Singh 69eaedee41 KVM: Introduce KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctl
If hardware supports memory encryption then KVM_MEMORY_ENCRYPT_REG_REGION
and KVM_MEMORY_ENCRYPT_UNREG_REGION ioctl's can be used by userspace to
register/unregister the guest memory regions which may contain the encrypted
data (e.g guest RAM, PCI BAR, SMRAM etc).

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Improvements-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
2017-12-04 10:57:26 -06:00
Brijesh Singh 5acc5c0631 KVM: Introduce KVM_MEMORY_ENCRYPT_OP ioctl
If the hardware supports memory encryption then the
KVM_MEMORY_ENCRYPT_OP ioctl can be used by qemu to issue a platform
specific memory encryption commands.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
2017-12-04 10:57:26 -06:00
Jan H. Schönherr 20b7035c66 KVM: Let KVM_SET_SIGNAL_MASK work as advertised
KVM API says for the signal mask you set via KVM_SET_SIGNAL_MASK, that
"any unblocked signal received [...] will cause KVM_RUN to return with
-EINTR" and that "the signal will only be delivered if not blocked by
the original signal mask".

This, however, is only true, when the calling task has a signal handler
registered for a signal. If not, signal evaluation is short-circuited for
SIG_IGN and SIG_DFL, and the signal is either ignored without KVM_RUN
returning or the whole process is terminated.

Make KVM_SET_SIGNAL_MASK behave as advertised by utilizing logic similar
to that in do_sigtimedwait() to avoid short-circuiting of signals.

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-11-27 17:53:47 +01:00
Wanpeng Li e70b57a6ce KVM: X86: Fix softlockup when get the current kvmclock
watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [qemu-system-x86:10185]
 CPU: 6 PID: 10185 Comm: qemu-system-x86 Tainted: G           OE   4.14.0-rc4+ #4
 RIP: 0010:kvm_get_time_scale+0x4e/0xa0 [kvm]
 Call Trace:
  get_time_ref_counter+0x5a/0x80 [kvm]
  kvm_hv_process_stimers+0x120/0x5f0 [kvm]
  kvm_arch_vcpu_ioctl_run+0x4b4/0x1690 [kvm]
  kvm_vcpu_ioctl+0x33a/0x620 [kvm]
  do_vfs_ioctl+0xa1/0x5d0
  SyS_ioctl+0x79/0x90
  entry_SYSCALL_64_fastpath+0x1e/0xa9

This can be reproduced when running kvm-unit-tests/hyperv_stimer.flat and
cpu-hotplug stress simultaneously. __this_cpu_read(cpu_tsc_khz) returns 0
(set in kvmclock_cpu_down_prep()) when the pCPU is unhotplug which results
in kvm_get_time_scale() gets into an infinite loop.

This patch fixes it by treating the unhotplug pCPU as not using master clock.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-11-27 17:32:53 +01:00
Paolo Bonzini 6ea6e84309 KVM: x86: inject exceptions produced by x86_decode_insn
Sometimes, a processor might execute an instruction while another
processor is updating the page tables for that instruction's code page,
but before the TLB shootdown completes.  The interesting case happens
if the page is in the TLB.

In general, the processor will succeed in executing the instruction and
nothing bad happens.  However, what if the instruction is an MMIO access?
If *that* happens, KVM invokes the emulator, and the emulator gets the
updated page tables.  If the update side had marked the code page as non
present, the page table walk then will fail and so will x86_decode_insn.

Unfortunately, even though kvm_fetch_guest_virt is correctly returning
X86EMUL_PROPAGATE_FAULT, x86_decode_insn's caller treats the failure as
a fatal error if the instruction cannot simply be reexecuted (as is the
case for MMIO).  And this in fact happened sometimes when rebooting
Windows 2012r2 guests.  Just checking ctxt->have_exception and injecting
the exception if true is enough to fix the case.

Thanks to Eduardo Habkost for helping in the debugging of this issue.

Reported-by: Yanan Fu <yfu@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-11-17 13:20:16 +01:00
Eyal Moscovici fab0aa3b77 KVM: x86: Allow suppressing prints on RDMSR/WRMSR of unhandled MSRs
Some guests use these unhandled MSRs very frequently.
This cause dmesg to be populated with lots of aggregated messages on
usage of ignored MSRs. As ignore_msrs=true means that the user is
well-aware his guest use ignored MSRs, allow to also disable the
prints on their usage.

An example of such guest is ESXi which tends to access a lot to MSR
0x34 (MSR_SMI_COUNT) very frequently.

In addition, we have observed this to cause unnecessary delays to
guest execution. Such an example is ESXi which experience networking
delays in it's guests (L2 guests) because of these prints (even when
prints are rate-limited). This can easily be reproduced by pinging
from one L2 guest to another.  Once in a while, a peak in ping RTT
will be observed. Removing these unhandled MSR prints solves the
issue.

Because these prints can help diagnose issues with guests,
this commit only suppress them by a module parameter instead of
removing them from code entirely.

Signed-off-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Changed suppress_ignore_msrs_prints to report_ignored_msrs - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-11-17 13:20:16 +01:00