Commit Graph

4060 Commits

Author SHA1 Message Date
Ingo Molnar 3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Masahiro Yamada 9332ef9dbd scripts/spelling.txt: add "an user" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  an user||a user
  an userspace||a userspace

I also added "userspace" to the list since it is a common word in Linux.
I found some instances for "an userfaultfd", but I did not add it to the
list.  I felt it is endless to find words that start with "user" such as
"userland" etc., so must draw a line somewhere.

Link: http://lkml.kernel.org/r/1481573103-11329-4-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:46 -08:00
Linus Torvalds fd7e9a8834 4.11 is going to be a relatively large release for KVM, with a little over
200 commits and noteworthy changes for most architectures.
 
 * ARM:
 - GICv3 save/restore
 - cache flushing fixes
 - working MSI injection for GICv3 ITS
 - physical timer emulation
 
 * MIPS:
 - various improvements under the hood
 - support for SMP guests
 - a large rewrite of MMU emulation.  KVM MIPS can now use MMU notifiers
 to support copy-on-write, KSM, idle page tracking, swapping, ballooning
 and everything else.  KVM_CAP_READONLY_MEM is also supported, so that
 writes to some memory regions can be treated as MMIO.  The new MMU also
 paves the way for hardware virtualization support.
 
 * PPC:
 - support for POWER9 using the radix-tree MMU for host and guest
 - resizable hashed page table
 - bugfixes.
 
 * s390: expose more features to the guest
 - more SIMD extensions
 - instruction execution protection
 - ESOP2
 
 * x86:
 - improved hashing in the MMU
 - faster PageLRU tracking for Intel CPUs without EPT A/D bits
 - some refactoring of nested VMX entry/exit code, preparing for live
 migration support of nested hypervisors
 - expose yet another AVX512 CPUID bit
 - host-to-guest PTP support
 - refactoring of interrupt injection, with some optimizations thrown in
 and some duct tape removed.
 - remove lazy FPU handling
 - optimizations of user-mode exits
 - optimizations of vcpu_is_preempted() for KVM guests
 
 * generic:
 - alternative signaling mechanism that doesn't pound on tsk->sighand->siglock
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJYral1AAoJEL/70l94x66DbNgH/Rx8YXuidFq2fe3RWOvld3RK
 85OM/D5g38cTLpBE0/sJpcvX34iYN8U/l5foCZwpxB+83GHEk2Cr57JyfTogdaAJ
 x8dBhHKQCA/HxSQUQLN6nFqRV+yT8WUR92Fhqx82+80BSen5Yzcfee/TDoW6T1IW
 g8CYgX9FrRaGOX066ImAuUfdAdUVjyssfs9VttDTX+HiusPeuBPx/wsRe1ZEEPlH
 vnltIJQb1ETV2GOZLUojKjzH6aZkjIl29XxjkYii9JTUornClG0DfW+5QT3uLrB5
 gJ+G+Zmpsq8ZBx9jNDtAi7sFsoPY1Mzf+JPNCGXBra2sP2GrBAuXcxmgznRYltQ=
 =8IIp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "4.11 is going to be a relatively large release for KVM, with a little
  over 200 commits and noteworthy changes for most architectures.

  ARM:
   - GICv3 save/restore
   - cache flushing fixes
   - working MSI injection for GICv3 ITS
   - physical timer emulation

  MIPS:
   - various improvements under the hood
   - support for SMP guests
   - a large rewrite of MMU emulation. KVM MIPS can now use MMU
     notifiers to support copy-on-write, KSM, idle page tracking,
     swapping, ballooning and everything else. KVM_CAP_READONLY_MEM is
     also supported, so that writes to some memory regions can be
     treated as MMIO. The new MMU also paves the way for hardware
     virtualization support.

  PPC:
   - support for POWER9 using the radix-tree MMU for host and guest
   - resizable hashed page table
   - bugfixes.

  s390:
   - expose more features to the guest
   - more SIMD extensions
   - instruction execution protection
   - ESOP2

  x86:
   - improved hashing in the MMU
   - faster PageLRU tracking for Intel CPUs without EPT A/D bits
   - some refactoring of nested VMX entry/exit code, preparing for live
     migration support of nested hypervisors
   - expose yet another AVX512 CPUID bit
   - host-to-guest PTP support
   - refactoring of interrupt injection, with some optimizations thrown
     in and some duct tape removed.
   - remove lazy FPU handling
   - optimizations of user-mode exits
   - optimizations of vcpu_is_preempted() for KVM guests

  generic:
   - alternative signaling mechanism that doesn't pound on
     tsk->sighand->siglock"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (195 commits)
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
  x86/paravirt: Change vcp_is_preempted() arg type to long
  KVM: VMX: use correct vmcs_read/write for guest segment selector/base
  x86/kvm/vmx: Defer TR reload after VM exit
  x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss
  x86/kvm/vmx: Simplify segment_base()
  x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels
  x86/kvm/vmx: Don't fetch the TSS base from the GDT
  x86/asm: Define the kernel TSS limit in a macro
  kvm: fix page struct leak in handle_vmon
  KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now
  KVM: Return an error code only as a constant in kvm_get_dirty_log()
  KVM: Return an error code only as a constant in kvm_get_dirty_log_protect()
  KVM: Return directly after a failed copy_from_user() in kvm_vm_compat_ioctl()
  KVM: x86: remove code for lazy FPU handling
  KVM: race-free exit from KVM_RUN without POSIX signals
  KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug message
  KVM: PPC: Book3S PR: Ratelimit copy data failure error messages
  KVM: Support vCPU-based gfn->hva cache
  KVM: use separate generations for each address space
  ...
2017-02-22 18:22:53 -08:00
Chao Peng 96794e4ed4 KVM: VMX: use correct vmcs_read/write for guest segment selector/base
Guest segment selector is 16 bit field and guest segment base is natural
width field. Fix two incorrect invocations accordingly.

Without this patch, build fails when aggressive inlining is used with ICC.

Cc: stable@vger.kernel.org
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21 12:45:49 +01:00
Andy Lutomirski b7ffc44d5b x86/kvm/vmx: Defer TR reload after VM exit
Intel's VMX is daft and resets the hidden TSS limit register to 0x67
on VMX reload, and the 0x67 is not configurable.  KVM currently
reloads TR using the LTR instruction on every exit, but this is quite
slow because LTR is serializing.

The 0x67 limit is entirely harmless unless ioperm() is in use, so
defer the reload until a task using ioperm() is actually running.

Here's some poorly done benchmarking using kvm-unit-tests:

Before:

cpuid 1313
vmcall 1195
mov_from_cr8 11
mov_to_cr8 17
inl_from_pmtimer 6770
inl_from_qemu 6856
inl_from_kernel 2435
outl_to_kernel 1402

After:

cpuid 1291
vmcall 1181
mov_from_cr8 11
mov_to_cr8 16
inl_from_pmtimer 6457
inl_from_qemu 6209
inl_from_kernel 2339
outl_to_kernel 1391

Signed-off-by: Andy Lutomirski <luto@kernel.org>
[Force-reload TR in invalidate_tss_limit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21 12:45:08 +01:00
Andy Lutomirski 8c2e41f7ae x86/kvm/vmx: Simplify segment_base()
Use actual pointer types for pointers (instead of unsigned long) and
replace hardcoded constants with the appropriate self-documenting
macros.

The function is still a bit messy, but this seems a lot better than
before to me.

This is mostly borrowed from a patch by Thomas Garnier.

Cc: Thomas Garnier <thgarnie@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21 11:48:56 +01:00
Andy Lutomirski e28baeadcf x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels
It was a bit buggy (it didn't list all segment types that needed
64-bit fixups), but the bug was irrelevant because it wasn't called
in any interesting context on 64-bit kernels and was only used for
data segents on 32-bit kernels.

To avoid confusion, make it explicitly 32-bit only.

Cc: Thomas Garnier <thgarnie@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21 11:48:46 +01:00
Andy Lutomirski e0c230634a x86/kvm/vmx: Don't fetch the TSS base from the GDT
The current CPU's TSS base is a foregone conclusion, so there's no need
to parse it out of the segment tables.  This should save a couple cycles
(as STR is surely microcoded and poorly optimized) but, more importantly,
it's a cleanup and it means that segment_base() will never be called on
64-bit kernels.

Cc: Thomas Garnier <thgarnie@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21 11:48:40 +01:00
Linus Torvalds 828cad8ea0 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this (fairly busy) cycle were:

   - There was a class of scheduler bugs related to forgetting to update
     the rq-clock timestamp which can cause weird and hard to debug
     problems, so there's a new debug facility for this: which uncovered
     a whole lot of bugs which convinced us that we want to keep the
     debug facility.

     (Peter Zijlstra, Matt Fleming)

   - Various cputime related updates: eliminate cputime and use u64
     nanoseconds directly, simplify and improve the arch interfaces,
     implement delayed accounting more widely, etc. - (Frederic
     Weisbecker)

   - Move code around for better structure plus cleanups (Ingo Molnar)

   - Move IO schedule accounting deeper into the scheduler plus related
     changes to improve the situation (Tejun Heo)

   - ... plus a round of sched/rt and sched/deadline fixes, plus other
     fixes, updats and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
  sched/core: Remove unlikely() annotation from sched_move_task()
  sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
  sched/topology: Split out scheduler topology code from core.c into topology.c
  sched/core: Remove unnecessary #include headers
  sched/rq_clock: Consolidate the ordering of the rq_clock methods
  delayacct: Include <uapi/linux/taskstats.h>
  sched/core: Clean up comments
  sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
  sched/clock: Add dummy clear_sched_clock_stable() stub function
  sched/cputime: Remove generic asm headers
  sched/cputime: Remove unused nsec_to_cputime()
  s390, sched/cputime: Remove unused cputime definitions
  powerpc, sched/cputime: Remove unused cputime definitions
  s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
  ia64, sched/cputime: Remove unused cputime definitions
  ia64: Convert vtime to use nsec units directly
  ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
  sched/cputime: Remove jiffies based cputime
  sched/cputime, vtime: Return nsecs instead of cputime_t to account
  sched/cputime: Complete nsec conversion of tick based accounting
  ...
2017-02-20 12:52:55 -08:00
Paolo Bonzini 06ce521af9 kvm: fix page struct leak in handle_vmon
handle_vmon gets a reference on VMXON region page,
but does not release it. Release the reference.

Found by syzkaller; based on a patch by Dmitry.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-20 17:54:43 +01:00
Paolo Bonzini bd7e5b0899 KVM: x86: remove code for lazy FPU handling
The FPU is always active now when running KVM.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-17 12:28:01 +01:00
Paolo Bonzini 460df4c1fc KVM: race-free exit from KVM_RUN without POSIX signals
The purpose of the KVM_SET_SIGNAL_MASK API is to let userspace "kick"
a VCPU out of KVM_RUN through a POSIX signal.  A signal is attached
to a dummy signal handler; by blocking the signal outside KVM_RUN and
unblocking it inside, this possible race is closed:

          VCPU thread                     service thread
   --------------------------------------------------------------
        check flag
                                          set flag
                                          raise signal
        (signal handler does nothing)
        KVM_RUN

However, one issue with KVM_SET_SIGNAL_MASK is that it has to take
tsk->sighand->siglock on every KVM_RUN.  This lock is often on a
remote NUMA node, because it is on the node of a thread's creator.
Taking this lock can be very expensive if there are many userspace
exits (as is the case for SMP Windows VMs without Hyper-V reference
time counter).

As an alternative, we can put the flag directly in kvm_run so that
KVM can see it:

          VCPU thread                     service thread
   --------------------------------------------------------------
                                          raise signal
        signal handler
          set run->immediate_exit
        KVM_RUN
          check run->immediate_exit

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-17 12:27:37 +01:00
Cao, Lei bbd6411513 KVM: Support vCPU-based gfn->hva cache
Provide versions of struct gfn_to_hva_cache functions that
take vcpu as a parameter instead of struct kvm.  The existing functions
are not needed anymore, so delete them.  This allows dirty pages to
be logged in the vcpu dirty ring, instead of the global dirty ring,
for ring-based dirty memory tracking.

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Message-Id: <CY1PR08MB19929BD2AC47A291FD680E83F04F0@CY1PR08MB1992.namprd08.prod.outlook.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-16 18:42:46 +01:00
Paolo Bonzini 47c0152e0f KVM: VMX: use vmcs_set/clear_bits for CPU-based execution controls
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-16 18:42:26 +01:00
David Hildenbrand 681bcea802 KVM: svm: inititalize hash table structures directly
The hashtable and guarding spinlock are global data structures,
we can inititalize them statically.

Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20170124212116.4568-1-david@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 15:26:06 +01:00
Jim Mattson 858e25c06f kvm: nVMX: Refactor nested_vmx_run()
Nested_vmx_run is split into two parts: the part that handles the
VMLAUNCH/VMRESUME instruction, and the part that modifies the vcpu state
to transition from VMX root mode to VMX non-root mode. The latter will
be used when restoring the checkpointed state of a vCPU that was in VMX
operation when a snapshot was taken.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:56:36 +01:00
Jim Mattson ca0bde28f2 kvm: nVMX: Split VMCS checks from nested_vmx_run()
The checks performed on the contents of the vmcs12 are extracted from
nested_vmx_run so that they can be used to validate a vmcs12 that has
been restored from a checkpoint.

Signed-off-by: Jim Mattson <jmattson@google.com>
[Change prepare_vmcs02 and nested_vmx_load_cr3's last argument to u32,
 to match check_vmentry_postreqs.  Update comments for singlestep
 handling. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:56:35 +01:00
Jim Mattson 6beb7bd52e kvm: nVMX: Refactor nested_get_vmcs12_pages()
Perform the checks on vmcs12 state early, but defer the gpa->hpa lookups
until after prepare_vmcs02. Later, when we restore the checkpointed
state of a vCPU in guest mode, we will not be able to do the gpa->hpa
lookups when the restore is done.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:56:11 +01:00
Jim Mattson a8bc284eb7 kvm: nVMX: Refactor handle_vmptrld()
Handle_vmptrld is split into two parts: the part that handles the
VMPTRLD instruction, and the part that establishes the current VMCS
pointer.  The latter will be used when restoring the checkpointed state
of a vCPU that had a valid VMCS pointer when a snapshot was taken.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:37 +01:00
Jim Mattson e29acc55bf kvm: nVMX: Refactor handle_vmon()
Handle_vmon is split into two parts: the part that handles the VMXON
instruction, and the part that modifies the vcpu state to transition
from legacy mode to VMX operation. The latter will be used when
restoring the checkpointed state of a vCPU that was in VMX operation
when a snapshot was taken.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:37 +01:00
Jim Mattson cf8b84f48a kvm: nVMX: Prepare for checkpointing L2 state
Split prepare_vmcs12 into two parts: the part that stores the current L2
guest state and the part that sets up the exit information fields. The
former will be used when checkpointing the vCPU's VMX state.

Modify prepare_vmcs02 so that it can construct a vmcs02 midway through
L2 execution, using the checkpointed L2 guest state saved into the
cached vmcs12 above.

Signed-off-by: Jim Mattson <jmattson@google.com>
[Rebasing: add from_vmentry argument to prepare_vmcs02 instead of using
 vmx->nested.nested_run_pending, because it is no longer 1 at the
 point prepare_vmcs02 is called. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:36 +01:00
Paolo Bonzini b95234c840 kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injection
Since bf9f6ac8d7 ("KVM: Update Posted-Interrupts Descriptor when vCPU
is blocked", 2015-09-18) the posted interrupt descriptor is checked
unconditionally for PIR.ON.  Therefore we don't need KVM_REQ_EVENT to
trigger the scan and, if NMIs or SMIs are not involved, we can avoid
the complicated event injection path.

Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
there since APICv was introduced.

However, without the KVM_REQ_EVENT safety net KVM needs to be much
more careful about races between vmx_deliver_posted_interrupt and
vcpu_enter_guest.  First, the IPI for posted interrupts may be issued
between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
If that happens, kvm_trigger_posted_interrupt returns true, but
smp_kvm_posted_intr_ipi doesn't do anything about it.  The guest is
entered with PIR.ON, but the posted interrupt IPI has not been sent
and the interrupt is only delivered to the guest on the next vmentry
(if any).  To fix this, disable interrupts before setting vcpu->mode.
This ensures that the IPI is delayed until the guest enters non-root mode;
it is then trapped by the processor causing the interrupt to be injected.

Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu)
and vcpu->mode = IN_GUEST_MODE.  In this case, kvm_vcpu_kick is called
but it (correctly) doesn't do anything because it sees vcpu->mode ==
OUTSIDE_GUEST_MODE.  Again, the guest is entered with PIR.ON but no
posted interrupt IPI is pending; this time, the fix for this is to move
the RVI update after IN_GUEST_MODE.

Both issues were mostly masked by the liberal usage of KVM_REQ_EVENT,
though the second could actually happen with VT-d posted interrupts.
In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
in another vmentry which would inject the interrupt.

This saves about 300 cycles on the self_ipi_* tests of vmexit.flat.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:36 +01:00
Paolo Bonzini 76dfafd536 KVM: x86: do not scan IRR twice on APICv vmentry
Calls to apic_find_highest_irr are scanning IRR twice, once
in vmx_sync_pir_from_irr and once in apic_search_irr.  Change
sync_pir_from_irr to get the new maximum IRR from kvm_apic_update_irr;
now that it does the computation, it can also do the RVI write.

In order to avoid complications in svm.c, make the callback optional.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:35 +01:00
Paolo Bonzini 3d92789f69 KVM: vmx: move sync_pir_to_irr from apic_find_highest_irr to callers
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:34 +01:00
Paolo Bonzini 810e6defcc KVM: x86: preparatory changes for APICv cleanups
Add return value to __kvm_apic_update_irr/kvm_apic_update_irr.
Move vmx_sync_pir_to_irr around.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:34 +01:00
Paolo Bonzini 0ad3bed6c5 kvm: nVMX: move nested events check to kvm_vcpu_running
vcpu_run calls kvm_vcpu_running, not kvm_arch_vcpu_runnable,
and the former does not call check_nested_events.

Once KVM_REQ_EVENT is removed from the APICv interrupt injection
path, however, this would leave no place to trigger a vmexit
from L2 to L1, causing a missed interrupt delivery while in guest
mode.  This is caught by the "ack interrupt on exit" test in
vmx.flat.

[This does not change the calls to check_nested_events in
 inject_pending_event.  That is material for a separate cleanup.]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:33 +01:00
Paolo Bonzini 967235d320 KVM: vmx: clear pending interrupts on KVM_SET_LAPIC
Pending interrupts might be in the PI descriptor when the
LAPIC is restored from an external state; we do not want
them to be injected.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:33 +01:00
Paolo Bonzini db1c056cee kvm: vmx: Use the hardware provided GPA instead of page walk
As in the SVM patch, the guest physical address is passed by
VMX to x86_emulate_instruction already, so mark the GPA as available
in vcpu->arch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:32 +01:00
Arnd Bergmann 8ef81a9a45 KVM: x86: hide KVM_HC_CLOCK_PAIRING on 32 bit
The newly added hypercall doesn't work on x86-32:

arch/x86/kvm/x86.c: In function 'kvm_pv_clock_pairing':
arch/x86/kvm/x86.c:6163:6: error: implicit declaration of function 'kvm_get_walltime_and_clockread';did you mean 'kvm_get_time_scale'? [-Werror=implicit-function-declaration]

This adds an #ifdef around it, matching the one around the related
functions that are also only implemented on 64-bit systems.

Fixes: 55dd00a73a ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-09 16:23:39 +01:00
Paolo Bonzini 2e751dfb5f kvmarm updates for 4.11
- GICv3 save restore
 - Cache flushing fixes
 - MSI injection fix for GICv3 ITS
 - Physical timer emulation support
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYnICmAAoJECPQ0LrRPXpDC04P/A73ZEL6m0vUzGpuvclxwWc6
 OCJ2C9kYloK+twyGLFbPprI4eN/70dpThgFE1Zr+ol/vAOhQlGQJoarc4n4+eYyb
 8e8IxM5Cmi44HUB64xOInidLacqeyRy5+TvKXIH0aHLgpdynSEQJu88RVXUvVgvs
 IZizhTpYueDYdexNNEkL5r2yJhVZCaczyjB1vU8k5MdODLDM63ABnPOSNJNXio2x
 itoO0EU1Lb9GhuzQj0hiMvKJPyviuPHwau7AhokUSjDPaHzaQT7TgSVioKov/rl6
 bRzhPmXqesex97ZWA5Fxr8jgSNR7JyRz+bzCLEry7XFaI3chbe0YvXeRv32PNH7I
 meuycQw64gsKmfJGRNlq30qhQQfv4fTbzpZP/j1UbvKNwhK5J6e7037c1CUH4i9C
 p9UO9HF/zAMqzD3iMcDZSpaFcbhJYrfQufbhTnbHfGC5AMVJEOWheHSEmzlDWnwr
 K5fPBxnsPv58hDmp/UZUTqCEPusY+HyuOq4ZumFSsnBwjdW+z9mLuaaTJbxaqR/G
 B6dfSQNwSnw6b2lbiXPUCm6c+Z9b190pUEWdwJ4kOTxwiPUWBppVU7gE2TrjnQ8m
 aIvEBPGIf58okjEewA5Dni6qjv7CjDN5z1V0vZUTTdVw8xuhX9eJ1Cx853SM7n0U
 sJgW5nSvSLDUpizSKdRI
 =H4vX
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

kvmarm updates for 4.11

- GICv3 save restore
- Cache flushing fixes
- MSI injection fix for GICv3 ITS
- Physical timer emulation support
2017-02-09 16:01:23 +01:00
Paolo Bonzini 80fbd89cbd KVM: x86: fix compilation
Fix rebase breakage from commit 55dd00a73a ("KVM: x86: add
KVM_HC_CLOCK_PAIRING hypercall", 2017-01-24), courtesy of the
"I could have sworn I had pushed the right branch" department.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-08 10:57:24 +01:00
Marcelo Tosatti 55dd00a73a KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall
Add a hypercall to retrieve the host realtime clock and the TSC value
used to calculate that clock read.

Used to implement clock synchronization between host and guest.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-07 18:16:45 +01:00
David Hildenbrand 6342c50ad1 KVM: nVMX: vmx_complete_nested_posted_interrupt() can't fail
vmx_complete_nested_posted_interrupt() can't fail, let's turn it into
a void function.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-07 18:16:44 +01:00
David Hildenbrand 42cf014d38 KVM: nVMX: kmap() can't fail
kmap() can't fail, therefore it will always return a valid pointer. Let's
just get rid of the unnecessary checks.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-07 18:16:44 +01:00
Radim Krčmář 00c87e9a70 KVM: x86: do not save guest-unsupported XSAVE state
Saving unsupported state prevents migration when the new host does not
support a XSAVE feature of the original host, even if the feature is not
exposed to the guest.

We've masked host features with guest-visible features before, with
4344ee981e ("KVM: x86: only copy XSAVE state for the supported
features") and dropped it when implementing XSAVES.  Do it again.

Fixes: df1daba7d1 ("KVM: x86: support XSAVES usage in the host")
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-02-03 18:43:08 +01:00
Frederic Weisbecker 5613fda9a5 sched/cputime: Convert task/group cputime to nsecs
Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:49 +01:00
Junaid Shahid d3e328f2cb kvm: x86: mmu: Verify that restored PTE has needed perms in fast page fault
Before fast page fault restores an access track PTE back to a regular PTE,
it now also verifies that the restored PTE would grant the necessary
permissions for the faulting access to succeed. If not, it falls back
to the slow page fault path.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:40 +01:00
Junaid Shahid d162f30a7c kvm: x86: mmu: Move pgtbl walk inside retry loop in fast_page_fault
Redo the page table walk in fast_page_fault when retrying so that we are
working on the latest PTE even if the hierarchy changes.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:40 +01:00
Junaid Shahid 20d65236d0 kvm: x86: mmu: Update comment in mark_spte_for_access_track
Reword the comment to hopefully make it more clear.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:39 +01:00
Junaid Shahid 312b616b30 kvm: x86: mmu: Set SPTE_SPECIAL_MASK within mmu.c
Instead of the caller including the SPTE_SPECIAL_MASK in the masks being
supplied to kvm_mmu_set_mmio_spte_mask() and kvm_mmu_set_mask_ptes(),
those functions now themselves include the SPTE_SPECIAL_MASK.

Note that bit 63 is now reset in the default MMIO mask.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:39 +01:00
Junaid Shahid ab22a4733f kvm: x86: mmu: Rename EPT_VIOLATION_READ/WRITE/INSTR constants
Rename the EPT_VIOLATION_READ/WRITE/INSTR constants to
EPT_VIOLATION_ACC_READ/WRITE/INSTR to more clearly indicate that these
signify the type of the memory access as opposed to the permissions
granted by the PTE.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:38 +01:00
Jim Mattson 0b4c208d44 Revert "KVM: nested VMX: disable perf cpuid reporting"
This reverts commit bc6134942d.

A CPUID instruction executed in VMX non-root mode always causes a
VM-exit, regardless of the leaf being queried.

Fixes: bc6134942d ("KVM: nested VMX: disable perf cpuid reporting")
Signed-off-by: Jim Mattson <jmattson@google.com>
[The issue solved by bc6134942d has been resolved with ff651cb613
 ("KVM: nVMX: Add nested msr load/restore algorithm").]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-20 22:18:55 +01:00
Piotr Luc a17f32270a kvm: x86: Expose Intel VPOPCNTDQ feature to guest
Vector population count instructions for dwords and qwords are to be
used in future Intel Xeon & Xeon Phi processors. The bit 14 of
CPUID[level:0x07, ECX] indicates that the new instructions are
supported by a processor.

The spec can be found in the Intel Software Developer Manual (SDM)
or in the Instruction Set Extensions Programming Reference (ISE).

Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-17 17:55:18 +01:00
Radim Krčmář a9ff720e0f Merge branch 'x86/cpufeature' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
For AVX512_VPOPCNTDQ.
2017-01-17 17:53:01 +01:00
Dmitry Vyukov ce2e852ecc KVM: x86: fix fixing of hypercalls
emulator_fix_hypercall() replaces hypercall with vmcall instruction,
but it does not handle GP exception properly when writes the new instruction.
It can return X86EMUL_PROPAGATE_FAULT without setting exception information.
This leads to incorrect emulation and triggers
WARN_ON(ctxt->exception.vector > 0x1f) in x86_emulate_insn()
as discovered by syzkaller fuzzer:

WARNING: CPU: 2 PID: 18646 at arch/x86/kvm/emulate.c:5558
Call Trace:
 warn_slowpath_null+0x2c/0x40 kernel/panic.c:582
 x86_emulate_insn+0x16a5/0x4090 arch/x86/kvm/emulate.c:5572
 x86_emulate_instruction+0x403/0x1cc0 arch/x86/kvm/x86.c:5618
 emulate_instruction arch/x86/include/asm/kvm_host.h:1127 [inline]
 handle_exception+0x594/0xfd0 arch/x86/kvm/vmx.c:5762
 vmx_handle_exit+0x2b7/0x38b0 arch/x86/kvm/vmx.c:8625
 vcpu_enter_guest arch/x86/kvm/x86.c:6888 [inline]
 vcpu_run arch/x86/kvm/x86.c:6947 [inline]

Set exception information when write in emulator_fix_hypercall() fails.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: kvm@vger.kernel.org
Cc: syzkaller@googlegroups.com
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-17 15:06:05 +01:00
Paolo Bonzini 33ab91103b KVM: x86: fix emulation of "MOV SS, null selector"
This is CVE-2017-2583.  On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.

The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.

Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.

Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com>
Fixes: 79d5b4c3cd
Cc: stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 15:17:13 +01:00
Wanpeng Li 546d87e5c9 KVM: x86: fix NULL deref in vcpu_scan_ioapic
Reported by syzkaller:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
    IP: _raw_spin_lock+0xc/0x30
    PGD 3e28eb067
    PUD 3f0ac6067
    PMD 0
    Oops: 0002 [#1] SMP
    CPU: 0 PID: 2431 Comm: test Tainted: G           OE   4.10.0-rc1+ #3
    Call Trace:
     ? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
     kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
     ? pick_next_task_fair+0xe1/0x4e0
     ? kvm_arch_vcpu_load+0xea/0x260 [kvm]
     kvm_vcpu_ioctl+0x33a/0x600 [kvm]
     ? hrtimer_try_to_cancel+0x29/0x130
     ? do_nanosleep+0x97/0xf0
     do_vfs_ioctl+0xa1/0x5d0
     ? __hrtimer_init+0x90/0x90
     ? do_nanosleep+0x5b/0xf0
     SyS_ioctl+0x79/0x90
     do_syscall_64+0x6e/0x180
     entry_SYSCALL64_slow_path+0x25/0x25
    RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0

The syzkaller folks reported a NULL pointer dereference due to
ENABLE_CAP succeeding even without an irqchip.  The Hyper-V
synthetic interrupt controller is activated, resulting in a
wrong request to rescan the ioapic and a NULL pointer dereference.

    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <linux/kvm.h>
    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef KVM_CAP_HYPERV_SYNIC
    #define KVM_CAP_HYPERV_SYNIC 123
    #endif

    void* thr(void* arg)
    {
	struct kvm_enable_cap cap;
	cap.flags = 0;
	cap.cap = KVM_CAP_HYPERV_SYNIC;
	ioctl((long)arg, KVM_ENABLE_CAP, &cap);
	return 0;
    }

    int main()
    {
	void *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	int kvmfd = open("/dev/kvm", 0);
	int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
	struct kvm_userspace_memory_region memreg;
	memreg.slot = 0;
	memreg.flags = 0;
	memreg.guest_phys_addr = 0;
	memreg.memory_size = 0x1000;
	memreg.userspace_addr = (unsigned long)host_mem;
	host_mem[0] = 0xf4;
	ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
	int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
	struct kvm_sregs sregs;
	ioctl(cpufd, KVM_GET_SREGS, &sregs);
	sregs.cr0 = 0;
	sregs.cr4 = 0;
	sregs.efer = 0;
	sregs.cs.selector = 0;
	sregs.cs.base = 0;
	ioctl(cpufd, KVM_SET_SREGS, &sregs);
	struct kvm_regs regs = { .rflags = 2 };
	ioctl(cpufd, KVM_SET_REGS, &regs);
	ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
	pthread_t th;
	pthread_create(&th, 0, thr, (void*)(long)cpufd);
	usleep(rand() % 10000);
	ioctl(cpufd, KVM_RUN, 0);
	pthread_join(th, 0);
	return 0;
    }

This patch fixes it by failing ENABLE_CAP if without an irqchip.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 5c919412fe (kvm/x86: Hyper-V synthetic interrupt controller)
Cc: stable@vger.kernel.org # 4.5+
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:52:52 +01:00
Steve Rutherford 129a72a0d3 KVM: x86: Introduce segmented_write_std
Introduces segemented_write_std.

Switches from emulated reads/writes to standard read/writes in fxsave,
fxrstor, sgdt, and sidt.  This fixes CVE-2017-2584, a longstanding
kernel memory leak.

Since commit 283c95d0e3 ("KVM: x86: emulate FXSAVE and FXRSTOR",
2016-11-09), which is luckily not yet in any final release, this would
also be an exploitable kernel memory *write*!

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 96051572c8
Fixes: 283c95d0e3
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:34:58 +01:00
David Matlack cef84c302f KVM: x86: flush pending lapic jump label updates on module unload
KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
These are implemented with delayed_work structs which can still be
pending when the KVM module is unloaded. We've seen this cause kernel
panics when the kvm_intel module is quickly reloaded.

Use the new static_key_deferred_flush() API to flush pending updates on
module unload.

Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:33:17 +01:00
Jim Mattson 21e7fbe7db kvm: nVMX: Reorder error checks for emulated VMXON
Checks on the operand to VMXON are performed after the check for
legacy mode operation and the #GP checks, according to the pseudo-code
in Intel's SDM.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-09 14:48:04 +01:00