2020-02-10 14:02:59 +08:00
.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before
  synchronize_srcu(&kvm->srcu).  Therefore kvm->slots_arch_lock
  can be taken inside a kvm->srcu read-side critical section,
  while kvm->slots_lock cannot.

- kvm->mn_active_invalidate_count ensures that pairs of
  invalidate_range_start() and invalidate_range_end() callbacks
  use the same memslots array.  kvm->slots_lock and kvm->slots_arch_lock
  are taken on the waiting side in install_new_memslots, so MMU notifiers
  must not take either kvm->slots_lock or kvm->slots_arch_lock.

On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock

- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock and
  kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
  cannot be taken without already holding kvm->arch.mmu_lock (typically with
  ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).

Everything else is a leaf: no other lock is taken inside the critical
sections.

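The generic ordering rules above can be written down as a small "must be
taken outside" table and checked mechanically. The following is only a
sketch: the ``demo_*`` names are hypothetical and exist purely for
illustration, and the check is far simpler than what lockdep actually does.

```c
#include <stdbool.h>

/* Hypothetical identifiers for the generic mutexes named above. */
enum demo_lock {
	DEMO_KVM_LOCK,		/* kvm->lock */
	DEMO_VCPU_MUTEX,	/* vcpu->mutex */
	DEMO_SLOTS_LOCK,	/* kvm->slots_lock */
	DEMO_IRQ_LOCK,		/* kvm->irq_lock */
	DEMO_NR_LOCKS
};

/* demo_outside[a][b] is true when lock a must be taken outside lock b. */
static const bool demo_outside[DEMO_NR_LOCKS][DEMO_NR_LOCKS] = {
	[DEMO_KVM_LOCK] = {
		[DEMO_VCPU_MUTEX] = true,
		[DEMO_SLOTS_LOCK] = true,
		[DEMO_IRQ_LOCK] = true,
	},
	[DEMO_SLOTS_LOCK] = {
		[DEMO_IRQ_LOCK] = true,
	},
};

/*
 * Return true if taking 'next' is allowed while the locks in 'held'
 * (a bitmask of 1u << enum demo_lock) are already held.
 */
static bool demo_order_ok(unsigned int held, enum demo_lock next)
{
	for (int h = 0; h < DEMO_NR_LOCKS; h++) {
		/* 'next' belongs outside a lock that is already held. */
		if ((held & (1u << h)) && demo_outside[next][h])
			return false;
	}
	return true;
}
```

For example, ``demo_order_ok(1u << DEMO_KVM_LOCK, DEMO_VCPU_MUTEX)`` is
true, while ``demo_order_ok(1u << DEMO_VCPU_MUTEX, DEMO_KVM_LOCK)`` is
false. Pairs that the list does not order, such as vcpu->mutex and
kvm->slots_lock, are accepted in either order by this sketch.
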
2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault outside
the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking. That means we need to restore the saved R/X bits. This is
   described in more detail later below.

2. Write-Protection: The SPTE is present and the fault is caused by
   write-protect. That means we just need to change the W bit of the spte.

To avoid all races, we use the Host-writable bit and the MMU-writable bit
of the spte:

- Host-writable means the gfn is writable in the host kernel page tables and in
  its KVM memslot.
- MMU-writable means the gfn is writable in the guest's mmu and it is not
  write-protected by shadow page write-protection.

On the fast page fault path, we use cmpxchg to atomically set the spte W
bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the
saved R/X bits for an access-tracked spte, or both. This is safe because any
change to these bits is detected by cmpxchg.

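The write-protection case can be sketched in user space with C11 atomics.
This is an illustration only, not the KVM implementation: the bit positions
and the ``demo_`` names are assumptions, and the two condition bits are
modelled here as the Host-writable and MMU-writable bits described above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for illustration only. */
#define DEMO_SPTE_W		(1ull << 1)
#define DEMO_SPTE_HOST_WRITABLE	(1ull << 9)
#define DEMO_SPTE_MMU_WRITABLE	(1ull << 10)

/*
 * Try to make the spte writable without holding mmu-lock.
 * Returns true if the W bit was set by this call.
 */
static bool demo_fast_fix_write_protect(_Atomic uint64_t *sptep)
{
	uint64_t need = DEMO_SPTE_HOST_WRITABLE | DEMO_SPTE_MMU_WRITABLE;
	uint64_t old_spte = atomic_load(sptep);

	/* Both bits must be set, otherwise the slow path is needed. */
	if ((old_spte & need) != need)
		return false;

	/* Fails if the spte no longer holds the value read above. */
	return atomic_compare_exchange_strong(sptep, &old_spte,
					      old_spte | DEMO_SPTE_W);
}
```

The compare-and-exchange only succeeds if the spte still holds the value
that was read, so a concurrent change to either condition bit makes the
fast path give up instead of overwriting it.
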
But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may change, since we can only ensure that the
pfn is not changed during the cmpxchg. This is an ABA problem; for example,
the following case can happen:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|   gpte = gfn1                                                          |
|   gfn1 is mapped to pfn1 on host                                       |
|   spte is the shadow page table entry corresponding with gpte and      |
|   spte = pfn1                                                          |
+------------------------------------------------------------------------+
| On fast page fault path:                                               |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |   spte = 0;                       |
|                                    |                                   |
|                                    | pfn1 is reallocated for gfn2.     |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |   spte = pfn1;                    |
+------------------------------------+-----------------------------------+
| ::                                                                     |
|                                                                        |
|   if (cmpxchg(spte, old_spte, old_spte+W))                             |
|       mark_page_dirty(vcpu->kvm, gfn1)                                 |
|       OOPS!!!                                                          |
+------------------------------------------------------------------------+

We dirty-log for gfn1, which means the write to gfn2 is lost in the
dirty bitmap.

For a direct sp, we can easily avoid this since the spte of a direct sp is
fixed to its gfn. For an indirect sp, we disable fast page fault for simplicity.

A solution for indirect sp could be to pin the gfn, for example via
kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg. After the pinning:

- We hold a reference on the pfn, which means the pfn cannot be freed and
  reused for another gfn.
- The pfn is writable and therefore it cannot be shared between different gfns
  by KSM.

Then, we can ensure the dirty bitmap is correctly set for a gfn.

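The ABA interleaving from the table can be replayed deterministically on a
single thread, which makes it easy to see why cmpxchg alone does not notice
it. This is a sketch with hypothetical names and a made-up spte encoding,
not KVM code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_W		1ull		/* hypothetical writable bit */
#define DEMO_PFN1	0x1000ull	/* hypothetical encoding of pfn1 */

/*
 * Replay the interleaving from the table on one thread.
 * Returns true if CPU 0's cmpxchg succeeds even though the spte was
 * zapped and re-created in between.
 */
static bool demo_aba_goes_unnoticed(void)
{
	_Atomic uint64_t spte = DEMO_PFN1;

	/* CPU 0: snapshot the spte. */
	uint64_t old_spte = atomic_load(&spte);

	/* CPU 1: pfn1 is swapped out, then reused for gfn2. */
	atomic_store(&spte, 0);
	atomic_store(&spte, DEMO_PFN1);

	/* CPU 0: the value matches again, so the cmpxchg succeeds. */
	return atomic_compare_exchange_strong(&spte, &old_spte,
					      old_spte | DEMO_W);
}
```

The function returns true: the comparison sees the same bit pattern and
cannot tell that the entry now belongs to a different gfn, which is why the
text above relies on the direct-sp property or on pinning rather than on
the cmpxchg itself.
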
2) Dirty bit tracking

In the original code, the spte can be updated quickly (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since the
Accessed bit and Dirty bit cannot be lost.

But this is no longer true with fast page fault, since the spte can be marked
writable between reading the spte and updating it, as in the following case:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|   spte.W = 0                                                           |
|   spte.Accessed = 1                                                    |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits()::   |                                   |
|                                    |                                   |
|  old_spte = *spte;                 |                                   |
|                                    |                                   |
|                                    |                                   |
|  /* 'if' condition is satisfied. */|                                   |
|  if (old_spte.Accessed == 1 &&     |                                   |
|      old_spte.W == 0)              |                                   |
|          spte = 0ull;              |                                   |
+------------------------------------+-----------------------------------+
|                                    | On fast page fault path::         |
|                                    |                                   |
|                                    |   spte.W = 1                      |
|                                    |                                   |
|                                    | Memory write on the spte::        |
|                                    |                                   |
|                                    |   spte.Dirty = 1                  |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|  else                              |                                   |
|    old_spte = xchg(spte, 0ull)     |                                   |
|  if (old_spte.Accessed == 1)       |                                   |
|    kvm_set_pfn_accessed(spte.pfn); |                                   |
|  if (old_spte.Dirty == 1)          |                                   |
|    kvm_set_pfn_dirty(spte.pfn);    |                                   |
|    OOPS!!!                         |                                   |
+------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

To avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated outside the mmu-lock (see spte_has_volatile_bits()); this
means the spte is always updated atomically in this case.

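The atomic update described above can be sketched as follows. The bit
positions and the ``demo_`` names are assumptions made for this example; the
point is only that an exchange returns the value that was actually
replaced, so a Dirty bit set concurrently is still observed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for illustration only. */
#define DEMO_SPTE_ACCESSED	(1ull << 5)
#define DEMO_SPTE_DIRTY		(1ull << 6)

struct demo_track_bits {
	bool accessed;
	bool dirty;
};

/*
 * Clear a "volatile" spte and report the Accessed/Dirty state it had
 * at the moment it was cleared, not at some earlier read.
 */
static struct demo_track_bits demo_clear_track_bits(_Atomic uint64_t *sptep)
{
	/* The exchange returns the value that was actually replaced. */
	uint64_t old_spte = atomic_exchange(sptep, 0);
	struct demo_track_bits bits = {
		.accessed = (old_spte & DEMO_SPTE_ACCESSED) != 0,
		.dirty = (old_spte & DEMO_SPTE_DIRTY) != 0,
	};

	return bits;
}
```

A plain read followed by a plain store would report the bits from the
earlier read and could miss a Dirty bit set in between, which is the
failure shown in the table.
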
3) Flushing TLBs due to spte updates

If the spte is updated from writable to read-only, we should flush all TLBs;
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached in a CPU's TLB.

As mentioned before, the spte can be updated to writable outside the mmu-lock
on the fast page fault path. To make the path easy to audit, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since this
is a common function to update the spte (present -> present).

Since the spte is "volatile" if it can be updated outside the mmu-lock, we
always update the spte atomically, and the race caused by fast page fault is
avoided. See the comments in spte_has_volatile_bits() and mmu_spte_update().

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, PTEs are tagged as A/D disabled (using ignored bits), and
when the KVM MMU notifier is called to track accesses to a page (via
kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
by clearing the RWX bits in the PTE and storing the original R & X bits in more
unused/ignored bits. When the VM tries to access the page later on, a fault is
generated and the fast page fault mechanism described above is used to
atomically restore the PTE to a Present state. The W bit is not saved when the
PTE is marked for access tracking and during restoration to the Present state,
the W bit is set depending on whether or not it was a write access. If it
wasn't, then the W bit will remain clear until a write access happens, at which
time it will be set using the Dirty tracking mechanism described above.

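The save and restore of the R and X bits can be sketched with plain bit
operations. The layout below (which bits are R/W/X and where the saved
copies live) is an assumption made for this example and does not reflect
the real EPT encoding; in KVM the restore is performed atomically through
the fast page fault path.

```c
#include <stdint.h>

/* Hypothetical layout, chosen for illustration only. */
#define DEMO_PTE_R		(1ull << 0)
#define DEMO_PTE_W		(1ull << 1)
#define DEMO_PTE_X		(1ull << 2)
#define DEMO_PTE_RWX		(DEMO_PTE_R | DEMO_PTE_W | DEMO_PTE_X)
#define DEMO_SAVED_SHIFT	54	/* assumed ignored bits */
#define DEMO_SAVED_MASK		((DEMO_PTE_R | DEMO_PTE_X) << DEMO_SAVED_SHIFT)

/* Clear RWX and stash the original R and X bits; W is not saved. */
static uint64_t demo_mark_for_access_track(uint64_t pte)
{
	uint64_t saved = (pte & (DEMO_PTE_R | DEMO_PTE_X)) << DEMO_SAVED_SHIFT;

	return (pte & ~DEMO_PTE_RWX) | saved;
}

/* Restore R and X; set W only if the faulting access was a write. */
static uint64_t demo_restore_access_track(uint64_t pte, int is_write)
{
	uint64_t restored = (pte & DEMO_SAVED_MASK) >> DEMO_SAVED_SHIFT;

	pte &= ~DEMO_SAVED_MASK;
	pte |= restored;
	if (is_write)
		pte |= DEMO_PTE_W;
	return pte;
}
```

After a read fault the restored PTE has R and X but not W, so a later write
still faults and goes through the write-protection case.
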
3. Reference
------------

:Name:      kvm_lock
:Type:      mutex
:Arch:      any
:Protects:  - vm_list

:Name:      kvm_count_lock
:Type:      raw_spinlock_t
:Arch:      any
:Protects:  - hardware virtualization enable/disable
:Comment:   'raw' because hardware enabling/disabling must be atomic with
            respect to migration.

:Name:      kvm_arch::tsc_write_lock
:Type:      raw_spinlock_t
:Arch:      x86
:Protects:  - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
            - tsc offset in vmcb
:Comment:   'raw' because updating the tsc offsets must not be preempted.

:Name:      kvm->mmu_lock
:Type:      spinlock_t or rwlock_t
:Arch:      any
:Protects:  - shadow page/shadow tlb entry
:Comment:   it is a spinlock since it is used in mmu notifier.

:Name:      kvm->srcu
:Type:      srcu lock
:Arch:      any
:Protects:  - kvm->memslots
            - kvm->buses
:Comment:   The srcu read lock must be held while accessing memslots (e.g.
            when using gfn_to_* functions) and while accessing in-kernel
            MMIO/PIO address->device structure mapping (kvm->buses).
            The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
            if it is needed by multiple functions.

:Name:      blocked_vcpu_on_cpu_lock
:Type:      spinlock_t
:Arch:      x86
:Protects:  blocked_vcpu_on_cpu
:Comment:   This is a per-CPU lock and it is used for VT-d posted-interrupts.
            When VT-d posted-interrupts are supported and the VM has assigned
            devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu,
            protected by blocked_vcpu_on_cpu_lock. When VT-d hardware issues
            a wakeup notification event because external interrupts from the
            assigned devices have arrived, we find the vCPU on the list and
            wake it up.