OpenCloudOS-Kernel/Documentation
Sean Christopherson 4777225ec8 KVM: Use dedicated mutex to protect kvm_usage_count to avoid deadlock
commit 44d17459626052a2390457e550a12cb973506b2f upstream.

Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
on x86 due to a chain of locks and SRCU synchronizations.  Translating the
below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
fairness of r/w semaphores).

    CPU0                     CPU1                     CPU2
1   lock(&kvm->slots_lock);
2                                                     lock(&vcpu->mutex);
3                                                     lock(&kvm->srcu);
4                            lock(cpu_hotplug_lock);
5                            lock(kvm_lock);
6                            lock(&kvm->slots_lock);
7                                                     lock(cpu_hotplug_lock);
8   sync(&kvm->srcu);

Note, there are likely more potential deadlocks in KVM x86, e.g. the same
pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
__kvmclock_cpufreq_notifier():

  cpuhp_cpufreq_online()
  |
  -> cpufreq_online()
     |
     -> cpufreq_gov_performance_limits()
        |
        -> __cpufreq_driver_target()
           |
           -> __target_index()
              |
              -> cpufreq_freq_transition_begin()
                 |
                 -> cpufreq_notify_transition()
                    |
                    -> ... __kvmclock_cpufreq_notifier()

But, actually triggering such deadlocks is beyond rare due to the
combination of dependencies and timings involved.  E.g. the cpufreq
notifier is only used on older CPUs without a constant TSC, mucking with
the NX hugepage mitigation while VMs are running is very uncommon, and
doing so while also onlining/offlining a CPU (necessary to generate
contention on cpu_hotplug_lock) would be even more unusual.

The most robust solution to the general cpu_hotplug_lock issue is likely
to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
notifier doesn't to take kvm_lock.  For now, settle for fixing the most
blatant deadlock, as switching to an RCU-protected list is a much more
involved change, but add a comment in locking.rst to call out that care
needs to be taken when walking holding kvm_lock and walking vm_list.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S         O
  ------------------------------------------------------
  tee/35048 is trying to acquire lock:
  ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]

  but task is already holding lock:
  ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]

  which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

  -> #3 (kvm_lock){+.+.}-{3:3}:
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         kvm_dev_ioctl+0x4fb/0xe50 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #2 (cpu_hotplug_lock){++++}-{0:0}:
         cpus_read_lock+0x2e/0xb0
         static_key_slow_inc+0x16/0x30
         kvm_lapic_set_base+0x6a/0x1c0 [kvm]
         kvm_set_apic_base+0x8f/0xe0 [kvm]
         kvm_set_msr_common+0x9ae/0xf80 [kvm]
         vmx_set_msr+0xa54/0xbe0 [kvm_intel]
         __kvm_set_msr+0xb6/0x1a0 [kvm]
         kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
         kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #1 (&kvm->srcu){.+.+}-{0:0}:
         __synchronize_srcu+0x44/0x1a0
         synchronize_srcu_expedited+0x21/0x30
         kvm_swap_active_memslots+0x110/0x1c0 [kvm]
         kvm_set_memslot+0x360/0x620 [kvm]
         __kvm_set_memory_region+0x27b/0x300 [kvm]
         kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
         kvm_vm_ioctl+0x295/0x650 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #0 (&kvm->slots_lock){+.+.}-{3:3}:
         __lock_acquire+0x15ef/0x2e30
         lock_acquire+0xe0/0x260
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         set_nx_huge_pages+0x179/0x1e0 [kvm]
         param_attr_store+0x93/0x100
         module_attr_store+0x22/0x40
         sysfs_kf_write+0x81/0xb0
         kernfs_fop_write_iter+0x133/0x1d0
         vfs_write+0x28d/0x380
         ksys_write+0x70/0xe0
         __x64_sys_write+0x1f/0x30
         x64_sys_call+0x281b/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: Chao Gao <chao.gao@intel.com>
Fixes: 0bf50497f0 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
Cc: stable@vger.kernel.org
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-04 16:29:47 +02:00
..
ABI ABI: testing: fix admv8818 attr description 2024-10-04 16:29:39 +02:00
PCI Merge branch 'pci/misc' 2023-08-29 11:03:57 -05:00
RCU
accel
accounting docs: psi: use correct config name 2023-07-31 09:54:17 -06:00
admin-guide smb3: fix setting SecurityFlags when encryption is required 2024-08-14 13:58:59 +02:00
arch arm64: errata: Expand speculative SSBS workaround (again) 2024-08-14 13:58:49 +02:00
block Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
bpf bpf: Replace bpf_lpm_trie_key 0-length array with flexible array 2024-08-19 06:04:27 +02:00
cdrom scsi: sr: Fix unintentional arithmetic wraparound 2024-07-25 09:50:40 +02:00
core-api workqueue: doc: Fix function and sysfs path errors 2023-10-12 07:27:22 -10:00
cpu-freq
crypto
dev-tools LoongArch changes for v6.6 2023-09-08 12:16:52 -07:00
devicetree dt-bindings: iio: asahi-kasei,ak8975: drop incorrect AK09116 compatible 2024-10-04 16:29:39 +02:00
doc-guide
driver-api ipmi: docs: don't advertise deprecated sysfs entries 2024-10-04 16:29:12 +02:00
fault-injection Documentation: Fix typos 2023-08-18 11:29:03 -06:00
fb Documentation: Fix typos 2023-08-18 11:29:03 -06:00
features LoongArch changes for v6.6 2023-09-08 12:16:52 -07:00
filesystems f2fs: deprecate io_bits 2024-06-12 11:12:29 +02:00
firmware-guide Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
firmware_class
fpga
gpu drm: Allow drivers to indicate the damage helpers to ignore damage clips 2024-01-31 16:19:08 -08:00
hid HID: Add introduction about HID for non-kernel programmers 2023-08-07 13:24:36 +02:00
hwmon hwmon: corsair-psu: add USB id of HX1200i Series 2023 psu 2024-08-14 13:58:41 +02:00
i2c i2c: i801: Add support for Intel Birch Stream SoC 2023-11-28 17:19:46 +00:00
iio
images
infiniband
input input: docs: pxrc: remove reference to phoenix-sim 2023-08-28 12:43:32 -06:00
isdn
kbuild kbuild: doc: Update default INSTALL_MOD_DIR from extra to updates 2024-07-05 09:33:56 +02:00
kernel-hacking
leds
litmus-tests
livepatch Documentation: Fix typos 2023-08-18 11:29:03 -06:00
locking hwspinlock: Introduce hwspin_lock_bust() 2024-09-08 07:54:43 +02:00
maintainer Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
mhi
misc-devices
mm mm/page_table_check: support userfault wr-protect entries 2024-08-19 06:04:29 +02:00
netlabel
netlink netlink: specs: devlink: fix reply command values 2023-10-13 17:27:27 -07:00
networking net: ena: Move XDP code to its new files 2024-04-17 11:19:32 +02:00
nvdimm
nvme
pcmcia
peci
power Documentation: Fix typos 2023-08-18 11:29:03 -06:00
powerpc docs: kernel_feat.py: fix potential command injection 2024-01-31 16:18:46 -08:00
process rust: upgrade to Rust 1.73.0 2024-02-16 19:10:43 +01:00
riscv docs: kernel_feat.py: fix potential command injection 2024-01-31 16:18:46 -08:00
rust docs: rust: update Rust docs output path 2023-10-19 16:39:03 +02:00
scheduler Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
scsi SCSI misc on 20230902 2023-09-02 12:02:41 -07:00
security Documentation: Fix typos 2023-08-18 11:29:03 -06:00
sound ASoC: doc: Fix undefined SND_SOC_DAPM_NOPM argument 2024-02-05 20:14:26 +00:00
sphinx docs: kernel_include.py: Cope with docutils 0.21 2024-05-25 16:22:55 +02:00
sphinx-static
spi Documentation: Fix typos 2023-08-18 11:29:03 -06:00
staging
target
timers
tools rtla: fix a example in rtla-timerlat-hist.rst 2023-09-22 14:44:04 +02:00
trace Documentation: probes: Add a new ret_ip callback parameter 2023-10-17 10:21:45 +09:00
translations docs: kernel_feat.py: fix potential command injection 2024-01-31 16:18:46 -08:00
usb USB / Thunderbolt / PHY driver update for 6.6-rc1 2023-09-01 09:23:34 -07:00
userspace-api media: mc: Expand MUST_CONNECT flag to always require an enabled link 2024-04-03 15:28:17 +02:00
virt KVM: Use dedicated mutex to protect kvm_usage_count to avoid deadlock 2024-10-04 16:29:47 +02:00
w1 Documentation: Fix typos 2023-08-18 11:29:03 -06:00
watchdog Documentation: Fix typos 2023-08-18 11:29:03 -06:00
wmi Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
.gitignore
Changes
CodingStyle
Kconfig
Makefile docs: Integrate rustdoc generation into htmldocs 2023-07-21 15:08:46 -06:00
SubmittingPatches
atomic_bitops.txt
atomic_t.txt
conf.py docs: Restore "smart quotes" for quotes 2024-04-03 15:28:22 +02:00
docutils.conf
dontdiff
index.rst
memory-barriers.txt
subsystem-apis.rst docs: consolidate networking interfaces 2023-07-21 14:54:50 -06:00