2019-06-04 16:11:32 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0-only */
|
2007-12-16 17:02:48 +08:00
|
|
|
#ifndef __KVM_HOST_H
|
|
|
|
#define __KVM_HOST_H
|
[PATCH] kvm: userspace interface
web site: http://kvm.sourceforge.net
mailing list: kvm-devel@lists.sourceforge.net
(http://lists.sourceforge.net/lists/listinfo/kvm-devel)
The following patchset adds a driver for Intel's hardware virtualization
extensions to the x86 architecture. The driver adds a character device
(/dev/kvm) that exposes the virtualization capabilities to userspace. Using
this driver, a process can run a virtual machine (a "guest") in a fully
virtualized PC containing its own virtual hard disks, network adapters, and
display.
Using this driver, one can start multiple virtual machines on a host.
Each virtual machine is a process on the host; a virtual cpu is a thread in
that process. kill(1), nice(1), top(1) work as expected. In effect, the
driver adds a third execution mode to the existing two: we now have kernel
mode, user mode, and guest mode. Guest mode has its own address space mapping
guest physical memory (which is accessible to user mode by mmap()ing
/dev/kvm). Guest mode has no access to any I/O devices; any such access is
intercepted and directed to user mode for emulation.
The driver supports i386 and x86_64 hosts and guests. All combinations are
allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae
and non-pae paging modes are supported.
SMP hosts and UP guests are supported. At the moment only Intel
hardware is supported, but AMD virtualization support is being worked on.
Performance currently is non-stellar due to the naive implementation of the
mmu virtualization, which throws away most of the shadow page table entries
every context switch. We plan to address this in two ways:
- cache shadow page tables across tlb flushes
- wait until AMD and Intel release processors with nested page tables
Currently a virtual desktop is responsive but consumes a lot of CPU. Under
Windows I tried playing pinball and watching a few flash movies; with a recent
CPU one can hardly feel the virtualization. Linux/X is slower, probably due
to X being in a separate process.
In addition to the driver, you need a slightly modified qemu to provide I/O
device emulation and the BIOS.
Caveats (akpm: might no longer be true):
- The Windows install currently bluescreens due to a problem with the
virtual APIC. We are working on a fix. A temporary workaround is to
use an existing image or install through qemu
- Windows 64-bit does not work. That's also true for qemu, so it's
probably a problem with the device model.
[bero@arklinux.org: build fix]
[simon.kagstrom@bth.se: build fix, other fixes]
[uril@qumranet.com: KVM: Expose interrupt bitmap]
[akpm@osdl.org: i386 build fix]
[mingo@elte.hu: i386 fixes]
[rdreier@cisco.com: add log levels to all printks]
[randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings]
[anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support]
Signed-off-by: Yaniv Kamay <yaniv@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Simon Kagstrom <simon.kagstrom@bth.se>
Cc: Bernhard Rosenkraenzer <bero@arklinux.org>
Signed-off-by: Uri Lublin <uril@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Anthony Liguori <anthony@codemonkey.ws>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 18:21:36 +08:00
|
|
|
|
|
|
|
|
|
|
|
#include <linux/types.h>
|
2007-10-18 20:39:10 +08:00
|
|
|
#include <linux/hardirq.h>
|
2006-12-10 18:21:36 +08:00
|
|
|
#include <linux/list.h>
|
|
|
|
#include <linux/mutex.h>
|
|
|
|
#include <linux/spinlock.h>
|
2007-05-27 15:46:52 +08:00
|
|
|
#include <linux/signal.h>
|
|
|
|
#include <linux/sched.h>
|
2021-05-18 20:00:31 +08:00
|
|
|
#include <linux/sched/stat.h>
|
2011-11-24 09:12:59 +08:00
|
|
|
#include <linux/bug.h>
|
2021-02-22 10:45:22 +08:00
|
|
|
#include <linux/minmax.h>
|
2006-12-10 18:21:36 +08:00
|
|
|
#include <linux/mm.h>
|
2011-10-10 23:46:15 +08:00
|
|
|
#include <linux/mmu_notifier.h>
|
2007-07-11 23:17:21 +08:00
|
|
|
#include <linux/preempt.h>
|
2008-11-24 14:32:53 +08:00
|
|
|
#include <linux/msi.h>
|
2010-11-10 00:02:49 +08:00
|
|
|
#include <linux/slab.h>
|
2018-05-15 19:37:37 +08:00
|
|
|
#include <linux/vmalloc.h>
|
2010-11-19 01:09:08 +08:00
|
|
|
#include <linux/rcupdate.h>
|
2011-09-12 17:26:22 +08:00
|
|
|
#include <linux/ratelimit.h>
|
2012-08-03 15:39:59 +08:00
|
|
|
#include <linux/err.h>
|
2013-01-21 07:50:22 +08:00
|
|
|
#include <linux/irqflags.h>
|
2013-05-16 07:21:38 +08:00
|
|
|
#include <linux/context_tracking.h>
|
2015-09-18 22:29:43 +08:00
|
|
|
#include <linux/irqbypass.h>
|
2020-04-24 13:48:37 +08:00
|
|
|
#include <linux/rcuwait.h>
|
2017-02-20 19:06:21 +08:00
|
|
|
#include <linux/refcount.h>
|
2019-04-11 17:16:47 +08:00
|
|
|
#include <linux/nospec.h>
|
2021-06-06 10:10:44 +08:00
|
|
|
#include <linux/notifier.h>
|
kvm: add guest_state_{enter,exit}_irqoff()
When transitioning to/from guest mode, it is necessary to inform
lockdep, tracing, and RCU in a specific order, similar to the
requirements for transitions to/from user mode. Additionally, it is
necessary to perform vtime accounting for a window around running the
guest, with RCU enabled, such that timer interrupts taken from the guest
can be accounted as guest time.
Most architectures don't handle all the necessary pieces, and have a
number of common bugs, including unsafe usage of RCU during the window
between guest_enter() and guest_exit().
On x86, this was dealt with across commits:
87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
... but those fixes are specific to x86, and as the resulting logic
(while correct) is split across generic helper functions and
x86-specific helper functions, it is difficult to see that the
entry/exit accounting is balanced.
This patch adds generic helpers which architectures can use to handle
guest entry/exit consistently and correctly. The guest_{enter,exit}()
helpers are split into guest_timing_{enter,exit}() to perform vtime
accounting, and guest_context_{enter,exit}() to perform the necessary
context tracking and RCU management. The existing guest_{enter,exit}()
helpers are left as wrappers of these.
Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
helpers are added to handle the ordering of lockdep, tracing, and RCU
management. These are intended to mirror exit_to_user_mode() and
enter_from_user_mode().
Subsequent patches will migrate architectures over to the new helpers,
following a sequence:
guest_timing_enter_irqoff();
guest_state_enter_irqoff();
< run the vcpu >
guest_state_exit_irqoff();
< take any pending IRQs >
guest_timing_exit_irqoff();
This sequence handles all of the above correctly, and more clearly
balances the entry and exit portions, making it easier to understand.
The existing helpers are marked as deprecated, and will be removed once
all architectures have been converted.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Message-Id: <20220201132926.3301912-2-mark.rutland@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-01 21:29:22 +08:00
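The ordering described in the commit message above can be illustrated with a small userspace sketch. The six function names come from the commit message; their bodies here are hypothetical stand-ins that only record the order in which they run, and `run_vcpu()`, `take_pending_irqs()`, `record()` and `vcpu_enter_exit_sketch()` are names invented for this illustration, not kernel APIs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical trace buffer: each stub appends its name so the order is visible. */
static char trace[128];

static void record(const char *step)
{
    strncat(trace, step, sizeof(trace) - strlen(trace) - 1);
    strncat(trace, ";", sizeof(trace) - strlen(trace) - 1);
}

/* Stand-ins for the helpers named in the commit message (bodies are assumptions). */
static void guest_timing_enter_irqoff(void) { record("timing_enter"); }
static void guest_state_enter_irqoff(void)  { record("state_enter"); }
static void guest_state_exit_irqoff(void)   { record("state_exit"); }
static void guest_timing_exit_irqoff(void)  { record("timing_exit"); }

/* Placeholders for the "< run the vcpu >" and "< take any pending IRQs >" steps. */
static void run_vcpu(void)          { record("run"); }
static void take_pending_irqs(void) { record("irqs"); }

/* Timing calls form the outer pair, state calls the inner pair around the guest. */
const char *vcpu_enter_exit_sketch(void)
{
    trace[0] = '\0';
    guest_timing_enter_irqoff();
    guest_state_enter_irqoff();
    run_vcpu();
    guest_state_exit_irqoff();
    take_pending_irqs();
    guest_timing_exit_irqoff();
    return trace;
}
```

Nesting the state helpers inside the timing helpers is what lets interrupts taken right after the guest exits still fall inside the vtime accounting window.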
|
|
|
#include <linux/ftrace.h>
|
2021-12-07 03:54:27 +08:00
|
|
|
#include <linux/hashtable.h>
|
2022-02-01 21:29:22 +08:00
|
|
|
#include <linux/instrumentation.h>
|
2021-12-07 03:54:28 +08:00
|
|
|
#include <linux/interval_tree.h>
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array to
keep track of memslots.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-07 03:54:30 +08:00
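The "two sets, one copy of the data" idea from the commit message above can be sketched in a few lines. This is a deliberately simplified model: it uses plain pointer arrays where the commit uses tree-based structures, it ignores RCU and locking, and every name in it (`sketch_memslot`, `sketch_slot_set`, `sketch_vm`, `sketch_update_flags`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_SLOTS 4

/* Hypothetical, minimal memslot: just enough fields to show a flags change. */
struct sketch_memslot {
    unsigned long base_gfn;
    unsigned long npages;
    unsigned int flags;
};

/* A "set" holds only pointers; the memslot data itself lives elsewhere. */
struct sketch_slot_set {
    struct sketch_memslot *slots[SKETCH_MAX_SLOTS];
};

struct sketch_vm {
    struct sketch_slot_set sets[2];
    int active; /* index of the set the VM is currently running on */
};

/*
 * Change the flags of slot `id`. Only that one memslot is copied (into
 * caller-provided storage); all other slots stay shared by both sets.
 * Returns the replaced memslot so the caller can release it.
 */
struct sketch_memslot *sketch_update_flags(struct sketch_vm *vm, int id,
                                           struct sketch_memslot *copy,
                                           unsigned int flags)
{
    struct sketch_memslot *old = vm->sets[vm->active].slots[id];

    *copy = *old;
    copy->flags = flags;

    /* Work on the inactive set; the two sets are now desynchronized. */
    vm->sets[!vm->active].slots[id] = copy;

    /* Switch the VM over to the updated set. */
    vm->active = !vm->active;

    /* Resynchronize the set that has just become inactive. */
    vm->sets[!vm->active].slots[id] = copy;

    return old;
}
```

After the call both sets again point at the same memslot data, which matches the end state the commit message describes.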
|
|
|
#include <linux/rbtree.h>
|
2021-11-17 00:04:01 +08:00
|
|
|
#include <linux/xarray.h>
|
Detach sched.h from mm.h
The first thing mm.h does is include sched.h, solely for the can_do_mlock() inline
function, which dereferences "current". By dealing with can_do_mlock(),
mm.h can be detached from sched.h, which is good; see below for why.
This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
getting them indirectly
Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
they don't need sched.h
b) sched.h stops being a dependency for a significant number of files:
on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
after patch it's only 3744 (-8.3%).
Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-21 05:22:52 +08:00
|
|
|
#include <asm/signal.h>
|
2006-12-10 18:21:36 +08:00
|
|
|
|
|
|
|
#include <linux/kvm.h>
|
2007-02-19 20:37:47 +08:00
|
|
|
#include <linux/kvm_para.h>
|
2006-12-10 18:21:36 +08:00
|
|
|
|
2007-12-16 17:02:48 +08:00
|
|
|
#include <linux/kvm_types.h>
|
2007-12-04 05:30:23 +08:00
|
|
|
|
2007-12-16 17:02:48 +08:00
|
|
|
#include <asm/kvm_host.h>
|
2020-10-01 09:22:22 +08:00
|
|
|
#include <linux/kvm_dirty_ring.h>
|
2007-12-14 09:41:22 +08:00
|
|
|
|
2021-09-13 21:57:44 +08:00
|
|
|
#ifndef KVM_MAX_VCPU_IDS
|
|
|
|
#define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS
|
2016-05-10 00:13:37 +08:00
|
|
|
#endif
|
|
|
|
|
2012-08-21 10:58:45 +08:00
|
|
|
/*
|
2022-12-02 18:50:10 +08:00
|
|
|
* Bits 16 ~ 31 of kvm_userspace_memory_region::flags are used internally
|
|
|
|
* by KVM; the other bits are visible to userspace and are defined in
|
2012-08-21 10:58:45 +08:00
|
|
|
* include/linux/kvm.h.
|
|
|
|
*/
|
|
|
|
#define KVM_MEMSLOT_INVALID (1UL << 16)
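As a rough illustration of the split described in the comment above, the following sketch separates the internal flag range (bits 16 to 31) from the userspace-visible range. Only `KVM_MEMSLOT_INVALID` comes from the header; the mask macro and the two helper functions are hypothetical names introduced for this example and are not claimed to exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* From the header above. */
#define KVM_MEMSLOT_INVALID (1UL << 16)

/* Hypothetical mask covering bits 16..31, the range reserved for internal use. */
#define SKETCH_INTERNAL_FLAGS_MASK 0xffff0000UL

/* Hypothetical check: userspace-supplied flags must not touch internal bits. */
bool sketch_user_flags_ok(uint32_t flags)
{
    return (flags & SKETCH_INTERNAL_FLAGS_MASK) == 0;
}

/* Hypothetical check: has the slot been marked invalid internally? */
bool sketch_slot_is_invalid(uint32_t flags)
{
    return (flags & KVM_MEMSLOT_INVALID) != 0;
}
```

Keeping the internal bits in a disjoint range means a userspace-visible flag value can never be mistaken for an internal state marker.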
|
|
|
|
|
KVM: Explicitly define the "memslot update in-progress" bit
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing. Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.
Prior to commit 4bd518f1598d ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...
Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.
Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag. This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of an awkward extension to the generation
number.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-06 05:01:14 +08:00
|
|
|
/*
|
2019-02-06 05:01:18 +08:00
|
|
|
* Bit 63 of the memslot generation number is an "update in-progress flag",
|
2019-02-06 05:01:14 +08:00
|
|
|
* e.g. is temporarily set for the duration of install_new_memslots().
|
|
|
|
* This flag effectively creates a unique generation number that is used to
|
|
|
|
* mark cached memslot data, e.g. MMIO accesses, as potentially being stale,
|
|
|
|
* i.e. may (or may not) have come from the previous memslots generation.
|
|
|
|
*
|
|
|
|
* This is necessary because the actual memslots update is not atomic with
|
|
|
|
* respect to the generation number update. Updating the generation number
|
|
|
|
* first would allow a vCPU to cache a spte from the old memslots using the
|
|
|
|
* new generation number, and updating the generation number after switching
|
|
|
|
* to the new memslots would allow cache hits using the old generation number
|
|
|
|
* to reference the defunct memslots.
|
|
|
|
*
|
|
|
|
* This mechanism is used to prevent getting hits in KVM's caches while a
|
|
|
|
* memslot update is in-progress, and to prevent cache hits *after* updating
|
|
|
|
* the actual generation number against accesses that were inserted into the
|
|
|
|
* cache *before* the memslots were updated.
|
|
|
|
*/
|
2019-02-06 05:01:18 +08:00
|
|
|
#define KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS BIT_ULL(63)
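The comment above describes why a flag bit is needed on top of the generation counter. The following userspace sketch illustrates the idea under simplifying assumptions: a single address space (so a finished update advances the counter by one), and a hypothetical cache entry type. All `demo_*` names are made up for illustration and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit position as KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS, redefined for userspace. */
#define DEMO_GEN_UPDATE_IN_PROGRESS (1ULL << 63)

/* Hypothetical cached datum, tagged with the generation it was created under. */
struct demo_cache_entry {
	uint64_t gen;
	uint64_t data;
};

/* Start of an update: set the flag, leave the counter untouched. */
static inline uint64_t demo_begin_update(uint64_t gen)
{
	return gen | DEMO_GEN_UPDATE_IN_PROGRESS;
}

/* End of an update: clear the flag and move to the next generation. */
static inline uint64_t demo_end_update(uint64_t gen)
{
	return (gen & ~DEMO_GEN_UPDATE_IN_PROGRESS) + 1;
}

/* New entries drop the flag, so an entry created mid-update can never match. */
static inline struct demo_cache_entry demo_make_entry(uint64_t cur_gen, uint64_t data)
{
	struct demo_cache_entry e = {
		.gen = cur_gen & ~DEMO_GEN_UPDATE_IN_PROGRESS,
		.data = data,
	};
	return e;
}

/* A cache hit requires an exact match against the current generation. */
static inline bool demo_entry_is_valid(const struct demo_cache_entry *e, uint64_t cur_gen)
{
	return e->gen == cur_gen;
}
```

With this scheme an entry created before the update stops matching as soon as the flag is set, and an entry created while the flag is set matches neither the in-progress value nor the final generation.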
|
2019-02-06 05:01:14 +08:00
|
|
|
|
2012-10-24 14:07:59 +08:00
|
|
|
/* Two fragments for cross MMIO pages. */
|
|
|
|
#define KVM_MAX_MMIO_FRAGMENTS 2
|
2012-04-19 00:22:47 +08:00
|
|
|
|
2015-05-17 23:30:37 +08:00
|
|
|
#ifndef KVM_ADDRESS_SPACE_NUM
|
|
|
|
#define KVM_ADDRESS_SPACE_NUM 1
|
|
|
|
#endif
|
|
|
|
|
2012-08-03 15:43:51 +08:00
|
|
|
/*
|
|
|
|
* For the normal pfn, the highest 12 bits should be zero,
|
2012-10-16 20:10:59 +08:00
|
|
|
* so we can mask bit 62 ~ bit 52 to indicate the error pfn,
|
|
|
|
* mask bit 63 to indicate the noslot pfn.
|
2012-08-03 15:43:51 +08:00
|
|
|
*/
|
2012-10-16 20:10:59 +08:00
|
|
|
#define KVM_PFN_ERR_MASK (0x7ffULL << 52)
|
|
|
|
#define KVM_PFN_ERR_NOSLOT_MASK (0xfffULL << 52)
|
|
|
|
#define KVM_PFN_NOSLOT (0x1ULL << 63)
|
2012-08-03 15:43:51 +08:00
|
|
|
|
|
|
|
#define KVM_PFN_ERR_FAULT (KVM_PFN_ERR_MASK)
|
|
|
|
#define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
|
2012-10-16 20:10:59 +08:00
|
|
|
#define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
|
2022-10-12 03:58:07 +08:00
|
|
|
#define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
|
2012-08-03 15:37:54 +08:00
|
|
|
|
2012-10-16 20:10:59 +08:00
|
|
|
/*
|
|
|
|
* error pfns indicate that the gfn is in slot but failed to
|
|
|
|
* translate it to pfn on host.
|
|
|
|
*/
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent memory may exhaust available DRAM, introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 08:56:11 +08:00
|
|
|
static inline bool is_error_pfn(kvm_pfn_t pfn)
|
2012-08-03 15:39:59 +08:00
|
|
|
{
|
2012-08-03 15:43:51 +08:00
|
|
|
return !!(pfn & KVM_PFN_ERR_MASK);
|
2012-08-03 15:39:59 +08:00
|
|
|
}
|
|
|
|
|
2022-10-12 03:58:07 +08:00
|
|
|
/*
|
|
|
|
* KVM_PFN_ERR_SIGPENDING indicates that fetching the PFN was interrupted
|
|
|
|
* by a pending signal. Note, the signal may or may not be fatal.
|
|
|
|
*/
|
|
|
|
static inline bool is_sigpending_pfn(kvm_pfn_t pfn)
|
|
|
|
{
|
|
|
|
return pfn == KVM_PFN_ERR_SIGPENDING;
|
|
|
|
}
|
|
|
|
|
2012-10-16 20:10:59 +08:00
|
|
|
/*
|
|
|
|
* error_noslot pfns indicate that the gfn can not be
|
|
|
|
* translated to pfn - it is not in slot or failed to
|
|
|
|
* translate it to pfn.
|
|
|
|
*/
|
2016-01-16 08:56:11 +08:00
|
|
|
static inline bool is_error_noslot_pfn(kvm_pfn_t pfn)
|
2012-08-03 15:39:59 +08:00
|
|
|
{
|
2012-10-16 20:10:59 +08:00
|
|
|
return !!(pfn & KVM_PFN_ERR_NOSLOT_MASK);
|
2012-08-03 15:39:59 +08:00
|
|
|
}
|
|
|
|
|
2012-10-16 20:10:59 +08:00
|
|
|
/* noslot pfn indicates that the gfn is not in slot. */
|
2016-01-16 08:56:11 +08:00
|
|
|
static inline bool is_noslot_pfn(kvm_pfn_t pfn)
|
2012-08-03 15:39:59 +08:00
|
|
|
{
|
2012-10-16 20:10:59 +08:00
|
|
|
return pfn == KVM_PFN_NOSLOT;
|
2012-08-03 15:39:59 +08:00
|
|
|
}
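The three pfn predicates above differ only in which high bits they test. The sketch below copies the mask values into userspace so the encoding can be checked directly; `demo_pfn_t` stands in for `kvm_pfn_t` and every `demo_*` / `DEMO_*` name is an illustrative assumption, not a kernel symbol.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t demo_pfn_t;	/* stand-in for kvm_pfn_t */

/* Values copied from the KVM_PFN_* defines above. */
#define DEMO_PFN_ERR_MASK        (0x7ffULL << 52)	/* bits 62..52 */
#define DEMO_PFN_ERR_NOSLOT_MASK (0xfffULL << 52)	/* bits 63..52 */
#define DEMO_PFN_NOSLOT          (0x1ULL << 63)		/* bit 63 only */
#define DEMO_PFN_ERR_FAULT       (DEMO_PFN_ERR_MASK)
#define DEMO_PFN_ERR_HWPOISON    (DEMO_PFN_ERR_MASK + 1)

/* The gfn is in a slot, but translating it to a host pfn failed. */
static inline bool demo_is_error_pfn(demo_pfn_t pfn)
{
	return !!(pfn & DEMO_PFN_ERR_MASK);
}

/* Either kind of failure: translation error or no slot. */
static inline bool demo_is_error_noslot_pfn(demo_pfn_t pfn)
{
	return !!(pfn & DEMO_PFN_ERR_NOSLOT_MASK);
}

/* The gfn is not covered by any slot. */
static inline bool demo_is_noslot_pfn(demo_pfn_t pfn)
{
	return pfn == DEMO_PFN_NOSLOT;
}
```

Because the noslot value uses only bit 63, it is outside `DEMO_PFN_ERR_MASK`, so a noslot pfn is not reported as an error pfn, while both kinds are caught by the wider noslot mask.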
|
|
|
|
|
2013-07-26 21:04:07 +08:00
|
|
|
/*
|
|
|
|
* architectures with KVM_HVA_ERR_BAD other than PAGE_OFFSET (e.g. s390)
|
|
|
|
* provide own defines and kvm_is_error_hva
|
|
|
|
*/
|
|
|
|
#ifndef KVM_HVA_ERR_BAD
|
|
|
|
|
2012-08-21 11:02:22 +08:00
|
|
|
#define KVM_HVA_ERR_BAD (PAGE_OFFSET)
|
|
|
|
#define KVM_HVA_ERR_RO_BAD (PAGE_OFFSET + PAGE_SIZE)
|
2012-08-21 11:01:50 +08:00
|
|
|
|
|
|
|
static inline bool kvm_is_error_hva(unsigned long addr)
|
|
|
|
{
|
2012-08-21 11:02:22 +08:00
|
|
|
return addr >= PAGE_OFFSET;
|
2012-08-21 11:01:50 +08:00
|
|
|
}
|
|
|
|
|
2013-07-26 21:04:07 +08:00
|
|
|
#endif
|
|
|
|
|
2012-08-03 15:41:22 +08:00
|
|
|
#define KVM_ERR_PTR_BAD_PAGE (ERR_PTR(-ENOENT))
|
|
|
|
|
2012-08-03 15:43:51 +08:00
|
|
|
static inline bool is_error_page(struct page *page)
|
2012-08-03 15:41:22 +08:00
|
|
|
{
|
|
|
|
return IS_ERR(page);
|
|
|
|
}
|
|
|
|
|
2017-04-27 04:32:22 +08:00
|
|
|
#define KVM_REQUEST_MASK GENMASK(7,0)
|
|
|
|
#define KVM_REQUEST_NO_WAKEUP BIT(8)
|
2017-04-27 20:33:43 +08:00
|
|
|
#define KVM_REQUEST_WAIT BIT(9)
|
2022-02-24 00:53:02 +08:00
|
|
|
#define KVM_REQUEST_NO_ACTION BIT(10)
|
2007-06-08 00:18:30 +08:00
|
|
|
/*
|
2016-01-07 22:05:10 +08:00
|
|
|
* Architecture-independent vcpu->requests bit members
|
2022-09-21 08:32:01 +08:00
|
|
|
* Bits 4-7 are reserved for more arch-independent bits.
|
2007-06-08 00:18:30 +08:00
|
|
|
*/
|
2022-11-10 18:49:08 +08:00
|
|
|
#define KVM_REQ_TLB_FLUSH (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
|
|
|
|
#define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
|
|
|
|
#define KVM_REQ_UNBLOCK 2
|
|
|
|
#define KVM_REQ_DIRTY_RING_SOFT_FULL 3
|
|
|
|
#define KVM_REQUEST_ARCH_BASE 8
|
2017-06-04 20:43:51 +08:00
|
|
|
|
2022-02-24 00:53:02 +08:00
|
|
|
/*
|
|
|
|
* KVM_REQ_OUTSIDE_GUEST_MODE exists purely as a way to force the vCPU to
|
|
|
|
* OUTSIDE_GUEST_MODE. KVM_REQ_OUTSIDE_GUEST_MODE differs from a vCPU "kick"
|
|
|
|
* in that it ensures the vCPU has reached OUTSIDE_GUEST_MODE before continuing
|
|
|
|
* on. A kick only guarantees that the vCPU is on its way out, e.g. a previous
|
|
|
|
* kick may have set vcpu->mode to EXITING_GUEST_MODE, and so there's no
|
|
|
|
* guarantee the vCPU received an IPI and has actually exited guest mode.
|
|
|
|
*/
|
|
|
|
#define KVM_REQ_OUTSIDE_GUEST_MODE (KVM_REQUEST_NO_ACTION | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
|
|
|
|
|
2017-06-04 20:43:51 +08:00
|
|
|
#define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
|
2019-12-10 02:31:43 +08:00
|
|
|
BUILD_BUG_ON((unsigned)(nr) >= (sizeof_field(struct kvm_vcpu, requests) * 8) - KVM_REQUEST_ARCH_BASE); \
|
2017-06-04 20:43:51 +08:00
|
|
|
(unsigned)(((nr) + KVM_REQUEST_ARCH_BASE) | (flags)); \
|
|
|
|
})
|
|
|
|
#define KVM_ARCH_REQ(nr) KVM_ARCH_REQ_FLAGS(nr, 0)
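A request value packs a bit number into its low 8 bits (KVM_REQUEST_MASK) and behaviour flags into the bits above. The sketch below decodes such values in userspace. It assumes the low 8 bits are used as the bit index into vcpu->requests, omits the BUILD_BUG_ON range check, and uses hypothetical `demo_*` / `DEMO_*` names throughout.

```c
#include <stdbool.h>

#define DEMO_REQUEST_MASK      0xffu		/* GENMASK(7,0) */
#define DEMO_REQUEST_NO_WAKEUP (1u << 8)	/* BIT(8) */
#define DEMO_REQUEST_WAIT      (1u << 9)	/* BIT(9) */
#define DEMO_REQUEST_ARCH_BASE 8u

#define DEMO_REQ_TLB_FLUSH (0u | DEMO_REQUEST_WAIT | DEMO_REQUEST_NO_WAKEUP)
#define DEMO_REQ_UNBLOCK   2u

/* Simplified KVM_ARCH_REQ_FLAGS(): no compile-time range check. */
#define DEMO_ARCH_REQ_FLAGS(nr, flags) \
	((unsigned int)(((nr) + DEMO_REQUEST_ARCH_BASE) | (flags)))

/* Bit index encoded in the request value. */
static inline unsigned int demo_req_bit(unsigned int req)
{
	return req & DEMO_REQUEST_MASK;
}

/* True if the requester wants to wait for the target vCPUs. */
static inline bool demo_req_waits(unsigned int req)
{
	return !!(req & DEMO_REQUEST_WAIT);
}

/* True if the request should not wake a blocked vCPU. */
static inline bool demo_req_skips_wakeup(unsigned int req)
{
	return !!(req & DEMO_REQUEST_NO_WAKEUP);
}
```

Arch request number 3 with the wait flag, for example, decodes to bit 11 (3 + 8), which keeps arch-specific requests clear of the eight arch-independent bit positions.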
|
2016-01-07 22:00:53 +08:00
|
|
|
|
2021-07-03 06:04:24 +08:00
|
|
|
bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
|
2021-09-03 15:51:41 +08:00
|
|
|
unsigned long *vcpu_bitmap);
|
2021-07-03 06:04:24 +08:00
|
|
|
bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
|
|
|
|
bool kvm_make_all_cpus_request_except(struct kvm *kvm, unsigned int req,
|
|
|
|
struct kvm_vcpu *except);
|
|
|
|
bool kvm_make_cpus_request_mask(struct kvm *kvm, unsigned int req,
|
|
|
|
unsigned long *vcpu_bitmap);
|
|
|
|
|
2012-09-22 01:58:03 +08:00
|
|
|
#define KVM_USERSPACE_IRQ_SOURCE_ID 0
|
|
|
|
#define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID 1
|
2008-10-15 20:15:06 +08:00
|
|
|
|
2019-01-04 09:14:28 +08:00
|
|
|
extern struct mutex kvm_lock;
|
2013-04-06 03:20:30 +08:00
|
|
|
extern struct list_head vm_list;
|
|
|
|
|
2011-07-27 21:00:48 +08:00
|
|
|
struct kvm_io_range {
|
|
|
|
gpa_t addr;
|
|
|
|
int len;
|
|
|
|
struct kvm_io_device *dev;
|
|
|
|
};
|
|
|
|
|
2012-03-09 12:17:40 +08:00
|
|
|
#define NR_IOBUS_DEVS 1000
|
2012-03-09 12:17:32 +08:00
|
|
|
|
2007-06-01 02:08:53 +08:00
|
|
|
struct kvm_io_bus {
|
2013-05-25 06:44:15 +08:00
|
|
|
int dev_count;
|
|
|
|
int ioeventfd_count;
|
2012-03-09 12:17:32 +08:00
|
|
|
struct kvm_io_range range[];
|
2007-06-01 02:08:53 +08:00
|
|
|
};
|
|
|
|
|
2009-12-24 00:35:24 +08:00
|
|
|
enum kvm_bus {
|
|
|
|
KVM_MMIO_BUS,
|
|
|
|
KVM_PIO_BUS,
|
2013-02-28 19:33:19 +08:00
|
|
|
KVM_VIRTIO_CCW_NOTIFY_BUS,
|
KVM: VMX: speed up wildcard MMIO EVENTFD
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.
Unfortunately, this only works if userspace does not need to match on
access length and data. The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
- minimize overhead for old userspace that does not use eventfd with length = 0
- minimize disruption in other code (since we don't know the length,
devices on the MMIO bus only get a valid address in write, this
way we don't need to touch all devices to teach them to handle
an invalid length)
At the moment, this optimization only has effect for EPT on x86.
It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.
With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measurable slowdown to non-eventfd MMIO.
Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.
The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
pre-review and suggestions.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-01 02:50:44 +08:00
|
|
|
KVM_FAST_MMIO_BUS,
|
2009-12-24 00:35:24 +08:00
|
|
|
KVM_NR_BUSES
|
|
|
|
};
|
|
|
|
|
2015-03-26 22:39:28 +08:00
|
|
|
int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
|
2009-12-24 00:35:24 +08:00
|
|
|
int len, const void *val);
|
2015-03-26 22:39:28 +08:00
|
|
|
int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx,
|
|
|
|
gpa_t addr, int len, const void *val, long cookie);
|
|
|
|
int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
|
|
|
|
int len, void *val);
|
2011-07-27 21:00:48 +08:00
|
|
|
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
|
|
|
|
int len, struct kvm_io_device *dev);
|
2021-04-13 06:20:49 +08:00
|
|
|
int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
|
|
|
|
struct kvm_io_device *dev);
|
2016-07-15 19:43:26 +08:00
|
|
|
struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
|
|
|
|
gpa_t addr);
|
2007-06-01 02:08:53 +08:00
|
|
|
|
2010-10-14 17:22:46 +08:00
|
|
|
#ifdef CONFIG_KVM_ASYNC_PF
|
|
|
|
struct kvm_async_pf {
|
|
|
|
struct work_struct work;
|
|
|
|
struct list_head link;
|
|
|
|
struct list_head queue;
|
|
|
|
struct kvm_vcpu *vcpu;
|
|
|
|
struct mm_struct *mm;
|
2019-12-07 07:57:14 +08:00
|
|
|
gpa_t cr2_or_gpa;
|
2010-10-14 17:22:46 +08:00
|
|
|
unsigned long addr;
|
|
|
|
struct kvm_arch_async_pf arch;
|
2013-10-14 22:22:33 +08:00
|
|
|
bool wakeup_all;
|
2020-06-11 01:55:32 +08:00
|
|
|
bool notpresent_injected;
|
2010-10-14 17:22:46 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
|
|
|
|
void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
|
2020-06-15 20:13:34 +08:00
|
|
|
bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
|
|
|
|
unsigned long hva, struct kvm_arch_async_pf *arch);
|
2010-10-14 17:22:50 +08:00
|
|
|
int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
|
2010-10-14 17:22:46 +08:00
|
|
|
#endif
|
|
|
|
|
2021-03-26 10:19:47 +08:00
|
|
|
#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
|
KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-02 08:56:50 +08:00
|
|
|
struct kvm_gfn_range {
|
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
gfn_t start;
|
|
|
|
gfn_t end;
|
|
|
|
pte_t pte;
|
|
|
|
bool may_block;
|
|
|
|
};
|
|
|
|
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
|
|
|
|
bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
|
|
|
|
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
|
|
|
|
bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
|
2021-03-26 10:19:47 +08:00
|
|
|
#endif
|
|
|
|
|
2011-01-12 15:40:31 +08:00
|
|
|
enum {
|
|
|
|
OUTSIDE_GUEST_MODE,
|
|
|
|
IN_GUEST_MODE,
|
2012-05-14 20:44:06 +08:00
|
|
|
EXITING_GUEST_MODE,
|
|
|
|
READING_SHADOW_PAGE_TABLES,
|
2011-01-12 15:40:31 +08:00
|
|
|
};
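The vCPU mode values above are what a "kick" operates on. The sketch below shows one way to move a vCPU from IN_GUEST_MODE to EXITING_GUEST_MODE so that only one kicker sends the IPI. Using a compare-and-exchange here is an assumption made for illustration, written with C11 atomics in userspace; the `demo_*` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum demo_vcpu_mode {
	DEMO_OUTSIDE_GUEST_MODE,
	DEMO_IN_GUEST_MODE,
	DEMO_EXITING_GUEST_MODE,
	DEMO_READING_SHADOW_PAGE_TABLES,
};

/*
 * Try to move the vCPU from IN_GUEST_MODE to EXITING_GUEST_MODE.
 * Returns true only for the caller that performed the transition;
 * that caller would be the one responsible for sending the IPI.
 */
static inline bool demo_request_exit(atomic_int *mode)
{
	int expected = DEMO_IN_GUEST_MODE;

	return atomic_compare_exchange_strong(mode, &expected,
					      DEMO_EXITING_GUEST_MODE);
}
```

A second kicker that finds the mode already set to EXITING_GUEST_MODE gets `false` back and learns nothing about whether the vCPU has actually left the guest, which is the gap that KVM_REQ_OUTSIDE_GUEST_MODE is described as closing.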
|
|
|
|
|
2019-02-01 04:24:34 +08:00
|
|
|
#define KVM_UNMAPPED_PAGE ((void *) 0x500 + POISON_POINTER_DELTA)
|
|
|
|
|
|
|
|
struct kvm_host_map {
|
|
|
|
/*
|
|
|
|
* Only valid if the 'pfn' is managed by the host kernel (i.e. There is
|
|
|
|
* a 'struct page' for it. When using mem= kernel parameter some memory
|
|
|
|
* can be used as guest memory but they are not managed by host
|
|
|
|
* kernel).
|
|
|
|
* If 'pfn' is not managed by the host kernel, this field is
|
|
|
|
* initialized to KVM_UNMAPPED_PAGE.
|
|
|
|
*/
|
|
|
|
struct page *page;
|
|
|
|
void *hva;
|
|
|
|
kvm_pfn_t pfn;
|
|
|
|
kvm_pfn_t gfn;
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Used to check if the mapping is valid or not. Never use 'kvm_host_map'
|
|
|
|
* directly to check for that.
|
|
|
|
*/
|
|
|
|
static inline bool kvm_vcpu_mapped(struct kvm_host_map *map)
|
|
|
|
{
|
|
|
|
return !!map->hva;
|
|
|
|
}
|
|
|
|
|
2021-05-18 20:00:31 +08:00
|
|
|
static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
|
|
|
|
{
|
|
|
|
return single_task_running() && !need_resched() && ktime_before(cur, stop);
|
|
|
|
}
|
|
|
|
|
2012-04-19 00:22:47 +08:00
|
|
|
/*
|
|
|
|
* Sometimes a large or cross-page mmio needs to be broken up into separate
|
|
|
|
* exits for userspace servicing.
|
|
|
|
*/
|
|
|
|
struct kvm_mmio_fragment {
|
|
|
|
gpa_t gpa;
|
|
|
|
void *data;
|
|
|
|
unsigned len;
|
|
|
|
};
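To make the fragment idea concrete, the sketch below splits one access at page boundaries into at most two fragments, mirroring KVM_MAX_MMIO_FRAGMENTS. It is a userspace illustration only: the 4 KiB page size, the `demo_*` names, and the choice to return -1 when more fragments would be needed are all assumptions, not the kernel's actual splitting logic.

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE          4096u
#define DEMO_MAX_MMIO_FRAGMENTS 2

struct demo_mmio_fragment {
	uint64_t gpa;
	void *data;
	unsigned len;
};

/*
 * Split [gpa, gpa + len) at page boundaries. Returns the number of
 * fragments written, or -1 if the access would need more than
 * DEMO_MAX_MMIO_FRAGMENTS fragments.
 */
static inline int demo_split_mmio(uint64_t gpa, void *data, unsigned len,
				  struct demo_mmio_fragment frags[DEMO_MAX_MMIO_FRAGMENTS])
{
	unsigned char *p = data;
	int n = 0;

	while (len) {
		/* Bytes left in the page that contains gpa. */
		unsigned room = DEMO_PAGE_SIZE - (unsigned)(gpa & (DEMO_PAGE_SIZE - 1));
		unsigned chunk = len < room ? len : room;

		if (n == DEMO_MAX_MMIO_FRAGMENTS)
			return -1;

		frags[n].gpa = gpa;
		frags[n].data = p;
		frags[n].len = chunk;
		n++;

		gpa += chunk;
		p += chunk;
		len -= chunk;
	}
	return n;
}
```

An 8-byte access starting at 0x1ffc, for instance, yields one 4-byte fragment at 0x1ffc and one 4-byte fragment at 0x2000, each of which could then be serviced by a separate exit to userspace.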
|
|
|
|
|
2007-12-14 09:45:31 +08:00
|
|
|
struct kvm_vcpu {
|
|
|
|
struct kvm *kvm;
|
2008-01-29 07:42:34 +08:00
|
|
|
#ifdef CONFIG_PREEMPT_NOTIFIERS
|
2007-12-14 09:45:31 +08:00
|
|
|
struct preempt_notifier preempt_notifier;
|
2008-01-29 07:42:34 +08:00
|
|
|
#endif
|
2011-01-12 15:40:31 +08:00
|
|
|
int cpu;
|
2019-11-07 20:53:42 +08:00
|
|
|
int vcpu_id; /* id given by userspace at creation */
|
|
|
|
int vcpu_idx; /* index in kvm->vcpus array */
|
2022-04-15 08:43:43 +08:00
|
|
|
int ____srcu_idx; /* Don't use this directly. You've been warned. */
|
|
|
|
#ifdef CONFIG_PROVE_RCU
|
|
|
|
int srcu_depth;
|
|
|
|
#endif
|
2011-01-12 15:40:31 +08:00
|
|
|
int mode;
|
2018-07-10 17:27:19 +08:00
|
|
|
u64 requests;
|
2008-12-15 20:52:10 +08:00
|
|
|
unsigned long guest_debug;
|
2011-01-12 15:40:31 +08:00
|
|
|
|
|
|
|
struct mutex mutex;
|
|
|
|
struct kvm_run *run;
|
2009-12-24 00:35:25 +08:00
|
|
|
|
2021-10-09 10:11:57 +08:00
|
|
|
#ifndef __KVM_HAVE_ARCH_WQP
|
2020-04-24 13:48:37 +08:00
|
|
|
struct rcuwait wait;
|
2021-10-09 10:11:57 +08:00
|
|
|
#endif
|
2017-07-06 20:44:28 +08:00
|
|
|
struct pid __rcu *pid;
|
2007-12-14 09:45:31 +08:00
|
|
|
int sigset_active;
|
|
|
|
sigset_t sigset;
|
2015-09-03 22:07:37 +08:00
|
|
|
unsigned int halt_poll_ns;
|
2016-05-13 18:16:35 +08:00
|
|
|
bool valid_wakeup;
|
2007-12-14 09:45:31 +08:00
|
|
|
|
2007-10-20 15:34:38 +08:00
|
|
|
#ifdef CONFIG_HAS_IOMEM
|
2007-12-14 09:45:31 +08:00
|
|
|
int mmio_needed;
|
|
|
|
int mmio_read_completed;
|
|
|
|
int mmio_is_write;
|
2012-04-19 00:22:47 +08:00
|
|
|
int mmio_cur_fragment;
|
|
|
|
int mmio_nr_fragments;
|
|
|
|
struct kvm_mmio_fragment mmio_fragments[KVM_MAX_MMIO_FRAGMENTS];
|
2007-10-20 15:34:38 +08:00
|
|
|
#endif
|
2007-04-19 22:27:43 +08:00
|
|
|
|
2010-10-14 17:22:46 +08:00
|
|
|
#ifdef CONFIG_KVM_ASYNC_PF
|
|
|
|
struct {
|
|
|
|
u32 queued;
|
|
|
|
struct list_head queue;
|
|
|
|
struct list_head done;
|
|
|
|
spinlock_t lock;
|
|
|
|
} async_pf;
|
|
|
|
#endif
|
|
|
|
|
2012-07-18 21:37:46 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
|
|
|
|
/*
|
|
|
|
* Cpu relax intercept or pause loop exit optimization
|
|
|
|
* in_spin_loop: set when a vcpu does a pause loop exit
|
|
|
|
* or cpu relax intercepted.
|
|
|
|
* dy_eligible: indicates whether vcpu is eligible for directed yield.
|
|
|
|
*/
|
|
|
|
struct {
|
|
|
|
bool in_spin_loop;
|
|
|
|
bool dy_eligible;
|
|
|
|
} spin_loop;
|
|
|
|
#endif
|
2013-03-05 02:02:07 +08:00
|
|
|
bool preempted;
|
KVM: Boost vCPUs that are delivering interrupts
Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during
vcpu wakeup and interrupt delivery), we want to also boost not just
lock holders but also vCPUs that are delivering interrupts. Most
smp_call_function_many calls are synchronous, so the IPI target vCPUs
are also good yield candidates. This patch introduces vcpu->ready to
boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
(VT-d PI handles voluntary preemption separately, in pi_pre_block).
Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
vanilla boosting improved
1VM 21443 23520 9%
2VM 2800 8000 180%
3VM 1800 3100 72%
Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs,
one running ebizzy -M, the other running 'stress --cpu 2':
w/ boosting + w/o pv sched yield(vanilla)
vanilla boosting improved
1570 4000 155%
w/ boosting + w/ pv sched yield(vanilla)
vanilla boosting improved
1844 5157 179%
w/o boosting, perf top in VM:
72.33% [kernel] [k] smp_call_function_many
4.22% [kernel] [k] call_function_i
3.71% [kernel] [k] async_page_fault
w/ boosting, perf top in VM:
38.43% [kernel] [k] smp_call_function_many
6.31% [kernel] [k] async_page_fault
6.13% libc-2.23.so [.] __memcpy_avx_unaligned
4.88% [kernel] [k] call_function_interrupt
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-18 19:39:06 +08:00
|
|
|
bool ready;
|
2007-12-14 09:41:22 +08:00
|
|
|
struct kvm_vcpu_arch arch;
|
2021-06-19 06:27:06 +08:00
|
|
|
struct kvm_vcpu_stat stat;
|
|
|
|
char stats_id[KVM_STATS_NAME_SIZE];
|
2020-10-01 09:22:22 +08:00
|
|
|
struct kvm_dirty_ring dirty_ring;
|
2021-08-05 06:28:40 +08:00
|
|
|
|
|
|
|
/*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-07 03:54:30 +08:00
|
|
|
* The most recently used memslot by this vCPU and the slots generation
|
|
|
|
* for which it is valid.
|
|
|
|
* No wraparound protection is needed since generations won't overflow in
|
|
|
|
* thousands of years, even assuming 1M memslot operations per second.
|
2021-08-05 06:28:40 +08:00
|
|
|
*/
|
2021-12-07 03:54:30 +08:00
|
|
|
struct kvm_memory_slot *last_used_slot;
|
|
|
|
u64 last_used_slot_gen;
|
2007-12-14 09:41:22 +08:00
|
|
|
};
|
|
|
|
|
kvm: add guest_state_{enter,exit}_irqoff()
When transitioning to/from guest mode, it is necessary to inform
lockdep, tracing, and RCU in a specific order, similar to the
requirements for transitions to/from user mode. Additionally, it is
necessary to perform vtime accounting for a window around running the
guest, with RCU enabled, such that timer interrupts taken from the guest
can be accounted as guest time.
Most architectures don't handle all the necessary pieces, and have a
number of common bugs, including unsafe usage of RCU during the window
between guest_enter() and guest_exit().
On x86, this was dealt with across commits:
87fa7f3e98a1310e ("x86/kvm: Move context tracking where it belongs")
0642391e2139a2c1 ("x86/kvm/vmx: Add hardirq tracing to guest enter/exit")
9fc975e9efd03e57 ("x86/kvm/svm: Add hardirq tracing on guest enter/exit")
3ebccdf373c21d86 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
135961e0a7d555fc ("x86/kvm/svm: Move guest enter/exit into .noinstr.text")
160457140187c5fb ("KVM: x86: Defer vtime accounting 'til after IRQ handling")
bc908e091b326467 ("KVM: x86: Consolidate guest enter/exit logic to common helpers")
... but those fixes are specific to x86, and as the resulting logic
(while correct) is split across generic helper functions and
x86-specific helper functions, it is difficult to see that the
entry/exit accounting is balanced.
This patch adds generic helpers which architectures can use to handle
guest entry/exit consistently and correctly. The guest_{enter,exit}()
helpers are split into guest_timing_{enter,exit}() to perform vtime
accounting, and guest_context_{enter,exit}() to perform the necessary
context tracking and RCU management. The existing guest_{enter,exit}()
helpers are left as wrappers around these.
Atop this, new guest_state_enter_irqoff() and guest_state_exit_irqoff()
helpers are added to handle the ordering of lockdep, tracing, and RCU
management. These are intended to mirror exit_to_user_mode() and
enter_from_user_mode().
Subsequent patches will migrate architectures over to the new helpers,
following a sequence:
guest_timing_enter_irqoff();
guest_state_enter_irqoff();
< run the vcpu >
guest_state_exit_irqoff();
< take any pending IRQs >
guest_timing_exit_irqoff();
This sequence handles all of the above correctly, and more clearly
balances the entry and exit portions, making it easier to understand.
The existing helpers are marked as deprecated, and will be removed once
all architectures have been converted.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Message-Id: <20220201132926.3301912-2-mark.rutland@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
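The nesting of the sequence in the commit message above can be made explicit with a small userspace mock. This is a sketch, not kernel code: every `toy_*` function is a hypothetical stand-in that only records a one-letter tag, so the call order can be checked. It shows that the timing helpers bracket the state helpers, and that pending IRQs are taken after the state exit but before the timing exit, which matches the message's goal of accounting timer interrupts taken from the guest as guest time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins only: each stub appends a one-letter tag to a log. */
static char toy_log[8];
static size_t toy_len;

static void toy_note(char tag)
{
	assert(toy_len < sizeof(toy_log) - 1);
	toy_log[toy_len++] = tag;
	toy_log[toy_len] = '\0';
}

static void toy_guest_timing_enter_irqoff(void) { toy_note('T'); }
static void toy_guest_state_enter_irqoff(void)  { toy_note('S'); }
static void toy_run_vcpu(void)                  { toy_note('G'); }
static void toy_guest_state_exit_irqoff(void)   { toy_note('s'); }
static void toy_take_pending_irqs(void)         { toy_note('I'); }
static void toy_guest_timing_exit_irqoff(void)  { toy_note('t'); }

/* Hypothetical arch-side run step following the documented order. */
static void toy_vcpu_run_once(void)
{
	toy_guest_timing_enter_irqoff();	/* outermost: vtime window opens */
	toy_guest_state_enter_irqoff();		/* lockdep, tracing, RCU */
	toy_run_vcpu();
	toy_guest_state_exit_irqoff();		/* RCU, tracing, lockdep again usable */
	toy_take_pending_irqs();		/* still inside the vtime window */
	toy_guest_timing_exit_irqoff();		/* outermost: vtime window closes */
}
```

After one call to `toy_vcpu_run_once()`, `toy_log` holds the tags in call order, with each enter tag matched by its exit tag in reverse.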
2022-02-01 21:29:22 +08:00
|
|
|
/*
|
|
|
|
* Start accounting time towards a guest.
|
|
|
|
* Must be called before entering guest context.
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_timing_enter_irqoff(void)
|
2021-05-05 08:27:34 +08:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* This is running in ioctl context so it's safe to assume that it's the
|
|
|
|
* stime pending cputime to flush.
|
|
|
|
*/
|
|
|
|
instrumentation_begin();
|
|
|
|
vtime_account_guest_enter();
|
|
|
|
instrumentation_end();
|
2022-02-01 21:29:22 +08:00
|
|
|
}
|
2021-05-05 08:27:34 +08:00
|
|
|
|
2022-02-01 21:29:22 +08:00
|
|
|
/*
|
|
|
|
* Enter guest context and enter an RCU extended quiescent state.
|
|
|
|
*
|
|
|
|
* Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
|
|
|
|
* unsafe to use any code which may directly or indirectly use RCU, tracing
|
|
|
|
* (including IRQ flag tracing), or lockdep. All code in this period must be
|
|
|
|
* non-instrumentable.
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_context_enter_irqoff(void)
|
|
|
|
{
|
2021-05-05 08:27:34 +08:00
|
|
|
/*
|
|
|
|
* KVM does not hold any references to rcu protected data when it
|
|
|
|
* switches CPU into a guest mode. In fact switching to a guest mode
|
|
|
|
* is very similar to exiting to userspace from rcu point of view. In
|
|
|
|
* addition CPU may stay in a guest mode for quite a long time (up to
|
|
|
|
* one time slice). Let's treat guest mode as a quiescent state, just like
|
|
|
|
* we do with user-mode execution.
|
|
|
|
*/
|
|
|
|
if (!context_tracking_guest_enter()) {
|
|
|
|
instrumentation_begin();
|
2022-09-15 16:38:24 +08:00
|
|
|
rcu_virt_note_context_switch();
|
2021-05-05 08:27:34 +08:00
|
|
|
instrumentation_end();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-02-01 21:29:22 +08:00
|
|
|
/*
|
|
|
|
* Deprecated. Architectures should move to guest_timing_enter_irqoff() and
|
|
|
|
* guest_state_enter_irqoff().
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_enter_irqoff(void)
|
|
|
|
{
|
|
|
|
guest_timing_enter_irqoff();
|
|
|
|
guest_context_enter_irqoff();
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* guest_state_enter_irqoff - Fixup state when entering a guest
|
|
|
|
*
|
|
|
|
* Entry to a guest will enable interrupts, but the kernel state is interrupts
|
|
|
|
* disabled when this is invoked. Also tell RCU about it.
|
|
|
|
*
|
|
|
|
* 1) Trace interrupts on state
|
|
|
|
* 2) Invoke context tracking if enabled to adjust RCU state
|
|
|
|
* 3) Tell lockdep that interrupts are enabled
|
|
|
|
*
|
|
|
|
* Invoked from architecture specific code before entering a guest.
|
|
|
|
* Must be called with interrupts disabled and the caller must be
|
|
|
|
* non-instrumentable.
|
|
|
|
* The caller has to invoke guest_timing_enter_irqoff() before this.
|
|
|
|
*
|
|
|
|
* Note: this is analogous to exit_to_user_mode().
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_state_enter_irqoff(void)
|
|
|
|
{
|
|
|
|
instrumentation_begin();
|
|
|
|
trace_hardirqs_on_prepare();
|
2022-03-15 06:19:03 +08:00
|
|
|
lockdep_hardirqs_on_prepare();
|
2022-02-01 21:29:22 +08:00
|
|
|
instrumentation_end();
|
|
|
|
|
|
|
|
guest_context_enter_irqoff();
|
|
|
|
lockdep_hardirqs_on(CALLER_ADDR0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Exit guest context and exit an RCU extended quiescent state.
|
|
|
|
*
|
|
|
|
* Between guest_context_enter_irqoff() and guest_context_exit_irqoff() it is
|
|
|
|
* unsafe to use any code which may directly or indirectly use RCU, tracing
|
|
|
|
* (including IRQ flag tracing), or lockdep. All code in this period must be
|
|
|
|
* non-instrumentable.
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_context_exit_irqoff(void)
|
2021-05-05 08:27:34 +08:00
|
|
|
{
|
|
|
|
context_tracking_guest_exit();
|
2022-02-01 21:29:22 +08:00
|
|
|
}
|
2021-05-05 08:27:34 +08:00
|
|
|
|
2022-02-01 21:29:22 +08:00
|
|
|
/*
|
|
|
|
* Stop accounting time towards a guest.
|
|
|
|
* Must be called after exiting guest context.
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_timing_exit_irqoff(void)
|
|
|
|
{
|
2021-05-05 08:27:34 +08:00
|
|
|
instrumentation_begin();
|
|
|
|
/* Flush the guest cputime we spent on the guest */
|
|
|
|
vtime_account_guest_exit();
|
|
|
|
instrumentation_end();
|
|
|
|
}
|
|
|
|
|
2022-02-01 21:29:22 +08:00
|
|
|
/*
|
|
|
|
* Deprecated. Architectures should move to guest_state_exit_irqoff() and
|
|
|
|
* guest_timing_exit_irqoff().
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_exit_irqoff(void)
|
|
|
|
{
|
|
|
|
guest_context_exit_irqoff();
|
|
|
|
guest_timing_exit_irqoff();
|
|
|
|
}
|
|
|
|
|
2021-05-05 08:27:34 +08:00
|
|
|
static inline void guest_exit(void)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
local_irq_save(flags);
|
|
|
|
guest_exit_irqoff();
|
|
|
|
local_irq_restore(flags);
|
|
|
|
}
|
|
|
|
|
2022-02-01 21:29:22 +08:00
|
|
|
/**
|
|
|
|
* guest_state_exit_irqoff - Establish state when returning from guest mode
|
|
|
|
*
|
|
|
|
* Entry from a guest disables interrupts, but guest mode is traced as
|
|
|
|
* interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
|
|
|
|
*
|
|
|
|
* 1) Tell lockdep that interrupts are disabled
|
|
|
|
* 2) Invoke context tracking if enabled to reactivate RCU
|
|
|
|
* 3) Trace interrupts off state
|
|
|
|
*
|
|
|
|
* Invoked from architecture specific code after exiting a guest.
|
|
|
|
* Must be invoked with interrupts disabled and the caller must be
|
|
|
|
* non-instrumentable.
|
|
|
|
* The caller has to invoke guest_timing_exit_irqoff() after this.
|
|
|
|
*
|
|
|
|
* Note: this is analogous to enter_from_user_mode().
|
|
|
|
*/
|
|
|
|
static __always_inline void guest_state_exit_irqoff(void)
|
|
|
|
{
|
|
|
|
lockdep_hardirqs_off(CALLER_ADDR0);
|
|
|
|
guest_context_exit_irqoff();
|
|
|
|
|
|
|
|
instrumentation_begin();
|
|
|
|
trace_hardirqs_off_finish();
|
|
|
|
instrumentation_end();
|
|
|
|
}
|
|
|
|
|
2011-01-12 15:40:31 +08:00
static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
{
2017-04-27 04:32:24 +08:00
	/*
	 * The memory barrier ensures a previous write to vcpu->requests cannot
	 * be reordered with the read of vcpu->mode. It pairs with the general
	 * memory barrier following the write of vcpu->mode in VCPU RUN.
	 */
	smp_mb__before_atomic();
2011-01-12 15:40:31 +08:00
	return cmpxchg(&vcpu->mode, IN_GUEST_MODE, EXITING_GUEST_MODE);
}

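The compare-and-exchange in kvm_vcpu_exiting_guest_mode() can be approximated in user space with C11 atomics. This is only a sketch: `struct fake_vcpu` and `fake_vcpu_exiting_guest_mode()` are hypothetical names, the enum values are illustrative, and the default sequentially consistent ordering stands in for the kernel's explicit barrier rather than reproducing it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative mode values; only the names mirror the header. */
enum { OUTSIDE_GUEST_MODE, IN_GUEST_MODE, EXITING_GUEST_MODE };

/* Hypothetical stand-in for the one field of struct kvm_vcpu used here. */
struct fake_vcpu {
	atomic_int mode;
};

/*
 * Try to move the vCPU from IN_GUEST_MODE to EXITING_GUEST_MODE.
 * Like cmpxchg(), return the mode that was observed before the attempt.
 */
static int fake_vcpu_exiting_guest_mode(struct fake_vcpu *vcpu)
{
	int expected = IN_GUEST_MODE;

	assert(vcpu != NULL);
	/*
	 * On success 'expected' still holds IN_GUEST_MODE (the old value).
	 * On failure it is overwritten with the mode actually found.
	 */
	atomic_compare_exchange_strong(&vcpu->mode, &expected,
				       EXITING_GUEST_MODE);
	return expected;
}
```

A caller can compare the return value with IN_GUEST_MODE to learn whether this call performed the transition.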
2010-04-13 21:47:24 +08:00
/*
 * Some of the bitops functions do not support too long bitmaps.
 * This number must be determined not to exceed such limits.
 */
#define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)

KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of memslots.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
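The idea of one memslot object belonging to two sets at once can be sketched with plain linked lists. This is a simplified illustration, not the kernel code: `struct slot`, `struct slot_set` and the `set_*()` helpers are hypothetical, and singly linked lists stand in for the hash table and trees.

```c
#include <assert.h>
#include <stddef.h>

/* One link per slot set, mirroring the "two embedded nodes" idea. */
struct slot {
	int id;
	struct slot *next[2];
};

struct slot_set {
	struct slot *head;
	int node_idx;		/* which embedded link this set uses */
};

static void set_add(struct slot_set *set, struct slot *s)
{
	assert(set->node_idx == 0 || set->node_idx == 1);
	s->next[set->node_idx] = set->head;
	set->head = s;
}

static int set_count(const struct slot_set *set)
{
	int n = 0;

	for (const struct slot *s = set->head; s; s = s->next[set->node_idx])
		n++;
	return n;
}

static int set_contains(const struct slot_set *set, const struct slot *want)
{
	for (const struct slot *s = set->head; s; s = s->next[set->node_idx])
		if (s == want)
			return 1;
	return 0;
}

/* Swap 'old' for 'copy' in one set only; the other set still sees 'old'. */
static void set_replace(struct slot_set *set, struct slot *old,
			struct slot *copy)
{
	struct slot **pp = &set->head;

	while (*pp && *pp != old)
		pp = &(*pp)->next[set->node_idx];
	if (*pp) {
		copy->next[set->node_idx] = old->next[set->node_idx];
		*pp = copy;
	}
}
```

Because each set walks only its own link index, replacing a slot in the inactive set leaves the active set's view untouched, which is the temporary desynchronization the message describes.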
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-07 03:54:30 +08:00
/*
 * Since at idle each memslot belongs to two memslot sets it has to contain
 * two embedded nodes for each data structure that it forms a part of.
 *
 * Two memslot sets (one active and one inactive) are necessary so the VM
 * continues to run on one memslot set while the other is being modified.
 *
 * These two memslot sets normally point to the same set of memslots.
 * They can, however, be desynchronized when performing a memslot management
 * operation by replacing the memslot to be modified by its copy.
 * After the operation is complete, both memslot sets once again point to
 * the same, common set of memslot data.
 *
 * The memslots themselves are independent of each other so they can be
 * individually added or deleted.
 */
2006-12-10 18:21:36 +08:00
struct kvm_memory_slot {
2021-12-07 03:54:30 +08:00
	struct hlist_node id_node[2];
	struct interval_tree_node hva_node[2];
	struct rb_node gfn_node[2];
2006-12-10 18:21:36 +08:00
	gfn_t base_gfn;
	unsigned long npages;
	unsigned long *dirty_bitmap;
2012-02-08 12:02:18 +08:00
	struct kvm_arch_memory_slot arch;
2007-10-18 17:09:33 +08:00
	unsigned long userspace_addr;
2012-12-11 01:33:26 +08:00
	u32 flags;
2012-12-11 01:33:32 +08:00
	short id;
2020-10-15 02:26:46 +08:00
	u16 as_id;
2006-12-10 18:21:36 +08:00
};

2021-11-16 07:45:58 +08:00
static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
2020-10-01 09:22:26 +08:00
{
	return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
}

2010-04-12 18:35:35 +08:00
static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
{
	return ALIGN(memslot->npages, BITS_PER_LONG) / 8;
}

2018-05-01 00:33:24 +08:00
static inline unsigned long *kvm_second_dirty_bitmap(struct kvm_memory_slot *memslot)
{
	unsigned long len = kvm_dirty_bitmap_bytes(memslot);

	return memslot->dirty_bitmap + len / sizeof(*memslot->dirty_bitmap);
}

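The size arithmetic in the two dirty-bitmap helpers can be checked in user space. The sketch below assumes 8-bit bytes and that the allocation holds two bitmaps back to back, as the pointer arithmetic in kvm_second_dirty_bitmap() implies; `ALIGN_UP`, `dirty_bitmap_bytes()` and `second_dirty_bitmap()` are hypothetical stand-ins taking a page count instead of a memslot.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Round x up to a multiple of a, where a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/* One bit per page, rounded up to whole longs, expressed in bytes. */
static unsigned long dirty_bitmap_bytes(unsigned long npages)
{
	return ALIGN_UP(npages, BITS_PER_LONG) / 8;
}

/* The second bitmap starts right after the first, counted in longs. */
static unsigned long *second_dirty_bitmap(unsigned long *bitmap,
					  unsigned long npages)
{
	assert(bitmap != NULL);
	return bitmap + dirty_bitmap_bytes(npages) / sizeof(*bitmap);
}
```

For example, a slot with one page more than BITS_PER_LONG needs two longs for its first bitmap, so the second bitmap begins at index 2 of the array.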
2020-02-27 09:32:27 +08:00
#ifndef KVM_DIRTY_LOG_MANUAL_CAPS
#define KVM_DIRTY_LOG_MANUAL_CAPS KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE
#endif

2013-07-15 19:36:01 +08:00
struct kvm_s390_adapter_int {
	u64 ind_addr;
	u64 summary_addr;
	u64 ind_offset;
	u32 summary_offset;
	u32 adapter_id;
};

2015-11-10 20:36:34 +08:00
struct kvm_hv_sint {
	u32 vcpu;
	u32 sint;
};

2021-12-11 00:36:23 +08:00
struct kvm_xen_evtchn {
	u32 port;
2022-03-03 23:41:17 +08:00
	u32 vcpu_id;
	int vcpu_idx;
2021-12-11 00:36:23 +08:00
	u32 priority;
};

2008-11-19 19:58:46 +08:00
struct kvm_kernel_irq_routing_entry {
	u32 gsi;
2009-07-26 22:10:01 +08:00
	u32 type;
2009-02-04 23:28:14 +08:00
	int (*set)(struct kvm_kernel_irq_routing_entry *e,
2013-04-11 19:21:40 +08:00
		   struct kvm *kvm, int irq_source_id, int level,
		   bool line_status);
2008-11-19 19:58:46 +08:00
	union {
		struct {
			unsigned irqchip;
			unsigned pin;
		} irqchip;
2016-07-23 00:20:38 +08:00
		struct {
			u32 address_lo;
			u32 address_hi;
			u32 data;
			u32 flags;
			u32 devid;
		} msi;
2013-07-15 19:36:01 +08:00
		struct kvm_s390_adapter_int adapter;
2015-11-10 20:36:34 +08:00
		struct kvm_hv_sint hv_sint;
2021-12-11 00:36:23 +08:00
		struct kvm_xen_evtchn xen_evtchn;
2008-11-19 19:58:46 +08:00
	};
2009-08-24 16:54:20 +08:00
	struct hlist_node link;
};

2015-07-30 14:32:35 +08:00
#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
struct kvm_irq_routing_table {
	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
	u32 nr_rt_entries;
	/*
	 * Array indexed by gsi. Each entry contains list of irq chips
	 * the gsi is connected to.
	 */
2020-05-28 22:35:11 +08:00
	struct hlist_head map[];
2015-07-30 14:32:35 +08:00
};
#endif

2022-11-03 22:44:10 +08:00
bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
2015-07-30 14:32:35 +08:00
2022-08-16 20:53:21 +08:00
#ifndef KVM_INTERNAL_MEM_SLOTS
#define KVM_INTERNAL_MEM_SLOTS 0
2012-12-11 01:33:15 +08:00
#endif

2021-01-29 02:01:31 +08:00
#define KVM_MEM_SLOTS_NUM SHRT_MAX
2022-08-16 20:53:21 +08:00
#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
2011-11-24 17:37:48 +08:00
2015-05-17 23:30:37 +08:00
#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
{
	return 0;
}
#endif

2009-12-24 00:35:16 +08:00
struct kvm_memslots {
2010-10-18 21:22:23 +08:00
	u64 generation;
2021-12-07 03:54:30 +08:00
	atomic_long_t last_used_slot;
2021-12-07 03:54:28 +08:00
	struct rb_root_cached hva_tree;
2021-12-07 03:54:30 +08:00
	struct rb_root gfn_tree;
2021-12-07 03:54:27 +08:00
	/*
2021-12-07 03:54:30 +08:00
	 * The mapping table from slot id to memslot.
2021-12-07 03:54:27 +08:00
	 *
	 * 7-bit bucket count matches the size of the old id to index array for
	 * 512 slots, while giving good performance with this slot count.
	 * Higher bucket counts bring only small performance improvements but
	 * always result in higher memory usage (even for lower memslot counts).
	 */
	DECLARE_HASHTABLE(id_hash, 7);
2021-12-07 03:54:30 +08:00
	int node_idx;
2009-12-24 00:35:16 +08:00
};

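The "7-bit bucket count" remark in the id_hash comment can be worked through numerically. The sketch below rests on two assumptions that this excerpt does not confirm: each bucket head is a single pointer, and the old id-to-index array held one `short` per slot. On a typical 64-bit build (8-byte pointers, 2-byte `short`) both tables then come to 1024 bytes. All names are hypothetical, and the multiplicative hash is only an illustration of mapping a slot id to one of 128 buckets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ID_HASH_BITS    7
#define ID_HASH_BUCKETS (1u << ID_HASH_BITS)	/* 2^7 = 128 buckets */

/* Bytes used by the bucket array, assuming one pointer per bucket head. */
static size_t id_hash_bytes(void)
{
	return ID_HASH_BUCKETS * sizeof(void *);
}

/* Bytes used by an array of 'slots' short indices (assumed old layout). */
static size_t id_to_index_bytes(size_t slots)
{
	return slots * sizeof(short);
}

/* Illustrative multiplicative hash that keeps the top ID_HASH_BITS bits. */
static unsigned int id_bucket(int16_t slot_id)
{
	uint32_t h = (uint32_t)(uint16_t)slot_id * 0x61C88647u;
	unsigned int bucket = h >> (32 - ID_HASH_BITS);

	assert(bucket < ID_HASH_BUCKETS);
	return bucket;
}
```

With 512 slots spread over 128 buckets, the average chain is four entries long, which is consistent with the comment's claim that more buckets would cost memory for little gain.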
2006-12-10 18:21:36 +08:00
|
|
|
struct kvm {
|
2021-02-03 02:57:24 +08:00
|
|
|
#ifdef KVM_HAVE_MMU_RWLOCK
|
|
|
|
rwlock_t mmu_lock;
|
|
|
|
#else
|
2007-12-21 08:18:26 +08:00
|
|
|
spinlock_t mmu_lock;
|
2021-02-03 02:57:24 +08:00
|
|
|
#endif /* KVM_HAVE_MMU_RWLOCK */
|
|
|
|
|
2009-12-24 00:35:26 +08:00
|
|
|
struct mutex slots_lock;
|
2021-05-19 01:34:11 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Protects the arch-specific fields of struct kvm_memory_slots in
|
|
|
|
* use by the VM. To be used under the slots_lock (above) or in a
|
|
|
|
* kvm->srcu critical section where acquiring the slots_lock would
|
|
|
|
* lead to deadlock with the synchronize_srcu in
|
|
|
|
* install_new_memslots.
|
|
|
|
*/
|
|
|
|
struct mutex slots_arch_lock;
|
2007-11-21 22:41:05 +08:00
|
|
|
struct mm_struct *mm; /* userspace tied to this vm */
|
KVM: Require total number of memslot pages to fit in an unsigned long
Explicitly disallow creating more memslot pages than can fit in an
unsigned long, KVM doesn't correctly handle a total number of memslot
pages that doesn't fit in an unsigned long and remedying that would be a
waste of time.
For a 64-bit kernel, this is a nop as memslots are not allowed to overlap
in the gfn address space.
With a 32-bit kernel, userspace can at most address 3gb of virtual memory,
whereas wrapping the total number of pages would require 4tb+ of guest
physical memory. Even with x86's second address space for SMM, userspace
would need to alias all of guest memory more than one _thousand_ times.
And on older x86 hardware with MAXPHYADDR < 43, the guest couldn't
actually access any of those aliases even if userspace lied about
guest.MAXPHYADDR.
On s390 and arm64, this is a nop as they don't support 32-bit hosts.
On x86, practically speaking this is simply acknowledging reality as the
existing kvm_mmu_calculate_default_mmu_pages() assumes the total number
of pages fits in an "unsigned long".
On PPC, this is likely a nop as every flavor of PPC KVM assumes gfns (and
gpas!) fit in unsigned long. arch/powerpc/kvm/book3s_32_mmu_host.c goes
a step further and fails the build if CONFIG_PTE_64BIT=y, which
presumably means that it doesn't support 64-bit physical addresses.
On MIPS, this is also likely a nop as the core MMU helpers assume gpas
fit in unsigned long, e.g. see kvm_mips_##name##_pte.
And finally, RISC-V is a "don't care" as it doesn't exist in any release,
i.e. there is no established ABI to break.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <1c2c91baf8e78acccd4dad38da591002e61c013c.1638817638.git.maciej.szmigiero@oracle.com>
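The rule described in this commit message boils down to refusing a new memslot when adding its page count would wrap an unsigned long. A minimal sketch of that check, assuming hypothetical names (`vm_sketch`, `account_memslot_pages`) that are not taken from the kernel source:

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical stand-in for the page accounting kept in struct kvm. */
struct vm_sketch {
	unsigned long nr_memslot_pages;
};

/*
 * Add npages to the running total, refusing the update if the sum
 * would not fit in an unsigned long. The comparison is written so
 * that it cannot itself overflow.
 */
static int account_memslot_pages(struct vm_sketch *vm, unsigned long npages)
{
	if (npages > ULONG_MAX - vm->nr_memslot_pages)
		return -EINVAL;

	vm->nr_memslot_pages += npages;
	return 0;
}
```

A rejected request leaves the total untouched, so the caller can simply propagate the error.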
2021-12-07 03:54:07 +08:00
|
|
|
unsigned long nr_memslot_pages;
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
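The active/inactive scheme described in this commit message can be pictured with a much smaller model. The sketch below is an illustration only: it uses a fixed pointer array where the kernel uses tree-based structures, and every name in it (`slot_sketch`, `vm_slots_sketch`, `replace_slot`, `NR_SLOTS`) is hypothetical.

```c
#define NR_SLOTS 4	/* hypothetical fixed capacity, for illustration only */

struct slot_sketch {
	unsigned long base_gfn;
	unsigned long npages;
	unsigned int flags;
};

struct vm_slots_sketch {
	/* Two sets of pointers; both normally reference the same slot objects. */
	struct slot_sketch *sets[2][NR_SLOTS];
	int active;	/* index of the set that readers currently use */
};

/*
 * Replace slot `id` with `copy`: update the inactive set first, switch
 * readers over to it, then bring the previously active set back in sync.
 * Returns the old slot so the caller can free it.
 */
static struct slot_sketch *replace_slot(struct vm_slots_sketch *vm, int id,
					struct slot_sketch *copy)
{
	int inactive = !vm->active;
	struct slot_sketch *old = vm->sets[vm->active][id];

	vm->sets[inactive][id] = copy;		/* the two sets differ here */
	vm->active = inactive;			/* the kernel publishes this under SRCU */
	vm->sets[!vm->active][id] = copy;	/* both sets agree again */
	return old;
}
```

Only the one modified slot is copied; every other entry is shared by both sets throughout.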
2021-12-07 03:54:30 +08:00
|
|
|
/* The two memslot sets - active and inactive (per address space) */
|
|
|
|
struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
|
|
|
|
/* The current active memslot set for each address space */
|
2017-07-06 22:17:14 +08:00
|
|
|
struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
|
2021-11-17 00:04:01 +08:00
|
|
|
struct xarray vcpu_array;
|
2022-11-18 01:25:02 +08:00
|
|
|
/*
|
|
|
|
* Protected by slots_lock, but can be read outside if an
|
|
|
|
* incorrect answer is acceptable.
|
|
|
|
*/
|
|
|
|
atomic_t nr_memslots_dirty_logging;
|
2016-06-13 20:48:25 +08:00
|
|
|
|
KVM: Block memslot updates across range_start() and range_end()
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
notifications that are unrelated to KVM. Because mmu_notifier_count
must be modified while holding mmu_lock for write, and must always
be paired across start->end to stay balanced, lock elision must
happen in both or none. Therefore, in preparation for this change,
this patch prevents memslot updates across range_start() and range_end().
Note, technically flag-only memslot updates could be allowed in parallel,
but stalling a memslot update for a relatively short amount of time is
not a scalability issue, and this is all more than complex enough.
A long note on the locking: a previous version of the patch used an rwsem
to block the memslot update while the MMU notifier ran, but this resulted
in the following deadlock involving the pseudo-lock tagged as
"mmu_notifier_invalidate_range_start".
======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc3+ #6 Tainted: G OE
------------------------------------------------------
qemu-system-x86/3069 is trying to acquire lock:
ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
but task is already holding lock:
ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
which lock already depends on the new lock.
This corresponds to the following MMU notifier logic:
    invalidate_range_start
      take pseudo lock
      down_read() (*)
      release pseudo lock
    invalidate_range_end
      take pseudo lock (**)
      up_read()
      release pseudo lock
At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.
This could cause a deadlock (ignoring for a second that the pseudo lock
is not a lock):
- invalidate_range_start waits on down_read(), because the rwsem is
  held by install_new_memslots
- install_new_memslots waits on down_write(), because the rwsem is
  held till (another) invalidate_range_end finishes
- invalidate_range_end sits waiting on the pseudo lock, held by
  invalidate_range_start.
Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
it would change the *shared* rwsem readers into *shared recursive*
readers), so open-code the wait using a readers count and a
spinlock. This also allows handling blockable and non-blockable
critical sections in the same way.
Losing the rwsem fairness does theoretically allow MMU notifiers to
block install_new_memslots forever. Note that mm/mmu_notifier.c's own
retry scheme in mmu_interval_read_begin also uses wait/wake_up
and is likewise not fair.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
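The "readers count and a spinlock" idea from this commit message can be modelled in userspace. The sketch below is an assumption-laden analogue, not the kernel code: a pthread mutex stands in for the spinlock, the updater polls instead of sleeping on an rcuwait, and all names (`mn_sketch`, `sketch_invalidate_start`, and so on) are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace analogue of the mn_invalidate_* fields. */
struct mn_sketch {
	pthread_mutex_t lock;		/* stands in for mn_invalidate_lock */
	unsigned long active_count;	/* stands in for mn_active_invalidate_count */
};

/* An invalidation begins: count it as in flight. */
static void sketch_invalidate_start(struct mn_sketch *mn)
{
	pthread_mutex_lock(&mn->lock);
	mn->active_count++;
	pthread_mutex_unlock(&mn->lock);
}

/*
 * An invalidation ends. Returns true when it was the last one in
 * flight, which is the point where a waiting updater would be woken.
 */
static bool sketch_invalidate_end(struct mn_sketch *mn)
{
	bool wake;

	pthread_mutex_lock(&mn->lock);
	wake = (--mn->active_count == 0);
	pthread_mutex_unlock(&mn->lock);
	return wake;
}

/* A memslot update may proceed only while no invalidation is in flight. */
static bool sketch_update_may_proceed(struct mn_sketch *mn)
{
	bool ok;

	pthread_mutex_lock(&mn->lock);
	ok = (mn->active_count == 0);
	pthread_mutex_unlock(&mn->lock);
	return ok;
}
```

Because nothing queues the updater ahead of new invalidations, the scheme is unfair in the same way the commit message describes.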
2021-05-27 20:09:15 +08:00
|
|
|
/* Used to wait for completion of MMU notifiers. */
|
|
|
|
spinlock_t mn_invalidate_lock;
|
|
|
|
unsigned long mn_active_invalidate_count;
|
|
|
|
struct rcuwait mn_memslots_update_rcuwait;
|
|
|
|
|
KVM: Reinstate gfn_to_pfn_cache with invalidation support
This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.
For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.
Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask but I don't see a use case for
that additional complexity right now.
Invalidation happens from the invalidate_range_start MMU notifier, which
needs to be able to sleep in order to wake the vCPU and wait for it.
This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like the vCPU when it takes a page fault in that period, we
just spin — fixing that in a future patch by implementing an actual
*wait* may be another part of shaving this particularly hirsute yak.
As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-11 00:36:21 +08:00
|
|
|
/* For management / invalidation of gfn_to_pfn_caches */
|
|
|
|
spinlock_t gpc_lock;
|
|
|
|
struct list_head gpc_list;
|
|
|
|
|
2016-06-13 20:48:25 +08:00
|
|
|
/*
|
|
|
|
* created_vcpus is protected by kvm->lock, and is incremented
|
|
|
|
* at the beginning of KVM_CREATE_VCPU. online_vcpus is only
|
|
|
|
* incremented after storing the kvm_vcpu pointer in vcpus,
|
|
|
|
* and is accessed atomically.
|
|
|
|
*/
|
2009-06-09 20:56:28 +08:00
|
|
|
atomic_t online_vcpus;
|
2022-03-05 03:48:38 +08:00
|
|
|
int max_vcpus;
|
2016-06-13 20:48:25 +08:00
|
|
|
int created_vcpus;
|
2011-02-01 22:53:28 +08:00
|
|
|
int last_boosted_vcpu;
|
2007-02-12 16:54:44 +08:00
|
|
|
struct list_head vm_list;
|
2009-06-05 02:08:23 +08:00
|
|
|
struct mutex lock;
|
2017-07-07 16:51:38 +08:00
|
|
|
struct kvm_io_bus __rcu *buses[KVM_NR_BUSES];
|
2009-05-20 22:30:49 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_EVENTFD
|
|
|
|
struct {
|
|
|
|
spinlock_t lock;
|
|
|
|
struct list_head items;
|
2012-09-22 01:58:03 +08:00
|
|
|
struct list_head resampler_list;
|
|
|
|
struct mutex resampler_lock;
|
2009-05-20 22:30:49 +08:00
|
|
|
} irqfds;
|
KVM: add ioeventfd support
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.
Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.
However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to simply get our notification transmitted asynchronously and
return as quickly as possible. All the synchronous infrastructure to ensure
proper data-dependencies are met in the normal IO case is just unnecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.
Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and peripheral components.
To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().
We then wired up two paths to the doorbell: one goes through QEMU, via a
registered IO region and the doorbell ioctl(). The other is direct via
ioeventfd.
You can download this test harness here:
ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
The measured results are as follows:
qemu-mmio: 110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio: 367300 iops, 2.72us rtt
I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate: based on the data from the NULLIO runs of +2.56us for MMIO
and -350ns for HC, we get:
qemu-pio: 153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt
these are just for fun, for now, until I can gather more data.
Here is a graph for your convenience:
http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
The conclusion to draw is that we save about 4us by skipping the userspace
hop.
--------------------
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
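The consumer side of this mechanism is an ordinary eventfd, which can be exercised without KVM. The sketch below uses only eventfd(2), write(2) and read(2) on Linux; it does not register an ioeventfd, and the helper name `ring_and_drain` is hypothetical. It shows the counter semantics a listener would observe after several signals.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Signal a fresh eventfd `times` times, then read back the accumulated
 * count. Returns -1 on any error (including times == 0, where the
 * non-blocking read has nothing to return).
 */
static int64_t ring_and_drain(unsigned int times)
{
	uint64_t one = 1, count = 0;
	int fd = eventfd(0, EFD_NONBLOCK);

	if (fd < 0)
		return -1;

	for (unsigned int i = 0; i < times; i++) {
		if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
			close(fd);
			return -1;
		}
	}

	/* A single read returns the sum of all writes and resets the counter. */
	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
		close(fd);
		return -1;
	}

	close(fd);
	return (int64_t)count;
}
```

Signals coalesce into one counter value, which is why this kind of notification suits pure "kick" style IO that carries no data of its own.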
2009-07-08 05:08:49 +08:00
|
|
|
struct list_head ioeventfds;
|
2009-05-20 22:30:49 +08:00
|
|
|
#endif
|
2007-11-18 22:24:12 +08:00
|
|
|
struct kvm_vm_stat stat;
|
2007-12-14 09:54:20 +08:00
|
|
|
struct kvm_arch arch;
|
2017-02-20 19:06:21 +08:00
|
|
|
refcount_t users_count;
|
2017-03-31 19:53:23 +08:00
|
|
|
#ifdef CONFIG_KVM_MMIO
|
2008-05-30 22:05:54 +08:00
|
|
|
struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
|
2011-07-21 01:59:00 +08:00
|
|
|
spinlock_t ring_lock;
|
|
|
|
struct list_head coalesced_zones;
|
2008-05-30 22:05:54 +08:00
|
|
|
#endif
|
2008-07-25 22:24:52 +08:00
|
|
|
|
2009-06-05 02:08:23 +08:00
|
|
|
struct mutex irq_lock;
|
2009-01-04 23:10:50 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQCHIP
|
2010-11-19 01:09:08 +08:00
|
|
|
/*
|
2014-06-30 18:51:11 +08:00
|
|
|
* Update side is protected by irq_lock.
|
2010-11-19 01:09:08 +08:00
|
|
|
*/
|
2010-03-04 22:59:23 +08:00
|
|
|
struct kvm_irq_routing_table __rcu *irq_routing;
|
2014-08-06 20:24:45 +08:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQFD
|
2009-08-24 16:54:23 +08:00
|
|
|
struct hlist_head irq_ack_notifier_list;
|
2009-01-04 23:10:50 +08:00
|
|
|
#endif
|
|
|
|
|
2012-06-16 03:07:24 +08:00
|
|
|
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
|
2008-07-25 22:24:52 +08:00
|
|
|
struct mmu_notifier mmu_notifier;
|
2022-08-16 20:53:22 +08:00
|
|
|
unsigned long mmu_invalidate_seq;
|
|
|
|
long mmu_invalidate_in_progress;
|
|
|
|
unsigned long mmu_invalidate_range_start;
|
|
|
|
unsigned long mmu_invalidate_range_end;
|
2008-07-25 22:24:52 +08:00
|
|
|
#endif
|
2013-04-25 22:11:23 +08:00
|
|
|
struct list_head devices;
|
2020-02-27 09:32:27 +08:00
|
|
|
u64 manual_dirty_log_protect;
|
2016-05-18 19:26:23 +08:00
|
|
|
struct dentry *debugfs_dentry;
|
|
|
|
struct kvm_stat_data **debugfs_stat_data;
|
2017-04-21 08:30:06 +08:00
|
|
|
struct srcu_struct srcu;
|
|
|
|
struct srcu_struct irq_srcu;
|
2017-07-24 19:40:03 +08:00
|
|
|
pid_t userspace_pid;
|
2022-11-17 08:16:57 +08:00
|
|
|
bool override_halt_poll_ns;
|
2020-04-18 06:14:46 +08:00
|
|
|
unsigned int max_halt_poll_ns;
|
2020-10-01 09:22:22 +08:00
|
|
|
u32 dirty_ring_size;
|
2022-11-10 18:49:10 +08:00
|
|
|
bool dirty_ring_with_bitmap;
|
2021-07-03 06:04:23 +08:00
|
|
|
bool vm_bugged;
|
2021-11-11 23:13:38 +08:00
|
|
|
bool vm_dead;
|
2021-06-06 10:10:44 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
|
|
|
|
struct notifier_block pm_notifier;
|
|
|
|
#endif
|
2021-06-19 06:27:05 +08:00
|
|
|
char stats_id[KVM_STATS_NAME_SIZE];
|
2006-12-10 18:21:36 +08:00
|
|
|
};
|
|
|
|
|
KVM: Cleanup the kvm_print functions and introduce pr_XX wrappers
Introduces a couple of print functions, which are essentially wrappers
around standard printk functions, with a KVM: prefix.
Functions introduced or modified are:
- kvm_err(fmt, ...)
- kvm_info(fmt, ...)
- kvm_debug(fmt, ...)
- kvm_pr_unimpl(fmt, ...)
- pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...)
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
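The wrappers listed above all follow one pattern: paste a fixed prefix onto the caller's format string and forward the remaining arguments. A minimal userspace sketch of that pattern follows; `sketch_fmt` is a hypothetical name, and it formats into a buffer with snprintf() rather than calling the kernel's pr_*() helpers.

```c
#include <stdio.h>

/*
 * Hypothetical userspace analogue of kvm_err(): prepend a fixed prefix
 * and an id, then forward the caller's format string and arguments.
 * fmt must be a string literal so it can be concatenated with the
 * prefix. "## __VA_ARGS__" is a GNU extension that drops the preceding
 * comma when no variadic arguments are supplied.
 */
#define sketch_fmt(buf, len, id, fmt, ...) \
	snprintf(buf, len, "kvm [%i]: " fmt, id, ## __VA_ARGS__)
```

The macro evaluates to snprintf()'s return value, so a caller can still detect truncation.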
2012-06-04 02:17:48 +08:00
|
|
|
#define kvm_err(fmt, ...) \
|
|
|
|
pr_err("kvm [%i]: " fmt, task_pid_nr(current), ## __VA_ARGS__)
|
|
|
|
#define kvm_info(fmt, ...) \
|
|
|
|
pr_info("kvm [%i]: " fmt, task_pid_nr(current), ## __VA_ARGS__)
|
|
|
|
#define kvm_debug(fmt, ...) \
|
|
|
|
pr_debug("kvm [%i]: " fmt, task_pid_nr(current), ## __VA_ARGS__)
|
2016-11-15 14:36:18 +08:00
|
|
|
#define kvm_debug_ratelimited(fmt, ...) \
|
|
|
|
pr_debug_ratelimited("kvm [%i]: " fmt, task_pid_nr(current), \
|
|
|
|
## __VA_ARGS__)
|
2012-06-04 02:17:48 +08:00
|
|
|
#define kvm_pr_unimpl(fmt, ...) \
|
|
|
|
pr_err_ratelimited("kvm [%i]: " fmt, \
|
|
|
|
task_tgid_nr(current), ## __VA_ARGS__)
|
2007-08-01 08:48:02 +08:00
|
|
|
|
2012-06-04 02:17:48 +08:00
|
|
|
/* The guest did something we don't support. */
|
|
|
|
#define vcpu_unimpl(vcpu, fmt, ...) \
|
2015-11-21 02:52:12 +08:00
|
|
|
kvm_pr_unimpl("vcpu%i, guest rIP: 0x%lx " fmt, \
|
|
|
|
(vcpu)->vcpu_id, kvm_rip_read(vcpu), ## __VA_ARGS__)
|
2006-12-10 18:21:36 +08:00
|
|
|
|
2015-07-03 20:01:35 +08:00
|
|
|
#define vcpu_debug(vcpu, fmt, ...) \
|
|
|
|
kvm_debug("vcpu%i " fmt, (vcpu)->vcpu_id, ## __VA_ARGS__)
|
2016-11-15 14:36:18 +08:00
|
|
|
#define vcpu_debug_ratelimited(vcpu, fmt, ...) \
|
|
|
|
kvm_debug_ratelimited("vcpu%i " fmt, (vcpu)->vcpu_id, \
|
|
|
|
## __VA_ARGS__)
|
2015-12-01 00:22:20 +08:00
|
|
|
#define vcpu_err(vcpu, fmt, ...) \
|
|
|
|
kvm_err("vcpu%i " fmt, (vcpu)->vcpu_id, ## __VA_ARGS__)
|
2015-07-03 20:01:35 +08:00
|
|
|
|
2021-11-11 23:13:38 +08:00
|
|
|
static inline void kvm_vm_dead(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
kvm->vm_dead = true;
|
|
|
|
kvm_make_all_cpus_request(kvm, KVM_REQ_VM_DEAD);
|
|
|
|
}
|
|
|
|
|
2021-07-03 06:04:23 +08:00
|
|
|
static inline void kvm_vm_bugged(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
kvm->vm_bugged = true;
|
2021-11-11 23:13:38 +08:00
|
|
|
kvm_vm_dead(kvm);
|
2021-07-03 06:04:23 +08:00
|
|
|
}
|
|
|
|
|
2021-11-11 23:13:38 +08:00
|
|
|
|
2021-07-03 06:04:23 +08:00
|
|
|
#define KVM_BUG(cond, kvm, fmt...) \
|
|
|
|
({ \
|
|
|
|
int __ret = (cond); \
|
|
|
|
\
|
|
|
|
if (WARN_ONCE(__ret && !(kvm)->vm_bugged, fmt)) \
|
|
|
|
kvm_vm_bugged(kvm); \
|
|
|
|
unlikely(__ret); \
|
|
|
|
})
|
|
|
|
|
|
|
|
#define KVM_BUG_ON(cond, kvm) \
|
|
|
|
({ \
|
|
|
|
int __ret = (cond); \
|
|
|
|
\
|
|
|
|
if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged)) \
|
|
|
|
kvm_vm_bugged(kvm); \
|
|
|
|
unlikely(__ret); \
|
|
|
|
})
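KVM_BUG() and KVM_BUG_ON() rely on a GNU statement expression so that the condition is evaluated exactly once and the macro still yields a value usable in an `if`. The sketch below models that shape in userspace. It is a simplification under stated assumptions: a counter replaces WARN_ON_ONCE(), and the names `bug_vm_sketch` and `SKETCH_BUG_ON` are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the vm_bugged flag in struct kvm. */
struct bug_vm_sketch {
	bool bugged;
	int warnings;	/* counts how often a warning would have fired */
};

/*
 * Evaluate cond once. The first time it is true for a given VM, record
 * a warning and mark the VM as bugged; later hits stay silent. The
 * whole expression evaluates to the truth value of cond.
 */
#define SKETCH_BUG_ON(cond, vm)				\
({							\
	bool __ret = !!(cond);				\
							\
	if (__ret && !(vm)->bugged) {			\
		(vm)->warnings++;			\
		(vm)->bugged = true;			\
	}						\
	__ret;						\
})
```

A typical caller would write `if (SKETCH_BUG_ON(ptr == NULL, vm)) return -EIO;`, bailing out while the VM is flagged.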
|
|
|
|
|
2022-04-15 08:43:43 +08:00
|
|
|
static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_PROVE_RCU
|
|
|
|
WARN_ONCE(vcpu->srcu_depth++,
|
|
|
|
"KVM: Illegal vCPU srcu_idx LOCK, depth=%d", vcpu->srcu_depth - 1);
|
|
|
|
#endif
|
|
|
|
vcpu->____srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_vcpu_srcu_read_unlock(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->____srcu_idx);
|
|
|
|
|
|
|
|
#ifdef CONFIG_PROVE_RCU
|
|
|
|
WARN_ONCE(--vcpu->srcu_depth,
|
|
|
|
"KVM: Illegal vCPU srcu_idx UNLOCK, depth=%d", vcpu->srcu_depth);
|
|
|
|
#endif
|
|
|
|
}
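The CONFIG_PROVE_RCU checks in the two helpers above amount to a depth counter that must go 0 -> 1 on lock and 1 -> 0 on unlock. A small sketch of that bookkeeping, with hypothetical names and a counter in place of WARN_ONCE():

```c
/* Hypothetical stand-in for the srcu_depth tracking in struct kvm_vcpu. */
struct srcu_depth_sketch {
	int depth;
	int warnings;	/* counts how often a warning would have fired */
};

static void sketch_read_lock(struct srcu_depth_sketch *v)
{
	if (v->depth++)		/* non-zero before the increment: nested lock */
		v->warnings++;
}

static void sketch_read_unlock(struct srcu_depth_sketch *v)
{
	if (--v->depth)		/* non-zero after the decrement: unbalanced */
		v->warnings++;
}
```

A balanced lock/unlock pair leaves the warning count untouched; a nested lock trips the check on the way in and again on the first unlock.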
|
|
|
|
|
2020-02-27 09:32:27 +08:00
|
|
|
static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return !!(kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET);
|
|
|
|
}
|
|
|
|
|
2017-07-07 16:51:38 +08:00
|
|
|
static inline struct kvm_io_bus *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx)
|
|
|
|
{
|
|
|
|
return srcu_dereference_check(kvm->buses[idx], &kvm->srcu,
|
2017-08-02 23:55:54 +08:00
|
|
|
lockdep_is_held(&kvm->slots_lock) ||
|
|
|
|
!refcount_read(&kvm->users_count));
|
2017-07-07 16:51:38 +08:00
|
|
|
}
|
|
|
|
|
2009-06-09 20:56:29 +08:00
|
|
|
static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
|
|
|
|
{
|
2019-04-11 17:16:47 +08:00
|
|
|
int num_vcpus = atomic_read(&kvm->online_vcpus);
|
|
|
|
i = array_index_nospec(i, num_vcpus);
|
|
|
|
|
|
|
|
/* Pairs with smp_wmb() in kvm_vm_ioctl_create_vcpu. */
|
2009-06-09 20:56:29 +08:00
|
|
|
smp_rmb();
|
2021-11-17 00:04:01 +08:00
|
|
|
return xa_load(&kvm->vcpu_array, i);
|
2009-06-09 20:56:29 +08:00
|
|
|
}
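kvm_get_vcpu() clamps its index with array_index_nospec() so that an out-of-range value cannot be used for a speculative access. The sketch below illustrates the branch-free masking idea behind such a clamp. It is modelled on the generic approach rather than copied from any architecture's implementation, the function names are hypothetical, and it assumes that `size` is non-zero and at most LONG_MAX and that the compiler wraps out-of-range unsigned-to-signed conversions and shifts negative values arithmetically (true for GCC and Clang).

```c
#include <limits.h>

/*
 * All ones when index < size, zero otherwise, computed without a
 * branch: if either index or (size - 1 - index) has its top bit set,
 * the index is out of range.
 */
static unsigned long index_mask_sketch(unsigned long index, unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >>
		(sizeof(long) * CHAR_BIT - 1);
}

/* In-range indices pass through unchanged; anything else becomes 0. */
static unsigned long clamp_index_sketch(unsigned long index, unsigned long size)
{
	return index & index_mask_sketch(index, size);
}
```

Clamping to 0 rather than failing means the caller still needs its own bounds check for correctness; the mask only constrains what speculation can reach.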
|
|
|
|
|
2021-11-17 00:04:03 +08:00
|
|
|
#define kvm_for_each_vcpu(idx, vcpup, kvm) \
|
|
|
|
xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
|
|
|
|
(atomic_read(&kvm->online_vcpus) - 1))
|
2009-06-09 20:56:29 +08:00
|
|
|
|
2015-11-05 16:03:50 +08:00
|
|
|
static inline struct kvm_vcpu *kvm_get_vcpu_by_id(struct kvm *kvm, int id)
|
|
|
|
{
|
2016-05-10 00:11:54 +08:00
|
|
|
struct kvm_vcpu *vcpu = NULL;
|
2021-11-17 00:04:02 +08:00
|
|
|
unsigned long i;
|
2015-11-05 16:03:50 +08:00
|
|
|
|
2016-05-10 00:11:54 +08:00
|
|
|
if (id < 0)
|
2015-11-05 16:55:08 +08:00
|
|
|
return NULL;
|
2016-05-10 00:11:54 +08:00
|
|
|
if (id < KVM_MAX_VCPUS)
|
|
|
|
vcpu = kvm_get_vcpu(kvm, id);
|
2015-11-05 16:55:08 +08:00
|
|
|
if (vcpu && vcpu->vcpu_id == id)
|
|
|
|
return vcpu;
|
2015-11-05 16:03:50 +08:00
|
|
|
kvm_for_each_vcpu(i, vcpu, kvm)
|
|
|
|
if (vcpu->vcpu_id == id)
|
|
|
|
return vcpu;
|
|
|
|
return NULL;
|
|
|
|
}
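kvm_get_vcpu_by_id() first guesses that the vCPU with a given id sits at the index of the same number, and only falls back to a scan when that guess misses. A self-contained sketch of the same two-step lookup over a plain array follows; the names `vcpu_id_sketch` and `find_by_id` are hypothetical, and a pointer array replaces the xarray.

```c
#include <stddef.h>

struct vcpu_id_sketch {
	int vcpu_id;
};

static struct vcpu_id_sketch *find_by_id(struct vcpu_id_sketch **vcpus,
					 int online, int id)
{
	if (id < 0)
		return NULL;

	/* Fast path: vCPUs are often created in id order, so try index == id. */
	if (id < online && vcpus[id] && vcpus[id]->vcpu_id == id)
		return vcpus[id];

	/* Slow path: linear scan over every online vCPU. */
	for (int i = 0; i < online; i++)
		if (vcpus[i] && vcpus[i]->vcpu_id == id)
			return vcpus[i];

	return NULL;
}
```

The fast path costs one comparison when ids and indices line up, and the scan keeps the lookup correct when userspace assigns sparse or out-of-order ids.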
|
|
|
|
|
2021-11-17 00:03:57 +08:00
|
|
|
void kvm_destroy_vcpus(struct kvm *kvm);
|
2007-07-27 15:16:56 +08:00
|
|
|
|
2017-12-05 04:35:23 +08:00
|
|
|
void vcpu_load(struct kvm_vcpu *vcpu);
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into architecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interesting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-12 01:16:52 +08:00
|
|
|
void vcpu_put(struct kvm_vcpu *vcpu);
|
|
|
|
|
2014-11-20 20:45:31 +08:00
|
|
|
#ifdef __KVM_HAVE_IOAPIC
|
2017-04-07 16:50:33 +08:00
|
|
|
void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm);
|
2015-11-10 20:36:31 +08:00
|
|
|
void kvm_arch_post_irq_routing_update(struct kvm *kvm);
|
2014-11-20 20:45:31 +08:00
|
|
|
#else
|
2017-04-07 16:50:33 +08:00
|
|
|
static inline void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm)
|
2014-11-20 20:45:31 +08:00
|
|
|
{
|
|
|
|
}
|
2015-11-10 20:36:31 +08:00
|
|
|
static inline void kvm_arch_post_irq_routing_update(struct kvm *kvm)
|
2015-07-30 14:32:35 +08:00
|
|
|
{
|
|
|
|
}
|
2014-11-20 20:45:31 +08:00
|
|
|
#endif
|
|
|
|
|
2014-06-30 18:51:13 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQFD
|
2013-02-28 19:33:18 +08:00
|
|
|
int kvm_irqfd_init(void);
|
|
|
|
void kvm_irqfd_exit(void);
|
|
|
|
#else
|
|
|
|
static inline int kvm_irqfd_init(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_irqfd_exit(void)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
2010-04-28 20:39:01 +08:00
|
|
|
int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
|
2007-07-30 19:12:19 +08:00
|
|
|
struct module *module);
|
2007-11-14 20:39:31 +08:00
|
|
|
void kvm_exit(void);
|
2006-12-10 18:21:36 +08:00
|
|
|
|
2008-03-30 21:01:25 +08:00
|
|
|
void kvm_get_kvm(struct kvm *kvm);
|
2021-06-25 23:32:07 +08:00
|
|
|
bool kvm_get_kvm_safe(struct kvm *kvm);
|
2008-03-30 21:01:25 +08:00
|
|
|
void kvm_put_kvm(struct kvm *kvm);
|
2021-04-09 06:32:14 +08:00
|
|
|
bool file_is_kvm(struct file *file);
|
2019-10-22 06:58:42 +08:00
|
|
|
void kvm_put_kvm_no_destroy(struct kvm *kvm);
|
2008-03-30 21:01:25 +08:00
|
|
|
|
2015-05-17 23:30:37 +08:00
|
|
|
static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
|
2010-04-19 17:41:23 +08:00
|
|
|
{
|
2019-04-11 17:16:47 +08:00
|
|
|
as_id = array_index_nospec(as_id, KVM_ADDRESS_SPACE_NUM);
|
2017-07-07 21:49:00 +08:00
|
|
|
return srcu_dereference_check(kvm->memslots[as_id], &kvm->srcu,
|
2017-08-02 23:55:54 +08:00
|
|
|
lockdep_is_held(&kvm->slots_lock) ||
|
|
|
|
!refcount_read(&kvm->users_count));
|
2010-04-19 17:41:23 +08:00
|
|
|
}
|
|
|
|
|
2015-05-17 23:30:37 +08:00
|
|
|
static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return __kvm_memslots(kvm, 0);
|
|
|
|
}
|
|
|
|
|
2015-05-17 19:58:53 +08:00
|
|
|
static inline struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2015-05-17 23:30:37 +08:00
|
|
|
int as_id = kvm_arch_vcpu_memslots_id(vcpu);
|
|
|
|
|
|
|
|
return __kvm_memslots(vcpu->kvm, as_id);
|
2015-05-17 19:58:53 +08:00
|
|
|
}
|
|
|
|
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should also be kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-07 03:54:30 +08:00
|
|
|
static inline bool kvm_memslots_empty(struct kvm_memslots *slots)
|
|
|
|
{
|
|
|
|
return RB_EMPTY_ROOT(&slots->gfn_tree);
|
|
|
|
}
|
|
|
|
|
|
|
|
#define kvm_for_each_memslot(memslot, bkt, slots) \
|
|
|
|
hash_for_each(slots->id_hash, bkt, memslot, id_node[slots->node_idx]) \
|
|
|
|
if (WARN_ON_ONCE(!memslot->npages)) { \
|
|
|
|
} else
|
|
|
|
|
2020-02-19 05:07:31 +08:00
|
|
|
static inline
|
|
|
|
struct kvm_memory_slot *id_to_memslot(struct kvm_memslots *slots, int id)
|
2011-11-24 19:04:35 +08:00
|
|
|
{
|
2011-11-24 17:41:54 +08:00
|
|
|
struct kvm_memory_slot *slot;
|
2021-12-07 03:54:30 +08:00
|
|
|
int idx = slots->node_idx;
|
2011-11-24 17:40:57 +08:00
|
|
|
|
2021-12-07 03:54:30 +08:00
|
|
|
hash_for_each_possible(slots->id_hash, slot, id_node[idx], id) {
|
2021-12-07 03:54:27 +08:00
|
|
|
if (slot->id == id)
|
|
|
|
return slot;
|
|
|
|
}
|
2011-11-24 17:40:57 +08:00
|
|
|
|
2021-12-07 03:54:27 +08:00
|
|
|
return NULL;
|
2011-11-24 19:04:35 +08:00
|
|
|
}
|
|
|
|
|
2021-12-07 03:54:32 +08:00
|
|
|
/* Iterator used for walking memslots that overlap a gfn range. */
|
|
|
|
struct kvm_memslot_iter {
|
|
|
|
struct kvm_memslots *slots;
|
|
|
|
struct rb_node *node;
|
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
};
|
|
|
|
|
|
|
|
static inline void kvm_memslot_iter_next(struct kvm_memslot_iter *iter)
|
|
|
|
{
|
|
|
|
iter->node = rb_next(iter->node);
|
|
|
|
if (!iter->node)
|
|
|
|
return;
|
|
|
|
|
|
|
|
iter->slot = container_of(iter->node, struct kvm_memory_slot, gfn_node[iter->slots->node_idx]);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_memslot_iter_start(struct kvm_memslot_iter *iter,
|
|
|
|
struct kvm_memslots *slots,
|
|
|
|
gfn_t start)
|
|
|
|
{
|
|
|
|
int idx = slots->node_idx;
|
|
|
|
struct rb_node *tmp;
|
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
|
|
|
|
iter->slots = slots;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the so called "upper bound" of a key - the first node that has
|
|
|
|
* its key strictly greater than the searched one (the start gfn in our case).
|
|
|
|
*/
|
|
|
|
iter->node = NULL;
|
|
|
|
for (tmp = slots->gfn_tree.rb_node; tmp; ) {
|
|
|
|
slot = container_of(tmp, struct kvm_memory_slot, gfn_node[idx]);
|
|
|
|
if (start < slot->base_gfn) {
|
|
|
|
iter->node = tmp;
|
|
|
|
tmp = tmp->rb_left;
|
|
|
|
} else {
|
|
|
|
tmp = tmp->rb_right;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the slot with the lowest gfn that can possibly intersect with
|
|
|
|
* the range, so we'll ideally have slot start <= range start
|
|
|
|
*/
|
|
|
|
if (iter->node) {
|
|
|
|
/*
|
|
|
|
* A NULL previous node means that the very first slot
|
|
|
|
* already has a higher start gfn.
|
|
|
|
* In this case slot start > range start.
|
|
|
|
*/
|
|
|
|
tmp = rb_prev(iter->node);
|
|
|
|
if (tmp)
|
|
|
|
iter->node = tmp;
|
|
|
|
} else {
|
|
|
|
/* a NULL node below means no slots */
|
|
|
|
iter->node = rb_last(&slots->gfn_tree);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (iter->node) {
|
|
|
|
iter->slot = container_of(iter->node, struct kvm_memory_slot, gfn_node[idx]);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* It is possible in the slot start < range start case that the
|
|
|
|
* found slot ends before or at range start (slot end <= range start)
|
|
|
|
* and so it does not overlap the requested range.
|
|
|
|
*
|
|
|
|
* In such non-overlapping case the next slot (if it exists) will
|
|
|
|
* already have slot start > range start, otherwise the logic above
|
|
|
|
* would have found it instead of the current slot.
|
|
|
|
*/
|
|
|
|
if (iter->slot->base_gfn + iter->slot->npages <= start)
|
|
|
|
kvm_memslot_iter_next(iter);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool kvm_memslot_iter_is_valid(struct kvm_memslot_iter *iter, gfn_t end)
|
|
|
|
{
|
|
|
|
if (!iter->node)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this slot starts beyond or at the end of the range, so does
|
|
|
|
* every next one
|
|
|
|
*/
|
|
|
|
return iter->slot->base_gfn < end;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Iterate over each memslot at least partially intersecting [start, end) range */
|
|
|
|
#define kvm_for_each_memslot_in_gfn_range(iter, slots, start, end) \
|
|
|
|
for (kvm_memslot_iter_start(iter, slots, start); \
|
|
|
|
kvm_memslot_iter_is_valid(iter, end); \
|
|
|
|
kvm_memslot_iter_next(iter))
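
The range walk above can be illustrated outside the kernel. The following is a minimal user-space sketch, not the KVM implementation: it assumes the slots are held in an array sorted by base_gfn instead of an rbtree, and `struct demo_slot`, `demo_iter_start()` and `demo_count_in_range()` are hypothetical names introduced only for this illustration. It follows the same three steps as kvm_memslot_iter_start(): find the upper bound of the start gfn, step back one slot, then skip that slot if it ends at or before the range start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kvm_memory_slot: only the fields used here. */
struct demo_slot {
	uint64_t base_gfn;
	uint64_t npages;
};

/*
 * Return the index of the first slot that may overlap a range beginning at
 * 'start', or n if there is none. 'slots' must be sorted by base_gfn and the
 * slots must not overlap each other.
 */
static size_t demo_iter_start(const struct demo_slot *slots, size_t n,
			      uint64_t start)
{
	size_t lo = 0, hi = n;

	assert(slots != NULL || n == 0);

	/* Upper bound: first index whose base_gfn is strictly greater than start. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (start < slots[mid].base_gfn)
			hi = mid;
		else
			lo = mid + 1;
	}

	/* Step back one slot, if there is one, so that slot start <= range start. */
	if (lo > 0)
		lo--;

	/* That slot may end at or before the range start; if so, move on. */
	if (lo < n && slots[lo].base_gfn + slots[lo].npages <= start)
		lo++;

	return lo;
}

/* Count the slots that at least partially intersect [start, end). */
static size_t demo_count_in_range(const struct demo_slot *slots, size_t n,
				  uint64_t start, uint64_t end)
{
	size_t count = 0;

	/* Once a slot starts at or beyond 'end', every later slot does too. */
	for (size_t i = demo_iter_start(slots, n, start);
	     i < n && slots[i].base_gfn < end; i++)
		count++;

	return count;
}
```

With three slots covering gfns [0, 10), [20, 30) and [40, 50), a query for [5, 25) visits the first two slots, while a query for [10, 20) falls entirely into the gap and visits none.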
|
|
|
|
|
2013-02-27 18:43:44 +08:00
|
|
|
/*
|
|
|
|
* KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
|
|
|
|
* - create a new memory slot
|
|
|
|
* - delete an existing memory slot
|
|
|
|
* - modify an existing memory slot
|
|
|
|
* -- move it in the guest physical memory space
|
|
|
|
* -- just change its flags
|
|
|
|
*
|
|
|
|
* Since flags can be changed by some of these operations, the following
|
|
|
|
* differentiation is the best we can do for __kvm_set_memory_region():
|
|
|
|
*/
|
|
|
|
enum kvm_mr_change {
|
|
|
|
KVM_MR_CREATE,
|
|
|
|
KVM_MR_DELETE,
|
|
|
|
KVM_MR_MOVE,
|
|
|
|
KVM_MR_FLAGS_ONLY,
|
|
|
|
};
|
|
|
|
|
2007-10-25 05:52:57 +08:00
|
|
|
int kvm_set_memory_region(struct kvm *kvm,
|
2015-05-18 19:59:39 +08:00
|
|
|
const struct kvm_userspace_memory_region *mem);
|
2007-10-29 09:40:42 +08:00
|
|
|
int __kvm_set_memory_region(struct kvm *kvm,
|
2015-05-18 19:59:39 +08:00
|
|
|
const struct kvm_userspace_memory_region *mem);
|
2020-02-19 05:07:27 +08:00
|
|
|
void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
|
2019-02-06 04:54:17 +08:00
|
|
|
void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
|
2009-12-24 00:35:18 +08:00
|
|
|
int kvm_arch_prepare_memory_region(struct kvm *kvm,
|
2021-12-07 03:54:11 +08:00
|
|
|
const struct kvm_memory_slot *old,
|
|
|
|
struct kvm_memory_slot *new,
|
2013-02-27 18:44:34 +08:00
|
|
|
enum kvm_mr_change change);
|
2009-12-24 00:35:18 +08:00
|
|
|
void kvm_arch_commit_memory_region(struct kvm *kvm,
|
2020-02-19 05:07:24 +08:00
|
|
|
struct kvm_memory_slot *old,
|
2015-05-18 19:20:23 +08:00
|
|
|
const struct kvm_memory_slot *new,
|
2013-02-27 18:45:25 +08:00
|
|
|
enum kvm_mr_change change);
|
2012-08-25 02:54:57 +08:00
|
|
|
/* flush all memory translations */
|
|
|
|
void kvm_arch_flush_shadow_all(struct kvm *kvm);
|
|
|
|
/* flush memory translations pointing to 'slot' */
|
|
|
|
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
|
|
|
|
struct kvm_memory_slot *slot);
|
2009-12-24 00:35:23 +08:00
|
|
|
|
2015-05-19 22:01:50 +08:00
|
|
|
int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn,
|
|
|
|
struct page **pages, int nr_pages);
|
2010-08-22 19:11:43 +08:00
|
|
|
|
2007-03-30 19:02:32 +08:00
|
|
|
struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
|
2008-02-23 22:44:30 +08:00
|
|
|
unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
|
2013-09-09 19:52:33 +08:00
|
|
|
unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable);
|
2012-08-21 11:02:51 +08:00
|
|
|
unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
|
2014-08-19 18:15:00 +08:00
|
|
|
unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
|
|
|
|
bool *writable);
|
2007-11-20 17:49:33 +08:00
|
|
|
void kvm_release_page_clean(struct page *page);
|
|
|
|
void kvm_release_page_dirty(struct page *page);
|
2008-04-03 03:46:56 +08:00
|
|
|
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent memory may exhaust available DRAM, introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 08:56:11 +08:00
|
|
|
kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
|
|
|
|
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
|
2010-10-23 00:18:18 +08:00
|
|
|
bool *writable);
|
2021-11-16 07:45:58 +08:00
|
|
|
kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn);
|
|
|
|
kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn);
|
|
|
|
kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
|
2022-10-12 03:58:08 +08:00
|
|
|
bool atomic, bool interruptible, bool *async,
|
|
|
|
bool write_fault, bool *writable, hva_t *hva);
|
2012-08-21 10:59:12 +08:00
|
|
|
|
2016-01-16 08:56:11 +08:00
|
|
|
void kvm_release_pfn_clean(kvm_pfn_t pfn);
|
2017-09-01 23:11:43 +08:00
|
|
|
void kvm_release_pfn_dirty(kvm_pfn_t pfn);
|
2016-01-16 08:56:11 +08:00
|
|
|
void kvm_set_pfn_dirty(kvm_pfn_t pfn);
|
|
|
|
void kvm_set_pfn_accessed(kvm_pfn_t pfn);
|
2008-04-03 03:46:56 +08:00
|
|
|
|
2021-11-16 00:50:27 +08:00
|
|
|
void kvm_release_pfn(kvm_pfn_t pfn, bool dirty);
|
2007-10-02 04:14:18 +08:00
|
|
|
int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
|
|
|
|
int len);
|
|
|
|
int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
|
2017-05-02 22:20:18 +08:00
|
|
|
int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
void *data, unsigned long len);
|
2020-05-25 22:41:19 +08:00
|
|
|
int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
void *data, unsigned int offset,
|
|
|
|
unsigned long len);
|
2007-10-02 04:14:18 +08:00
|
|
|
int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
|
|
|
|
int offset, int len);
|
|
|
|
int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
|
|
|
|
unsigned long len);
|
2017-05-02 22:20:18 +08:00
|
|
|
int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
void *data, unsigned long len);
|
|
|
|
int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
2018-12-15 06:34:43 +08:00
|
|
|
void *data, unsigned int offset,
|
|
|
|
unsigned long len);
|
2017-05-02 22:20:18 +08:00
|
|
|
int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
|
|
|
|
gpa_t gpa, unsigned long len);
|
2019-10-21 23:28:17 +08:00
|
|
|
|
2020-08-05 01:06:02 +08:00
|
|
|
#define __kvm_get_guest(kvm, gfn, offset, v) \
|
|
|
|
({ \
|
|
|
|
unsigned long __addr = gfn_to_hva(kvm, gfn); \
|
|
|
|
typeof(v) __user *__uaddr = (typeof(__uaddr))(__addr + offset); \
|
|
|
|
int __ret = -EFAULT; \
|
|
|
|
\
|
|
|
|
if (!kvm_is_error_hva(__addr)) \
|
|
|
|
__ret = get_user(v, __uaddr); \
|
|
|
|
__ret; \
|
|
|
|
})
|
|
|
|
|
|
|
|
#define kvm_get_guest(kvm, gpa, v) \
|
|
|
|
({ \
|
|
|
|
gpa_t __gpa = gpa; \
|
|
|
|
struct kvm *__kvm = kvm; \
|
|
|
|
\
|
|
|
|
__kvm_get_guest(__kvm, __gpa >> PAGE_SHIFT, \
|
|
|
|
offset_in_page(__gpa), v); \
|
|
|
|
})
|
|
|
|
|
2020-08-05 01:06:01 +08:00
|
|
|
#define __kvm_put_guest(kvm, gfn, offset, v) \
|
2019-10-21 23:28:17 +08:00
|
|
|
({ \
|
|
|
|
unsigned long __addr = gfn_to_hva(kvm, gfn); \
|
2020-08-05 01:06:01 +08:00
|
|
|
typeof(v) __user *__uaddr = (typeof(__uaddr))(__addr + offset); \
|
2019-10-21 23:28:17 +08:00
|
|
|
int __ret = -EFAULT; \
|
|
|
|
\
|
|
|
|
if (!kvm_is_error_hva(__addr)) \
|
2020-08-05 01:06:01 +08:00
|
|
|
__ret = put_user(v, __uaddr); \
|
2019-10-21 23:28:17 +08:00
|
|
|
if (!__ret) \
|
|
|
|
mark_page_dirty(kvm, gfn); \
|
|
|
|
__ret; \
|
|
|
|
})
|
|
|
|
|
2020-08-05 01:06:01 +08:00
|
|
|
#define kvm_put_guest(kvm, gpa, v) \
|
2019-10-21 23:28:17 +08:00
|
|
|
({ \
|
|
|
|
gpa_t __gpa = gpa; \
|
|
|
|
struct kvm *__kvm = kvm; \
|
2020-08-05 01:06:01 +08:00
|
|
|
\
|
2019-10-21 23:28:17 +08:00
|
|
|
__kvm_put_guest(__kvm, __gpa >> PAGE_SHIFT, \
|
2020-08-05 01:06:01 +08:00
|
|
|
offset_in_page(__gpa), v); \
|
2019-10-21 23:28:17 +08:00
|
|
|
})
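
Both kvm_get_guest() and kvm_put_guest() split a guest physical address into a guest frame number (`gpa >> PAGE_SHIFT`) and a byte offset within that page (`offset_in_page(gpa)`). The arithmetic can be sketched in user space as follows. This is only an illustration: it assumes 4 KiB pages, and every `DEMO_`/`demo_` name is hypothetical and does not belong to the kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages, i.e. a page shift of 12. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1ULL << DEMO_PAGE_SHIFT)

/* Guest frame number: the page-sized frame that contains the address. */
static uint64_t demo_gpa_to_gfn(uint64_t gpa)
{
	return gpa >> DEMO_PAGE_SHIFT;
}

/* Byte offset of the address within its page. */
static uint64_t demo_offset_in_page(uint64_t gpa)
{
	return gpa & (DEMO_PAGE_SIZE - 1);
}

/* Inverse operation: rebuild the address from frame number and offset. */
static uint64_t demo_gfn_offset_to_gpa(uint64_t gfn, uint64_t offset)
{
	assert(offset < DEMO_PAGE_SIZE);
	return (gfn << DEMO_PAGE_SHIFT) | offset;
}
```

Under these assumptions the address 0x12345 lies in frame 0x12 at offset 0x345, and recombining the two values yields the original address.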
|
|
|
|
|
2007-10-02 04:14:18 +08:00
|
|
|
int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
|
2006-12-10 18:21:36 +08:00
|
|
|
struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
|
2015-11-14 11:21:06 +08:00
|
|
|
bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
|
2020-07-08 22:00:23 +08:00
|
|
|
bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
|
2020-01-09 04:24:37 +08:00
|
|
|
unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
|
2021-11-16 07:45:58 +08:00
|
|
|
void mark_page_dirty_in_slot(struct kvm *kvm, const struct kvm_memory_slot *memslot, gfn_t gfn);
|
2006-12-10 18:21:36 +08:00
|
|
|
void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
|
|
|
|
|
2015-05-17 19:58:53 +08:00
|
|
|
struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
|
|
|
|
struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given that the memmap array for a
large pool of persistent memory may exhaust available DRAM, introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 08:56:11 +08:00
|
|
|
kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
|
|
|
|
kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
|
2019-02-01 04:24:34 +08:00
|
|
|
int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
|
|
|
|
void kvm_vcpu_unmap(struct kvm_vcpu *vcpu, struct kvm_host_map *map, bool dirty);
|
2015-05-17 19:58:53 +08:00
|
|
|
unsigned long kvm_vcpu_gfn_to_hva(struct kvm_vcpu *vcpu, gfn_t gfn);
|
|
|
|
unsigned long kvm_vcpu_gfn_to_hva_prot(struct kvm_vcpu *vcpu, gfn_t gfn, bool *writable);
|
|
|
|
int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data, int offset,
|
|
|
|
int len);
|
|
|
|
int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa, void *data,
|
|
|
|
unsigned long len);
|
|
|
|
int kvm_vcpu_read_guest(struct kvm_vcpu *vcpu, gpa_t gpa, void *data,
|
|
|
|
unsigned long len);
|
|
|
|
int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, const void *data,
|
|
|
|
int offset, int len);
|
|
|
|
int kvm_vcpu_write_guest(struct kvm_vcpu *vcpu, gpa_t gpa, const void *data,
|
|
|
|
unsigned long len);
|
|
|
|
void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn);
|
|
|
|
|
KVM: Reinstate gfn_to_pfn_cache with invalidation support
This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.
For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.
Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask but I don't see a use case for
that additional complexity right now.
Invalidation happens from the invalidate_range_start MMU notifier, which
needs to be able to sleep in order to wake the vCPU and wait for it.
This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like the vCPU when it takes a page fault in that period, we
just spin — fixing that in a future patch by implementing an actual
*wait* may be another part of shaving this particularly hirsute yak.
As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211210163625.2886-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-11 00:36:21 +08:00
|
|
|
/**
|
2022-10-14 05:12:19 +08:00
|
|
|
* kvm_gpc_init - initialize gfn_to_pfn_cache.
|
|
|
|
*
|
|
|
|
* @gpc: struct gfn_to_pfn_cache object.
|
2021-12-11 00:36:21 +08:00
|
|
|
* @kvm: pointer to kvm instance.
|
|
|
|
* @vcpu: vCPU to be used for marking pages dirty and to be woken on
|
|
|
|
* invalidation.
|
2022-03-03 23:41:11 +08:00
|
|
|
* @usage: indicates if the resulting host physical PFN is used while
|
|
|
|
* the @vcpu is IN_GUEST_MODE (in which case invalidation of
|
|
|
|
* the cache from MMU notifiers---but not for KVM memslot
|
|
|
|
* changes!---will also force @vcpu to exit the guest and
|
|
|
|
* refresh the cache); and/or if the PFN is used directly
|
|
|
|
* by KVM (and thus needs a kernel virtual mapping).
|
2022-10-14 05:12:24 +08:00
|
|
|
*
|
|
|
|
* This sets up a gfn_to_pfn_cache by initializing locks and assigning the
|
|
|
|
* immutable attributes. Note, the cache must be zero-allocated (or zeroed by
|
|
|
|
* the caller before init).
|
|
|
|
*/
|
|
|
|
void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm,
|
|
|
|
struct kvm_vcpu *vcpu, enum pfn_cache_usage usage);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_gpc_activate - prepare a cached kernel mapping and HPA for a given guest
|
|
|
|
* physical address.
|
|
|
|
*
|
|
|
|
* @gpc: struct gfn_to_pfn_cache object.
|
2021-12-11 00:36:21 +08:00
|
|
|
* @gpa: guest physical address to map.
|
|
|
|
* @len: sanity check; the range being accessed must fit a single page.
|
|
|
|
*
|
|
|
|
* @return: 0 for success.
|
|
|
|
* -EINVAL for a mapping which would cross a page boundary.
|
2022-10-14 05:12:24 +08:00
|
|
|
* -EFAULT for an untranslatable guest physical address.
|
2021-12-11 00:36:21 +08:00
|
|
|
*
|
2022-10-14 05:12:24 +08:00
|
|
|
* This primes a gfn_to_pfn_cache and links it into the @gpc->kvm's list for
|
2022-10-14 05:12:22 +08:00
|
|
|
* invalidations to be processed. Callers are required to use kvm_gpc_check()
|
|
|
|
* to ensure that the cache is valid before accessing the target page.
|
2021-12-11 00:36:21 +08:00
|
|
|
*/
|
2022-10-14 05:12:24 +08:00
|
|
|
int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len);
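The @len "sanity check" described above can be illustrated outside the kernel. This is a minimal user-space sketch, not the kernel implementation: `SKETCH_PAGE_SIZE`, `sketch_gpa_t` and `sketch_check_len()` are hypothetical names, and a 4 KiB page size is assumed. It only shows the arithmetic behind rejecting a range that would cross a page boundary with -EINVAL.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel itself would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical stand-in for the kernel's gpa_t. */
typedef uint64_t sketch_gpa_t;

/*
 * Return 0 if [gpa, gpa + len) lies within a single page, and -EINVAL
 * if the range would cross into the next page.
 */
static int sketch_check_len(sketch_gpa_t gpa, unsigned long len)
{
	/* Byte offset of gpa within its page; always < SKETCH_PAGE_SIZE. */
	unsigned long offset = (unsigned long)(gpa & (SKETCH_PAGE_SIZE - 1));

	/* Compare against the bytes left in the page to avoid overflow. */
	if (len > SKETCH_PAGE_SIZE - offset)
		return -EINVAL;

	return 0;
}
```

Comparing `len` against the space remaining in the page, rather than computing `offset + len`, keeps the check safe even for very large `len` values.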
|
2021-12-11 00:36:21 +08:00
|
|
|
|
|
|
|
/**
|
2022-10-14 05:12:22 +08:00
|
|
|
* kvm_gpc_check - check validity of a gfn_to_pfn_cache.
|
2021-12-11 00:36:21 +08:00
|
|
|
*
|
|
|
|
* @gpc: struct gfn_to_pfn_cache object.
|
|
|
|
* @len: sanity check; the range being accessed must fit a single page.
|
|
|
|
*
|
|
|
|
* @return: %true if the cache is still valid and the address matches.
|
|
|
|
* %false if the cache is not valid.
|
|
|
|
*
|
|
|
|
* Callers outside IN_GUEST_MODE context should hold a read lock on @gpc->lock
|
|
|
|
* while calling this function, and then continue to hold the lock until the
|
|
|
|
* access is complete.
|
|
|
|
*
|
|
|
|
* Callers in IN_GUEST_MODE may do so without locking, although they should
|
|
|
|
* still hold a read lock on kvm->srcu for the memslot checks.
|
|
|
|
*/
|
2022-10-14 05:12:31 +08:00
|
|
|
bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len);
|
2021-12-11 00:36:21 +08:00
|
|
|
|
|
|
|
/**
|
2022-10-14 05:12:22 +08:00
|
|
|
* kvm_gpc_refresh - update a previously initialized cache.
|
2021-12-11 00:36:21 +08:00
|
|
|
*
|
|
|
|
* @gpc: struct gfn_to_pfn_cache object.
|
|
|
|
* @len: sanity check; the range being accessed must fit a single page.
|
|
|
|
*
|
|
|
|
* @return: 0 for success.
|
|
|
|
* -EINVAL for a mapping which would cross a page boundary.
|
2022-10-14 05:12:28 +08:00
|
|
|
* -EFAULT for an untranslatable guest physical address.
|
2021-12-11 00:36:21 +08:00
|
|
|
*
|
|
|
|
* This will attempt to refresh a gfn_to_pfn_cache. Note that a successful
|
2022-10-14 05:12:28 +08:00
|
|
|
* return from this function does not mean the page can be immediately
|
2021-12-11 00:36:21 +08:00
|
|
|
* accessed because it may have raced with an invalidation. Callers must
|
|
|
|
* still lock and check the cache status, as this function does not return
|
|
|
|
* with the lock still held to permit access.
|
|
|
|
*/
|
2022-10-14 05:12:31 +08:00
|
|
|
int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len);
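The comments on kvm_gpc_check() and kvm_gpc_refresh() imply a lock, check, refresh, re-check loop for callers outside IN_GUEST_MODE. The following is a user-space model of that loop under stated assumptions, not kernel code: `struct sketch_gpc` and every `sketch_*` function are hypothetical stand-ins, the "lock" is a plain counter instead of an rwlock, and refresh here trivially marks the cache valid.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct gfn_to_pfn_cache; not the kernel layout. */
struct sketch_gpc {
	bool valid;      /* cleared when an invalidation hits the cache */
	bool unmappable; /* simulates an untranslatable guest address */
	int lock_depth;  /* models read-lock acquire/release nesting */
	int refreshes;   /* counts calls to sketch_gpc_refresh() */
	int value;       /* stand-in for the data behind the mapping */
};

static void sketch_read_lock(struct sketch_gpc *gpc)   { gpc->lock_depth++; }
static void sketch_read_unlock(struct sketch_gpc *gpc) { gpc->lock_depth--; }

/* Models kvm_gpc_check(): meant to be called with the read lock held. */
static bool sketch_gpc_check(const struct sketch_gpc *gpc)
{
	return gpc->valid;
}

/* Models kvm_gpc_refresh(): called without the lock, returns unlocked. */
static int sketch_gpc_refresh(struct sketch_gpc *gpc)
{
	gpc->refreshes++;
	if (gpc->unmappable)
		return -EFAULT;
	gpc->valid = true;
	return 0;
}

/* Read the cached value using the lock/check/refresh/re-check loop. */
static int sketch_read_cached(struct sketch_gpc *gpc, int *out)
{
	sketch_read_lock(gpc);
	while (!sketch_gpc_check(gpc)) {
		int ret;

		/* Refresh may need to sleep, so drop the lock first. */
		sketch_read_unlock(gpc);
		ret = sketch_gpc_refresh(gpc);
		if (ret)
			return ret;

		/* Success does not guarantee validity: re-lock and re-check. */
		sketch_read_lock(gpc);
	}

	*out = gpc->value; /* access happens while the lock is still held */
	sketch_read_unlock(gpc);
	return 0;
}
```

The loop re-checks after every refresh because, as the comment above notes, a successful refresh may have raced with an invalidation and does not return with the lock held.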
|
2021-12-11 00:36:21 +08:00
|
|
|
|
|
|
|
/**
|
2022-10-14 05:12:19 +08:00
|
|
|
* kvm_gpc_deactivate - deactivate and unlink a gfn_to_pfn_cache.
|
2021-12-11 00:36:21 +08:00
|
|
|
*
|
|
|
|
* @gpc: struct gfn_to_pfn_cache object.
|
|
|
|
*
|
2022-10-14 05:12:24 +08:00
|
|
|
* This removes a cache from the VM's list to be processed on MMU notifier
|
2021-12-11 00:36:21 +08:00
|
|
|
* invocation.
|
|
|
|
*/
|
2022-10-14 05:12:24 +08:00
|
|
|
void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc);
|
2021-12-11 00:36:21 +08:00
|
|
|
|
2017-11-25 05:39:01 +08:00
|
|
|
void kvm_sigset_activate(struct kvm_vcpu *vcpu);
|
|
|
|
void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
|
|
|
|
|
2021-10-09 10:12:06 +08:00
|
|
|
void kvm_vcpu_halt(struct kvm_vcpu *vcpu);
|
2021-10-09 10:12:07 +08:00
|
|
|
bool kvm_vcpu_block(struct kvm_vcpu *vcpu);
|
2015-08-27 22:41:15 +08:00
|
|
|
void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu);
|
|
|
|
void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu);
|
2017-04-27 04:32:26 +08:00
|
|
|
bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu);
|
2012-03-09 05:44:24 +08:00
|
|
|
void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
|
2014-05-23 18:20:42 +08:00
|
|
|
int kvm_vcpu_yield_to(struct kvm_vcpu *target);
|
2017-08-08 12:05:32 +08:00
|
|
|
void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
|
2010-11-23 11:13:00 +08:00
|
|
|
|
2007-06-08 00:18:30 +08:00
|
|
|
void kvm_flush_remote_tlbs(struct kvm *kvm);
|
2018-05-16 23:21:28 +08:00
|
|
|
|
2020-07-03 10:35:39 +08:00
|
|
|
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
|
|
|
|
int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
|
2022-06-23 03:27:08 +08:00
|
|
|
int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
|
2020-07-03 10:35:39 +08:00
|
|
|
int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
|
|
|
|
void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
|
|
|
|
void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
|
|
|
|
#endif
|
|
|
|
|
2022-08-16 20:53:22 +08:00
|
|
|
void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
|
|
|
|
unsigned long end);
|
|
|
|
void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
|
|
|
|
unsigned long end);
|
2021-08-11 04:52:39 +08:00
|
|
|
|
2007-10-10 23:16:19 +08:00
|
|
|
long kvm_arch_dev_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg);
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into architecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interesting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-12 01:16:52 +08:00
|
|
|
long kvm_arch_vcpu_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg);
|
2018-04-19 03:19:58 +08:00
|
|
|
vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf);
|
2007-11-15 23:07:47 +08:00
|
|
|
|
2014-07-15 00:27:35 +08:00
|
|
|
int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext);
|
2007-11-15 23:07:47 +08:00
|
|
|
|
2015-01-28 10:54:23 +08:00
|
|
|
void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
|
2015-01-16 07:58:53 +08:00
|
|
|
struct kvm_memory_slot *slot,
|
|
|
|
gfn_t gfn_offset,
|
|
|
|
unsigned long mask);
|
2020-02-19 05:07:29 +08:00
|
|
|
void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot);
|
|
|
|
|
|
|
|
#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
|
|
|
|
void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
|
2021-04-02 23:53:09 +08:00
|
|
|
const struct kvm_memory_slot *memslot);
|
2020-02-19 05:07:29 +08:00
|
|
|
#else /* !CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
|
|
|
|
int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log);
|
|
|
|
int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
|
KVM: Ensure validity of memslot with respect to kvm_get_dirty_log()
Rework kvm_get_dirty_log() so that it "returns" the associated memslot
on success. A future patch will rework memslot handling such that
id_to_memslot() can return NULL; returning the memslot makes it more
obvious that the validity of the memslot has been verified, i.e.
precludes the need to add validity checks in the arch code that are
technically unnecessary.
To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
This is a nop for PPC, the only other arch that doesn't select
KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.
Ideally, moving the sync_dirty_log() call would be done in a separate
patch, but it can't be done in a follow-on patch because that would
temporarily break s390's ordering. Making the move in a preparatory
patch would be functionally correct, but would create an odd scenario
where the moved sync_dirty_log() would operate on a "different" memslot
due to consuming the result of a different id_to_memslot(). The
memslot couldn't actually be different as slots_lock is held, but the
code is confusing enough as it is, i.e. moving sync_dirty_log() in this
patch is the lesser of all evils.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-19 05:07:30 +08:00
|
|
|
int *is_dirty, struct kvm_memory_slot **memslot);
|
2020-02-19 05:07:29 +08:00
|
|
|
#endif
|
2007-11-18 20:29:43 +08:00
|
|
|
|
2013-04-11 19:21:40 +08:00
|
|
|
int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level,
|
|
|
|
bool line_status);
|
2017-02-16 17:40:56 +08:00
|
|
|
int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
|
|
|
|
struct kvm_enable_cap *cap);
|
2007-10-29 23:08:35 +08:00
|
|
|
long kvm_arch_vm_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg);
|
2022-10-18 02:45:39 +08:00
|
|
|
long kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl,
|
|
|
|
unsigned long arg);
|
2007-10-12 01:16:52 +08:00
|
|
|
|
2007-11-01 06:24:25 +08:00
|
|
|
int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu);
|
|
|
|
int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu);
|
|
|
|
|
2007-11-16 13:05:55 +08:00
|
|
|
int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_translation *tr);
|
|
|
|
|
2007-11-02 03:16:10 +08:00
|
|
|
int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
|
|
|
|
int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
|
|
|
|
int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_sregs *sregs);
|
|
|
|
int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_sregs *sregs);
|
2008-04-12 00:24:45 +08:00
|
|
|
int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_mp_state *mp_state);
|
|
|
|
int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_mp_state *mp_state);
|
2008-12-15 20:52:10 +08:00
|
|
|
int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_guest_debug *dbg);
|
2020-04-16 13:10:57 +08:00
|
|
|
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu);
|
2007-11-02 03:16:10 +08:00
|
|
|
|
2007-11-14 20:40:21 +08:00
|
|
|
int kvm_arch_init(void *opaque);
|
|
|
|
void kvm_arch_exit(void);
|
2007-10-10 23:16:19 +08:00
|
|
|
|
2014-08-22 00:08:05 +08:00
|
|
|
void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
|
|
|
|
|
2007-11-14 20:38:21 +08:00
|
|
|
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
|
|
|
|
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
|
2019-12-19 05:55:09 +08:00
|
|
|
int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id);
|
2019-12-19 05:55:15 +08:00
|
|
|
int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu);
|
2014-12-04 22:47:07 +08:00
|
|
|
void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
|
2007-11-20 04:04:43 +08:00
|
|
|
void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
|
2007-11-14 20:38:21 +08:00
|
|
|
|
2021-06-06 10:10:44 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
|
|
|
|
int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state);
|
|
|
|
#endif
|
|
|
|
|
2019-08-03 14:14:25 +08:00
|
|
|
#ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS
|
2020-06-04 21:16:52 +08:00
|
|
|
void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
|
2022-05-24 03:03:27 +08:00
|
|
|
#else
|
|
|
|
static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
|
2019-08-03 14:14:25 +08:00
|
|
|
#endif
|
2016-09-08 02:47:23 +08:00
|
|
|
|
2014-08-28 21:13:03 +08:00
|
|
|
int kvm_arch_hardware_enable(void);
|
|
|
|
void kvm_arch_hardware_disable(void);
|
2020-03-22 04:25:55 +08:00
|
|
|
int kvm_arch_hardware_setup(void *opaque);
|
2007-11-14 20:38:21 +08:00
|
|
|
void kvm_arch_hardware_unsetup(void);
|
2020-03-22 04:25:55 +08:00
|
|
|
int kvm_arch_check_processor_compat(void *opaque);
|
2007-12-14 09:35:10 +08:00
|
|
|
int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
|
2017-08-08 12:05:32 +08:00
|
|
|
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
|
2012-03-09 05:44:24 +08:00
|
|
|
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
|
2019-08-05 10:03:19 +08:00
|
|
|
bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);
|
2021-04-16 11:08:10 +08:00
|
|
|
bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
|
2020-02-14 01:22:55 +08:00
|
|
|
int kvm_arch_post_init_vm(struct kvm *kvm);
|
|
|
|
void kvm_arch_pre_destroy_vm(struct kvm *kvm);
|
2021-07-31 06:04:49 +08:00
|
|
|
int kvm_arch_create_vm_debugfs(struct kvm *kvm);
|
2007-11-14 20:38:21 +08:00
|
|
|
|
2010-11-10 00:02:49 +08:00
|
|
|
#ifndef __KVM_HAVE_ARCH_VM_ALLOC
|
2018-05-15 19:37:37 +08:00
|
|
|
/*
|
|
|
|
* All architectures that want to use vzalloc currently also
|
|
|
|
* need their own kvm_arch_alloc_vm implementation.
|
|
|
|
*/
|
2010-11-10 00:02:49 +08:00
|
|
|
static inline struct kvm *kvm_arch_alloc_vm(void)
|
|
|
|
{
|
|
|
|
return kzalloc(sizeof(struct kvm), GFP_KERNEL);
|
|
|
|
}
|
2021-09-03 21:08:05 +08:00
|
|
|
#endif
|
|
|
|
|
|
|
|
static inline void __kvm_arch_free_vm(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
kvfree(kvm);
|
|
|
|
}
|
2010-11-10 00:02:49 +08:00
|
|
|
|
2021-09-03 21:08:05 +08:00
|
|
|
#ifndef __KVM_HAVE_ARCH_VM_FREE
|
2010-11-10 00:02:49 +08:00
|
|
|
static inline void kvm_arch_free_vm(struct kvm *kvm)
|
|
|
|
{
|
2021-09-03 21:08:05 +08:00
|
|
|
__kvm_arch_free_vm(kvm);
|
2010-11-10 00:02:49 +08:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-07-19 16:40:17 +08:00
|
|
|
#ifndef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB
|
|
|
|
static inline int kvm_arch_flush_remote_tlb(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return -ENOTSUPP;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2013-10-31 01:02:30 +08:00
|
|
|
#ifdef __KVM_HAVE_ARCH_NONCOHERENT_DMA
|
|
|
|
void kvm_arch_register_noncoherent_dma(struct kvm *kvm);
|
|
|
|
void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm);
|
|
|
|
bool kvm_arch_has_noncoherent_dma(struct kvm *kvm);
|
|
|
|
#else
|
|
|
|
static inline void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif
|
2015-07-07 21:41:58 +08:00
|
|
|
#ifdef __KVM_HAVE_ARCH_ASSIGNED_DEVICE
|
|
|
|
void kvm_arch_start_assignment(struct kvm *kvm);
|
|
|
|
void kvm_arch_end_assignment(struct kvm *kvm);
|
|
|
|
bool kvm_arch_has_assigned_device(struct kvm *kvm);
|
|
|
|
#else
|
|
|
|
static inline void kvm_arch_start_assignment(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_arch_end_assignment(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2022-06-15 05:15:32 +08:00
|
|
|
static __always_inline bool kvm_arch_has_assigned_device(struct kvm *kvm)
|
2015-07-07 21:41:58 +08:00
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif
|
2013-10-31 01:02:30 +08:00
|
|
|
|
2020-04-24 13:48:37 +08:00
|
|
|
static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
|
2012-03-09 05:44:24 +08:00
|
|
|
{
|
2012-03-14 05:35:01 +08:00
|
|
|
#ifdef __KVM_HAVE_ARCH_WQP
|
2020-04-24 13:48:37 +08:00
|
|
|
return vcpu->arch.waitp;
|
2012-03-14 05:35:01 +08:00
|
|
|
#else
|
2020-04-24 13:48:37 +08:00
|
|
|
return &vcpu->wait;
|
2012-03-09 05:44:24 +08:00
|
|
|
#endif
|
2012-03-14 05:35:01 +08:00
|
|
|
}
|
2012-03-09 05:44:24 +08:00
|
|
|
|
2021-10-09 10:12:12 +08:00
|
|
|
/*
|
|
|
|
* Wake a vCPU if necessary, but don't do any stats/metadata updates. Returns
|
|
|
|
* true if the vCPU was blocking and was awakened, false otherwise.
|
|
|
|
*/
|
|
|
|
static inline bool __kvm_vcpu_wake_up(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return !!rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool kvm_vcpu_is_blocking(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return rcuwait_active(kvm_arch_vcpu_get_wait(vcpu));
|
|
|
|
}
|
|
|
|
|
2015-03-04 18:14:33 +08:00
|
|
|
#ifdef __KVM_HAVE_ARCH_INTC_INITIALIZED
|
|
|
|
/*
|
|
|
|
* Returns true if the virtual interrupt controller is initialized and
|
|
|
|
* ready to accept virtual IRQs. On some architectures the virtual interrupt
|
|
|
|
* controller is dynamically instantiated and this is not always true.
|
|
|
|
*/
|
|
|
|
bool kvm_arch_intc_initialized(struct kvm *kvm);
|
|
|
|
#else
|
|
|
|
static inline bool kvm_arch_intc_initialized(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2021-11-11 10:07:33 +08:00
|
|
|
#ifdef CONFIG_GUEST_PERF_EVENTS
|
|
|
|
unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
|
|
|
|
|
|
|
|
void kvm_register_perf_callbacks(unsigned int (*pt_intr_handler)(void));
|
|
|
|
void kvm_unregister_perf_callbacks(void);
|
|
|
|
#else
|
|
|
|
static inline void kvm_register_perf_callbacks(void *ign) {}
|
|
|
|
static inline void kvm_unregister_perf_callbacks(void) {}
|
|
|
|
#endif /* CONFIG_GUEST_PERF_EVENTS */
|
|
|
|
|
2012-01-04 17:25:20 +08:00
|
|
|
int kvm_arch_init_vm(struct kvm *kvm, unsigned long type);
|
2007-11-18 18:43:45 +08:00
|
|
|
void kvm_arch_destroy_vm(struct kvm *kvm);
|
2009-01-06 10:03:02 +08:00
|
|
|
void kvm_arch_sync_events(struct kvm *kvm);
|
2007-11-14 20:38:21 +08:00
|
|
|
|
2008-04-12 01:53:26 +08:00
|
|
|
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
|
2007-12-11 20:36:00 +08:00
|
|
|
|
2022-04-29 09:04:15 +08:00
|
|
|
struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
|
2022-04-29 09:04:14 +08:00
|
|
|
bool kvm_is_zone_device_page(struct page *page);
|
2008-09-27 10:55:40 +08:00
|
|
|
|
2008-09-14 08:48:28 +08:00
|
|
|
struct kvm_irq_ack_notifier {
|
|
|
|
struct hlist_node link;
|
|
|
|
unsigned gsi;
|
|
|
|
void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
|
|
|
|
};
|
|
|
|
|
2014-06-30 18:51:11 +08:00
|
|
|
int kvm_irq_map_gsi(struct kvm *kvm,
|
|
|
|
struct kvm_kernel_irq_routing_entry *entries, int gsi);
|
|
|
|
int kvm_irq_map_chip_pin(struct kvm *kvm, unsigned irqchip, unsigned pin);
|
2014-06-30 18:51:10 +08:00
|
|
|
|
2013-04-11 19:21:40 +08:00
|
|
|
int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
|
|
|
|
bool line_status);
|
2010-11-19 01:09:08 +08:00
|
|
|
int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
|
2013-04-11 19:21:40 +08:00
|
|
|
int irq_source_id, int level, bool line_status);
|
2015-10-29 02:16:47 +08:00
|
|
|
int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
|
|
|
|
struct kvm *kvm, int irq_source_id,
|
|
|
|
int level, bool line_status);
|
2013-01-25 10:18:51 +08:00
|
|
|
bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin);
|
2015-10-16 15:07:46 +08:00
|
|
|
void kvm_notify_acked_gsi(struct kvm *kvm, int gsi);
|
2009-01-28 01:12:38 +08:00
|
|
|
void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
|
2008-10-06 13:48:45 +08:00
|
|
|
void kvm_register_irq_ack_notifier(struct kvm *kvm,
|
|
|
|
struct kvm_irq_ack_notifier *kian);
|
2009-06-05 02:08:24 +08:00
|
|
|
void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
|
|
|
|
struct kvm_irq_ack_notifier *kian);
|
2008-10-15 20:15:06 +08:00
|
|
|
int kvm_request_irq_source_id(struct kvm *kvm);
|
|
|
|
void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
|
2019-07-10 08:24:03 +08:00
|
|
|
bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
|
2008-09-14 08:48:28 +08:00
|
|
|
|
2012-01-13 04:09:51 +08:00
|
|
|
/*
|
KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.
Because the memslot array that is currently in use cannot be modified,
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.
Strictly speaking, however, it is only necessary to make a copy of the
memslot that is being modified; copying all the memslots currently present
is just a limitation of the array-based memslot implementation.
Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.
In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.
The memslots themselves should also be kept independent of each other
so they can be individually added or deleted.
These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy. After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.
This commit implements the aforementioned idea.
For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.
The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
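The commit message above describes two memslot sets that normally share one copy of the memslot data and differ only while a slot is being modified. A rough userspace sketch of that idea follows. It is not the kernel implementation: `struct slot_sketch`, `struct slot_set_sketch`, `struct vm_sketch`, `sketch_change_flags()` and the fixed `SKETCH_NR_SLOTS` table are hypothetical, and a plain pointer table stands in for the rbtree and the reader synchronization the real code relies on.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_NR_SLOTS 4	/* hypothetical fixed number of slot ids */

/* Hypothetical stand-in for the per-memslot data. */
struct slot_sketch {
	unsigned long base_gfn;
	unsigned long npages;
	unsigned int flags;
};

/* A "memslot set" only holds pointers, so two sets can share slot data. */
struct slot_set_sketch {
	struct slot_sketch *slots[SKETCH_NR_SLOTS];
};

struct vm_sketch {
	struct slot_set_sketch sets[2];
	int active;		/* index of the set that readers currently use */
};

/* Change the flags of one slot by copying only that slot. */
static int sketch_change_flags(struct vm_sketch *vm, int id, unsigned int flags)
{
	int inactive = !vm->active;
	struct slot_sketch *old, *copy;

	assert(id >= 0 && id < SKETCH_NR_SLOTS);
	old = vm->sets[vm->active].slots[id];
	if (!old)
		return -1;

	copy = malloc(sizeof(*copy));
	if (!copy)
		return -1;
	*copy = *old;
	copy->flags = flags;

	vm->sets[inactive].slots[id] = copy;	/* sets now differ in one slot */
	vm->active = inactive;			/* switch readers to the updated set */
	vm->sets[!vm->active].slots[id] = copy;	/* resynchronize the other set */
	free(old);	/* real code would first wait for readers of the old slot */
	return 0;
}
```

Only the modified slot is duplicated. Every other entry in both sets keeps pointing at the same shared data throughout the operation.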
2021-12-07 03:54:30 +08:00
|
|
|
* Returns a pointer to the memslot if it contains gfn.
|
2021-08-05 06:28:39 +08:00
|
|
|
* Otherwise returns NULL.
|
|
|
|
*/
|
|
|
|
static inline struct kvm_memory_slot *
|
2021-12-07 03:54:30 +08:00
|
|
|
try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
|
2021-08-05 06:28:39 +08:00
|
|
|
{
|
2021-12-07 03:54:30 +08:00
|
|
|
if (!slot)
|
2021-08-05 06:28:39 +08:00
|
|
|
return NULL;
|
|
|
|
|
|
|
|
if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
|
|
|
|
return slot;
|
|
|
|
else
|
|
|
|
return NULL;
|
|
|
|
}
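The range test in the function above treats a memslot as the half-open interval [base_gfn, base_gfn + npages). A small standalone sketch makes the boundary behaviour concrete. `gfn_sketch_t`, `struct memslot_sketch` and the `*_sketch` functions are hypothetical userspace stand-ins, and the slot values are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_sketch_t;	/* stand-in for the kernel's gfn_t */

/* Hypothetical, trimmed-down stand-in for struct kvm_memory_slot. */
struct memslot_sketch {
	gfn_sketch_t base_gfn;
	unsigned long npages;
};

static struct memslot_sketch *
try_get_memslot_sketch(struct memslot_sketch *slot, gfn_sketch_t gfn)
{
	if (!slot)
		return NULL;

	/* base_gfn is inside the slot; base_gfn + npages is one past the end. */
	if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
		return slot;

	return NULL;
}

/* Walk the edges of a made-up slot covering gfns 0x100..0x10f. */
static void memslot_sketch_boundaries(void)
{
	struct memslot_sketch slot = { .base_gfn = 0x100, .npages = 0x10 };

	assert(try_get_memslot_sketch(&slot, 0x0ff) == NULL);	/* just below */
	assert(try_get_memslot_sketch(&slot, 0x100) == &slot);	/* first gfn */
	assert(try_get_memslot_sketch(&slot, 0x10f) == &slot);	/* last gfn */
	assert(try_get_memslot_sketch(&slot, 0x110) == NULL);	/* one past end */
	assert(try_get_memslot_sketch(NULL, 0x100) == NULL);	/* no slot */
}
```

Because the upper bound is exclusive, two adjacent slots can share a boundary value without any gfn matching both.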
|
|
|
|
|
|
|
|
/*
|
2021-12-07 03:54:30 +08:00
|
|
|
* Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
|
2020-02-19 05:07:31 +08:00
|
|
|
*
|
2021-12-07 03:54:25 +08:00
|
|
|
* With "approx" set returns the memslot also when the address falls
|
|
|
|
* in a hole. In that case one of the memslots bordering the hole is
|
|
|
|
* returned.
|
2012-01-13 04:09:51 +08:00
|
|
|
*/
|
|
|
|
static inline struct kvm_memory_slot *
|
2021-12-07 03:54:30 +08:00
|
|
|
search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
|
2012-01-13 04:09:51 +08:00
|
|
|
{
|
2021-08-05 06:28:39 +08:00
|
|
|
struct kvm_memory_slot *slot;
|
2021-12-07 03:54:30 +08:00
|
|
|
struct rb_node *node;
|
|
|
|
int idx = slots->node_idx;
|
|
|
|
|
|
|
|
slot = NULL;
|
|
|
|
for (node = slots->gfn_tree.rb_node; node; ) {
|
|
|
|
slot = container_of(node, struct kvm_memory_slot, gfn_node[idx]);
|
|
|
|
if (gfn >= slot->base_gfn) {
|
|
|
|
if (gfn < slot->base_gfn + slot->npages)
|
|
|
|
return slot;
|
|
|
|
node = node->rb_right;
|
|
|
|
} else
|
|
|
|
node = node->rb_left;
|
2021-12-07 03:54:25 +08:00
|
|
|
}
|
2012-01-13 04:09:51 +08:00
|
|
|
|
2021-12-07 03:54:30 +08:00
|
|
|
return approx ? slot : NULL;
|
2012-01-13 04:09:51 +08:00
|
|
|
}
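The lookup above can be sketched in user space as follows. This is a minimal illustration, not the kernel code: `demo_slot` and `demo_search` are hypothetical names, and a plain binary search tree with explicit `left`/`right` pointers stands in for the kernel rbtree and `container_of()`. It shows why non-overlapping ranges keyed by `base_gfn` allow a single descent, and how the `approx` flag returns the last slot visited when the gfn falls in a hole.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical stand-in for struct kvm_memory_slot: a plain BST node. */
struct demo_slot {
	gfn_t base_gfn;            /* first guest frame covered by the slot */
	unsigned long npages;      /* number of guest frames covered */
	struct demo_slot *left;    /* slots with a lower base_gfn */
	struct demo_slot *right;   /* slots with a higher base_gfn */
};

/*
 * Descend the tree keyed by base_gfn. Because slot ranges never overlap,
 * comparing gfn against one slot decides which subtree can contain it.
 */
static struct demo_slot *demo_search(struct demo_slot *root, gfn_t gfn,
				     bool approx)
{
	struct demo_slot *slot = NULL;
	struct demo_slot *node = root;

	while (node) {
		slot = node;
		if (gfn >= slot->base_gfn) {
			if (gfn < slot->base_gfn + slot->npages)
				return slot;   /* gfn lies inside this slot */
			node = node->right;
		} else {
			node = node->left;
		}
	}

	/* Miss: with approx, return the last slot visited (it borders the hole). */
	return approx ? slot : NULL;
}
```

With slots covering gfns 0..15, 100..109 and 200..207, a lookup of gfn 50 returns NULL in exact mode and the 0..15 slot in approximate mode, since that is the last node the descent touches.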
|
|
|
|
|
|
|
|
static inline struct kvm_memory_slot *
|
2021-12-07 03:54:25 +08:00
|
|
|
____gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn, bool approx)
|
2012-01-13 04:09:51 +08:00
|
|
|
{
|
2021-08-05 06:28:39 +08:00
|
|
|
struct kvm_memory_slot *slot;
|
|
|
|
|
2021-12-07 03:54:30 +08:00
|
|
|
slot = (struct kvm_memory_slot *)atomic_long_read(&slots->last_used_slot);
|
|
|
|
slot = try_get_memslot(slot, gfn);
|
2021-08-05 06:28:39 +08:00
|
|
|
if (slot)
|
|
|
|
return slot;
|
|
|
|
|
2021-12-07 03:54:30 +08:00
|
|
|
slot = search_memslots(slots, gfn, approx);
|
2021-08-05 06:28:39 +08:00
|
|
|
if (slot) {
|
2021-12-07 03:54:30 +08:00
|
|
|
atomic_long_set(&slots->last_used_slot, (unsigned long)slot);
|
2021-08-05 06:28:39 +08:00
|
|
|
return slot;
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
2012-01-13 04:09:51 +08:00
|
|
|
}
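The "last used slot" fast path can be illustrated with a small user-space sketch. Everything here is an assumption made for the example: `demo_slots`, `demo_lookup` and the `slow_lookups` counter are hypothetical, a linear scan replaces the tree search, and C11 relaxed atomics stand in for the kernel's `atomic_long_read()`/`atomic_long_set()`. The point is only the shape of the logic: check the cached slot first, fall back to the full search, and update the cache on a hit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

struct demo_slot {
	gfn_t base_gfn;
	unsigned long npages;
};

/* Hypothetical slot set: a flat array plus a one-entry "last used" cache. */
struct demo_slots {
	struct demo_slot *slots;
	size_t count;
	_Atomic(struct demo_slot *) last_used;
	unsigned long slow_lookups;   /* instrumentation for this example only */
};

static void demo_slots_init(struct demo_slots *set, struct demo_slot *slots,
			    size_t count)
{
	set->slots = slots;
	set->count = count;
	set->slow_lookups = 0;
	atomic_init(&set->last_used, NULL);
}

static int demo_contains(const struct demo_slot *s, gfn_t gfn)
{
	return s && gfn >= s->base_gfn && gfn < s->base_gfn + s->npages;
}

static struct demo_slot *demo_lookup(struct demo_slots *set, gfn_t gfn)
{
	struct demo_slot *slot;
	size_t i;

	/* Fast path: the slot that satisfied the previous lookup. */
	slot = atomic_load_explicit(&set->last_used, memory_order_relaxed);
	if (demo_contains(slot, gfn))
		return slot;

	/* Slow path: full search (a linear scan in this sketch). */
	set->slow_lookups++;
	for (i = 0; i < set->count; i++) {
		slot = &set->slots[i];
		if (demo_contains(slot, gfn)) {
			atomic_store_explicit(&set->last_used, slot,
					      memory_order_relaxed);
			return slot;
		}
	}
	return NULL;   /* a miss leaves the cache untouched */
}
```

Consecutive lookups that land in the same slot take the fast path, which is the access pattern the cache is meant to serve.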
|
|
|
|
|
2021-12-07 03:54:25 +08:00
|
|
|
/*
|
|
|
|
* __gfn_to_memslot() and its descendants are here to allow arch code to inline
|
|
|
|
* the lookups in hot paths. gfn_to_memslot() itself isn't here as an inline
|
|
|
|
* because that would bloat other code too much.
|
|
|
|
*/
|
|
|
|
static inline struct kvm_memory_slot *
|
|
|
|
__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
|
|
|
|
{
|
|
|
|
return ____gfn_to_memslot(slots, gfn, false);
|
|
|
|
}
|
|
|
|
|
2012-08-24 16:50:28 +08:00
|
|
|
static inline unsigned long
|
2021-04-02 07:37:24 +08:00
|
|
|
__gfn_to_hva_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
|
2012-08-24 16:50:28 +08:00
|
|
|
{
|
kvm: avoid speculation-based attacks from out-of-range memslot accesses
KVM's mechanism for accessing guest memory translates a guest physical
address (gpa) to a host virtual address using the right-shifted gpa
(also known as gfn) and a struct kvm_memory_slot. The translation is
performed in __gfn_to_hva_memslot using the following formula:
hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE
It is expected that gfn falls within the boundaries of the guest's
physical memory. However, a guest can access invalid physical addresses
in such a way that the gfn is invalid.
__gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first
retrieves a memslot through __gfn_to_memslot. While __gfn_to_memslot
does check whether the gfn falls within the boundaries of the guest's
physical memory, a CPU can speculate the result of the check and
continue execution speculatively using an illegal gfn. The speculation
can result in calculating an out-of-bounds hva. If the resulting host
virtual address is used to load another guest physical address, this
is effectively a Spectre gadget consisting of two consecutive reads,
the second of which is data dependent on the first.
Right now it's not clear if there are any cases in which this is
exploitable. One interesting case was reported by the original author
of this patch, and involves visiting guest page tables on x86. Right
now these are not vulnerable because the hva read goes through get_user(),
which contains an LFENCE speculation barrier. However, there are
patches in progress for x86 uaccess.h to mask kernel addresses instead of
using LFENCE; once these land, a guest could use speculation to read
from the VMM's ring 3 address space. Other architectures such as ARM
already use the address masking method, and would be susceptible to
this same kind of data-dependent access gadget. Therefore, this patch
proactively protects from these attacks by masking out-of-bounds gfns
in __gfn_to_hva_memslot, which blocks speculation of invalid hvas.
Sean Christopherson noted that this patch does not cover
kvm_read_guest_offset_cached. This however is limited to a few bytes
past the end of the cache, and therefore it is unlikely to be useful in
the context of building a chain of data dependent accesses.
Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-09 03:31:42 +08:00
|
|
|
/*
|
|
|
|
* The index was checked originally in search_memslots. To avoid
|
|
|
|
* that a malicious guest builds a Spectre gadget out of e.g. page
|
|
|
|
* table walks, do not let the processor speculate loads outside
|
|
|
|
* the guest's registered memslots.
|
|
|
|
*/
|
2021-06-09 13:49:13 +08:00
|
|
|
unsigned long offset = gfn - slot->base_gfn;
|
|
|
|
offset = array_index_nospec(offset, slot->npages);
|
2021-06-09 03:31:42 +08:00
|
|
|
return slot->userspace_addr + offset * PAGE_SIZE;
|
2012-08-24 16:50:28 +08:00
|
|
|
}
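The translation formula and the index clamp can be worked through in user space. The sketch below is an illustration under stated assumptions: `demo_slot`, `demo_index_mask` and `demo_gfn_to_hva` are hypothetical names, the page size is assumed to be 4 KiB, and the branchless mask only imitates the arithmetic idea behind `array_index_nospec()`. A user-space compiler is free to turn it back into a branch, so it carries none of the kernel helper's speculation guarantees.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

typedef uint64_t gfn_t;

struct demo_slot {
	gfn_t base_gfn;                 /* first guest frame in the slot */
	unsigned long npages;           /* slot length in pages */
	unsigned long userspace_addr;   /* host virtual address of page 0 */
};

/*
 * All-ones when index < size, zero otherwise, computed without a branch.
 * Assumes size <= LONG_MAX so the top bit can serve as an underflow flag.
 */
static unsigned long demo_index_mask(unsigned long index, unsigned long size)
{
	unsigned long top = sizeof(unsigned long) * CHAR_BIT - 1;
	unsigned long in_range = ~(index | (size - 1UL - index)) >> top;

	return 0UL - in_range;
}

/* hva = userspace_addr + (gfn - base_gfn) * PAGE_SIZE, with a clamped offset. */
static unsigned long demo_gfn_to_hva(const struct demo_slot *slot, gfn_t gfn)
{
	unsigned long offset = (unsigned long)(gfn - slot->base_gfn);

	assert(slot->npages > 0);
	offset &= demo_index_mask(offset, slot->npages);   /* out of range -> 0 */
	return slot->userspace_addr + offset * DEMO_PAGE_SIZE;
}
```

For a slot with `base_gfn = 0x100`, four pages and `userspace_addr = 0x10000000`, gfn `0x102` maps to `0x10002000`, while an out-of-range gfn such as `0x104` or `0xff` is forced to offset 0 and therefore stays inside the slot's mapping.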
|
|
|
|
|
2011-03-09 15:41:59 +08:00
|
|
|
static inline int memslot_id(struct kvm *kvm, gfn_t gfn)
|
|
|
|
{
|
|
|
|
return gfn_to_memslot(kvm, gfn)->id;
|
|
|
|
}
|
|
|
|
|
2012-07-02 16:54:30 +08:00
|
|
|
static inline gfn_t
|
|
|
|
hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
|
2010-08-22 19:10:28 +08:00
|
|
|
{
|
2012-07-02 16:54:30 +08:00
|
|
|
gfn_t gfn_offset = (hva - slot->userspace_addr) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
return slot->base_gfn + gfn_offset;
|
2010-08-22 19:10:28 +08:00
|
|
|
}
|
|
|
|
|
2007-11-21 20:44:45 +08:00
|
|
|
static inline gpa_t gfn_to_gpa(gfn_t gfn)
|
|
|
|
{
|
|
|
|
return (gpa_t)gfn << PAGE_SHIFT;
|
|
|
|
}
|
2006-12-10 18:21:36 +08:00
|
|
|
|
2010-09-10 23:30:48 +08:00
|
|
|
static inline gfn_t gpa_to_gfn(gpa_t gpa)
|
|
|
|
{
|
|
|
|
return (gfn_t)(gpa >> PAGE_SHIFT);
|
|
|
|
}
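As a worked example of the two shift helpers above, the sketch below assumes `PAGE_SHIFT` is 12 (4 KiB pages) and uses hypothetical `demo_` names and 64-bit typedefs. It shows that converting a gpa to a gfn discards the in-page offset, so the round trip gpa -> gfn -> gpa yields the page-aligned address rather than the original one.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

typedef uint64_t gfn_t;   /* guest frame number */
typedef uint64_t gpa_t;   /* guest physical address */

static gpa_t demo_gfn_to_gpa(gfn_t gfn)
{
	return (gpa_t)gfn << DEMO_PAGE_SHIFT;
}

static gfn_t demo_gpa_to_gfn(gpa_t gpa)
{
	return (gfn_t)(gpa >> DEMO_PAGE_SHIFT);
}

/* Byte offset of gpa within its page; lost by demo_gpa_to_gfn(). */
static uint64_t demo_page_offset(gpa_t gpa)
{
	return gpa & ((UINT64_C(1) << DEMO_PAGE_SHIFT) - 1);
}
```

For gpa `0x12345678` the gfn is `0x12345`, the page offset is `0x678`, and converting the gfn back gives `0x12345000`.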
|
|
|
|
|
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 08:56:11 +08:00
|
|
|
static inline hpa_t pfn_to_hpa(kvm_pfn_t pfn)
|
2008-09-14 08:48:28 +08:00
|
|
|
{
|
|
|
|
return (hpa_t)pfn << PAGE_SHIFT;
|
|
|
|
}
|
|
|
|
|
2014-01-01 23:09:21 +08:00
|
|
|
static inline bool kvm_is_error_gpa(struct kvm *kvm, gpa_t gpa)
|
|
|
|
{
|
|
|
|
unsigned long hva = gfn_to_hva(kvm, gpa_to_gfn(gpa));
|
|
|
|
|
|
|
|
return kvm_is_error_hva(hva);
|
|
|
|
}
|
|
|
|
|
2007-11-18 22:24:12 +08:00
|
|
|
enum kvm_stat_kind {
|
|
|
|
KVM_STAT_VM,
|
|
|
|
KVM_STAT_VCPU,
|
|
|
|
};
|
|
|
|
|
2016-05-18 19:26:23 +08:00
|
|
|
struct kvm_stat_data {
|
|
|
|
struct kvm *kvm;
|
2021-06-24 05:28:46 +08:00
|
|
|
const struct _kvm_stats_desc *desc;
|
2007-11-18 22:24:12 +08:00
|
|
|
enum kvm_stat_kind kind;
|
2007-11-01 06:24:23 +08:00
|
|
|
};
|
2019-12-13 21:07:21 +08:00
|
|
|
|
2021-06-19 06:27:04 +08:00
|
|
|
struct _kvm_stats_desc {
|
|
|
|
struct kvm_stats_desc desc;
|
|
|
|
char name[KVM_STATS_NAME_SIZE];
|
|
|
|
};
|
|
|
|
|
2021-08-03 00:56:29 +08:00
|
|
|
#define STATS_DESC_COMMON(type, unit, base, exp, sz, bsz) \
|
2021-06-19 06:27:04 +08:00
|
|
|
.flags = type | unit | base | \
|
|
|
|
BUILD_BUG_ON_ZERO(type & ~KVM_STATS_TYPE_MASK) | \
|
|
|
|
BUILD_BUG_ON_ZERO(unit & ~KVM_STATS_UNIT_MASK) | \
|
|
|
|
BUILD_BUG_ON_ZERO(base & ~KVM_STATS_BASE_MASK), \
|
|
|
|
.exponent = exp, \
|
2021-08-03 00:56:29 +08:00
|
|
|
.size = sz, \
|
|
|
|
.bucket_size = bsz
|
2021-06-19 06:27:04 +08:00
|
|
|
|
2021-08-03 00:56:29 +08:00
|
|
|
#define VM_GENERIC_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \
|
2021-06-19 06:27:04 +08:00
|
|
|
{ \
|
|
|
|
{ \
|
2021-08-03 00:56:29 +08:00
|
|
|
STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \
|
2021-06-19 06:27:04 +08:00
|
|
|
.offset = offsetof(struct kvm_vm_stat, generic.stat) \
|
|
|
|
}, \
|
|
|
|
.name = #stat, \
|
|
|
|
}
|
2021-08-03 00:56:29 +08:00
|
|
|
#define VCPU_GENERIC_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \
|
2021-06-19 06:27:04 +08:00
|
|
|
{ \
|
|
|
|
{ \
|
2021-08-03 00:56:29 +08:00
|
|
|
STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \
|
2021-06-19 06:27:04 +08:00
|
|
|
.offset = offsetof(struct kvm_vcpu_stat, generic.stat) \
|
|
|
|
}, \
|
|
|
|
.name = #stat, \
|
|
|
|
}
|
2021-08-03 00:56:29 +08:00
|
|
|
#define VM_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \
|
2021-06-19 06:27:04 +08:00
|
|
|
{ \
|
|
|
|
{ \
|
2021-08-03 00:56:29 +08:00
|
|
|
STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \
|
2021-06-19 06:27:04 +08:00
|
|
|
.offset = offsetof(struct kvm_vm_stat, stat) \
|
|
|
|
}, \
|
|
|
|
.name = #stat, \
|
|
|
|
}
|
2021-08-03 00:56:29 +08:00
|
|
|
#define VCPU_STATS_DESC(stat, type, unit, base, exp, sz, bsz) \
|
2021-06-19 06:27:04 +08:00
|
|
|
{ \
|
|
|
|
{ \
|
2021-08-03 00:56:29 +08:00
|
|
|
STATS_DESC_COMMON(type, unit, base, exp, sz, bsz), \
|
2021-06-19 06:27:04 +08:00
|
|
|
.offset = offsetof(struct kvm_vcpu_stat, stat) \
|
|
|
|
}, \
|
|
|
|
.name = #stat, \
|
|
|
|
}
|
|
|
|
/* SCOPE: VM, VM_GENERIC, VCPU, VCPU_GENERIC */
|
2021-08-03 00:56:29 +08:00
|
|
|
#define STATS_DESC(SCOPE, stat, type, unit, base, exp, sz, bsz) \
|
|
|
|
SCOPE##_STATS_DESC(stat, type, unit, base, exp, sz, bsz)
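The `SCOPE##_STATS_DESC` dispatch relies on preprocessor token pasting plus `offsetof()` and stringification. The sketch below reproduces that pattern in a trimmed-down form. All names are hypothetical (`demo_desc`, `demo_vm_stat`, `demo_vcpu_stat`, the `DEMO_` macros and the stat field names), and the descriptor carries only three fields instead of the real layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the real stats structures. */
struct demo_vm_stat { unsigned long remote_tlb_flush; };
struct demo_vcpu_stat { unsigned long exits; unsigned long halts; };

struct demo_desc {
	unsigned int flags;   /* type bits in the real header; opaque here */
	size_t offset;        /* where the value lives inside the stats struct */
	const char *name;     /* stringified field name */
};

#define DEMO_VM_STATS_DESC(stat, type) \
	{ .flags = (type), .offset = offsetof(struct demo_vm_stat, stat), .name = #stat }
#define DEMO_VCPU_STATS_DESC(stat, type) \
	{ .flags = (type), .offset = offsetof(struct demo_vcpu_stat, stat), .name = #stat }

/* SCOPE is pasted between prefix and suffix, selecting a macro above. */
#define DEMO_STATS_DESC(SCOPE, stat, type) DEMO_##SCOPE##_STATS_DESC(stat, type)

static const struct demo_desc demo_vm_descs[] = {
	DEMO_STATS_DESC(VM, remote_tlb_flush, 2u),
};

static const struct demo_desc demo_vcpu_descs[] = {
	DEMO_STATS_DESC(VCPU, exits, 1u),
	DEMO_STATS_DESC(VCPU, halts, 1u),
};

/* Generic reader: the stored offset locates the value without naming the field. */
static unsigned long demo_read(const void *stats, const struct demo_desc *d)
{
	assert(stats && d);
	return *(const unsigned long *)((const char *)stats + d->offset);
}
```

Storing the offset and name in a table lets one generic routine walk every statistic, which is presumably why the header builds descriptors this way rather than writing per-field accessors.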
|
2021-06-19 06:27:04 +08:00
|
|
|
|
|
|
|
#define STATS_DESC_CUMULATIVE(SCOPE, name, unit, base, exponent) \
|
2021-08-03 00:56:29 +08:00
|
|
|
STATS_DESC(SCOPE, name, KVM_STATS_TYPE_CUMULATIVE, \
|
|
|
|
unit, base, exponent, 1, 0)
|
2021-06-19 06:27:04 +08:00
|
|
|
#define STATS_DESC_INSTANT(SCOPE, name, unit, base, exponent) \
|
2021-08-03 00:56:29 +08:00
|
|
|
STATS_DESC(SCOPE, name, KVM_STATS_TYPE_INSTANT, \
|
|
|
|
unit, base, exponent, 1, 0)
|
2021-06-19 06:27:04 +08:00
|
|
|
#define STATS_DESC_PEAK(SCOPE, name, unit, base, exponent) \
|
2021-08-03 00:56:29 +08:00
|
|
|
STATS_DESC(SCOPE, name, KVM_STATS_TYPE_PEAK, \
|
|
|
|
unit, base, exponent, 1, 0)
|
|
|
|
#define STATS_DESC_LINEAR_HIST(SCOPE, name, unit, base, exponent, sz, bsz) \
|
|
|
|
STATS_DESC(SCOPE, name, KVM_STATS_TYPE_LINEAR_HIST, \
|
|
|
|
unit, base, exponent, sz, bsz)
|
|
|
|
#define STATS_DESC_LOG_HIST(SCOPE, name, unit, base, exponent, sz) \
|
|
|
|
STATS_DESC(SCOPE, name, KVM_STATS_TYPE_LOG_HIST, \
|
|
|
|
unit, base, exponent, sz, 0)
|
2021-06-19 06:27:04 +08:00
|
|
|
|
|
|
|
/* Cumulative counter, read/write */
|
|
|
|
#define STATS_DESC_COUNTER(SCOPE, name) \
|
|
|
|
STATS_DESC_CUMULATIVE(SCOPE, name, KVM_STATS_UNIT_NONE, \
|
|
|
|
KVM_STATS_BASE_POW10, 0)
|
|
|
|
/* Instantaneous counter, read only */
|
|
|
|
#define STATS_DESC_ICOUNTER(SCOPE, name) \
|
|
|
|
STATS_DESC_INSTANT(SCOPE, name, KVM_STATS_UNIT_NONE, \
|
|
|
|
KVM_STATS_BASE_POW10, 0)
|
|
|
|
/* Peak counter, read/write */
|
|
|
|
#define STATS_DESC_PCOUNTER(SCOPE, name) \
|
|
|
|
STATS_DESC_PEAK(SCOPE, name, KVM_STATS_UNIT_NONE, \
|
|
|
|
KVM_STATS_BASE_POW10, 0)
|
|
|
|
|
2022-07-14 19:27:31 +08:00
|
|
|
/* Instantaneous boolean value, read only */
|
|
|
|
#define STATS_DESC_IBOOLEAN(SCOPE, name) \
|
|
|
|
STATS_DESC_INSTANT(SCOPE, name, KVM_STATS_UNIT_BOOLEAN, \
|
|
|
|
KVM_STATS_BASE_POW10, 0)
|
|
|
|
/* Peak (sticky) boolean value, read/write */
|
|
|
|
#define STATS_DESC_PBOOLEAN(SCOPE, name) \
|
|
|
|
STATS_DESC_PEAK(SCOPE, name, KVM_STATS_UNIT_BOOLEAN, \
|
|
|
|
KVM_STATS_BASE_POW10, 0)
|
|
|
|
|
2021-06-19 06:27:04 +08:00
|
|
|
/* Cumulative time in nanoseconds */
|
|
|
|
#define STATS_DESC_TIME_NSEC(SCOPE, name) \
|
|
|
|
STATS_DESC_CUMULATIVE(SCOPE, name, KVM_STATS_UNIT_SECONDS, \
|
|
|
|
KVM_STATS_BASE_POW10, -9)
|
2021-08-03 00:56:29 +08:00
|
|
|
/* Linear histogram for time in nanoseconds */
|
|
|
|
#define STATS_DESC_LINHIST_TIME_NSEC(SCOPE, name, sz, bsz) \
|
|
|
|
STATS_DESC_LINEAR_HIST(SCOPE, name, KVM_STATS_UNIT_SECONDS, \
|
|
|
|
KVM_STATS_BASE_POW10, -9, sz, bsz)
|
|
|
|
/* Logarithmic histogram for time in nanoseconds */
|
|
|
|
#define STATS_DESC_LOGHIST_TIME_NSEC(SCOPE, name, sz) \
|
|
|
|
STATS_DESC_LOG_HIST(SCOPE, name, KVM_STATS_UNIT_SECONDS, \
|
|
|
|
KVM_STATS_BASE_POW10, -9, sz)
|
2021-06-19 06:27:04 +08:00
|
|
|
|
2021-06-19 06:27:05 +08:00
|
|
|
#define KVM_GENERIC_VM_STATS() \
|
2021-08-17 08:26:39 +08:00
|
|
|
STATS_DESC_COUNTER(VM_GENERIC, remote_tlb_flush), \
|
|
|
|
STATS_DESC_COUNTER(VM_GENERIC, remote_tlb_flush_requests)
|
2021-06-19 06:27:05 +08:00
|
|
|
|
2021-06-19 06:27:06 +08:00
|
|
|
#define KVM_GENERIC_VCPU_STATS() \
|
|
|
|
STATS_DESC_COUNTER(VCPU_GENERIC, halt_successful_poll), \
|
|
|
|
STATS_DESC_COUNTER(VCPU_GENERIC, halt_attempted_poll), \
|
|
|
|
STATS_DESC_COUNTER(VCPU_GENERIC, halt_poll_invalid), \
|
|
|
|
STATS_DESC_COUNTER(VCPU_GENERIC, halt_wakeup), \
|
|
|
|
STATS_DESC_TIME_NSEC(VCPU_GENERIC, halt_poll_success_ns), \
|
2021-08-03 00:56:32 +08:00
|
|
|
STATS_DESC_TIME_NSEC(VCPU_GENERIC, halt_poll_fail_ns), \
|
2021-08-03 00:56:33 +08:00
|
|
|
STATS_DESC_TIME_NSEC(VCPU_GENERIC, halt_wait_ns), \
|
|
|
|
STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_poll_success_hist, \
|
|
|
|
HALT_POLL_HIST_COUNT), \
|
|
|
|
STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_poll_fail_hist, \
|
|
|
|
HALT_POLL_HIST_COUNT), \
|
|
|
|
STATS_DESC_LOGHIST_TIME_NSEC(VCPU_GENERIC, halt_wait_hist, \
|
2021-10-09 10:12:08 +08:00
|
|
|
HALT_POLL_HIST_COUNT), \
|
2022-07-14 19:27:31 +08:00
|
|
|
STATS_DESC_IBOOLEAN(VCPU_GENERIC, blocking)
|
2021-06-19 06:27:06 +08:00
|
|
|
|
2008-04-16 05:05:42 +08:00
|
|
|
extern struct dentry *kvm_debugfs_dir;
|
2021-08-03 00:56:29 +08:00
|
|
|
|
2021-06-19 06:27:04 +08:00
|
|
|
ssize_t kvm_stats_read(char *id, const struct kvm_stats_header *header,
|
|
|
|
const struct _kvm_stats_desc *desc,
|
|
|
|
void *stats, size_t size_stats,
|
|
|
|
char __user *user_buffer, size_t size, loff_t *offset);
|
2021-08-03 00:56:29 +08:00
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_stats_linear_hist_update() - Update bucket value for linear histogram
|
|
|
|
* statistics data.
|
|
|
|
*
|
|
|
|
* @data: start address of the stats data
|
|
|
|
* @size: the number of buckets in the stats data
|
|
|
|
* @value: the new value used to update the linear histogram's bucket
|
|
|
|
* @bucket_size: the size (width) of a bucket
|
|
|
|
*/
|
|
|
|
static inline void kvm_stats_linear_hist_update(u64 *data, size_t size,
|
|
|
|
u64 value, size_t bucket_size)
|
|
|
|
{
|
|
|
|
size_t index = div64_u64(value, bucket_size);
|
|
|
|
|
|
|
|
index = min(index, size - 1);
|
|
|
|
++data[index];
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_stats_log_hist_update() - Update bucket value for logarithmic histogram
|
|
|
|
* statistics data.
|
|
|
|
*
|
|
|
|
* @data: start address of the stats data
|
|
|
|
* @size: the number of buckets in the stats data
|
|
|
|
* @value: the new value used to update the logarithmic histogram's bucket
|
|
|
|
*/
|
|
|
|
static inline void kvm_stats_log_hist_update(u64 *data, size_t size, u64 value)
|
|
|
|
{
|
|
|
|
size_t index = fls64(value);
|
|
|
|
|
|
|
|
index = min(index, size - 1);
|
|
|
|
++data[index];
|
|
|
|
}
|
|
|
|
|
|
|
|
#define KVM_STATS_LINEAR_HIST_UPDATE(array, value, bsize) \
|
|
|
|
kvm_stats_linear_hist_update(array, ARRAY_SIZE(array), value, bsize)
|
|
|
|
#define KVM_STATS_LOG_HIST_UPDATE(array, value) \
|
|
|
|
kvm_stats_log_hist_update(array, ARRAY_SIZE(array), value)
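The two histogram helpers above differ only in how a value is mapped to a bucket index: integer division by the bucket width for the linear case, and the position of the highest set bit for the logarithmic case. Both clamp out-of-range values into the last bucket. The following is a userspace sketch of that bucketing, assuming `size >= 1`. The `sketch_*` names are hypothetical, and `sketch_fls64()` is a portable loop standing in for the kernel's `fls64()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in for fls64(): 1-based position of the most significant set
 * bit, or 0 when value is 0.
 */
static unsigned int sketch_fls64(uint64_t value)
{
	unsigned int pos = 0;

	while (value) {
		value >>= 1;
		pos++;
	}
	return pos;
}

/* Linear: bucket i counts values in [i * width, (i + 1) * width). */
static void sketch_linear_hist_update(uint64_t *data, size_t size,
				      uint64_t value, size_t bucket_size)
{
	size_t index = (size_t)(value / bucket_size);

	if (index > size - 1)
		index = size - 1;	/* overflow lands in the last bucket */
	++data[index];
}

/* Log: bucket 0 counts 0; bucket i (i >= 1) counts [2^(i-1), 2^i). */
static void sketch_log_hist_update(uint64_t *data, size_t size, uint64_t value)
{
	size_t index = sketch_fls64(value);

	if (index > size - 1)
		index = size - 1;	/* overflow lands in the last bucket */
	++data[index];
}
```

With four buckets of width 10, the values 0, 25 and 1000 land in linear buckets 0, 2 and 3. With four log buckets, the values 0, 1, 3 and 1024 land in buckets 0, 1, 2 and 3.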
|
|
|
|
|
|
|
|
|
2021-06-19 06:27:05 +08:00
|
|
|
extern const struct kvm_stats_header kvm_vm_stats_header;
|
|
|
|
extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
|
2021-06-19 06:27:06 +08:00
|
|
|
extern const struct kvm_stats_header kvm_vcpu_stats_header;
|
|
|
|
extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
|
2008-04-10 20:47:53 +08:00
|
|
|
|
2012-06-16 03:07:24 +08:00
|
|
|
#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
|
2022-08-16 20:53:22 +08:00
|
|
|
static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
|
2008-07-25 22:24:52 +08:00
|
|
|
{
|
2022-08-16 20:53:22 +08:00
|
|
|
if (unlikely(kvm->mmu_invalidate_in_progress))
|
2008-07-25 22:24:52 +08:00
|
|
|
return 1;
|
|
|
|
/*
|
2022-08-16 20:53:22 +08:00
|
|
|
* Ensure the read of mmu_invalidate_in_progress happens before
|
|
|
|
* the read of mmu_invalidate_seq. This interacts with the
|
|
|
|
* smp_wmb() in mmu_notifier_invalidate_range_end to make sure
|
|
|
|
* that the caller either sees the old (non-zero) value of
|
|
|
|
* mmu_invalidate_in_progress or the new (incremented) value of
|
|
|
|
* mmu_invalidate_seq.
|
|
|
|
*
|
|
|
|
* PowerPC Book3s HV KVM calls this under a per-page lock rather
|
|
|
|
* than under kvm->mmu_lock, for scalability, so can't rely on
|
|
|
|
* kvm->mmu_lock to keep things ordered.
|
2008-07-25 22:24:52 +08:00
|
|
|
*/
|
2011-12-12 20:37:21 +08:00
|
|
|
smp_rmb();
|
2022-08-16 20:53:22 +08:00
|
|
|
if (kvm->mmu_invalidate_seq != mmu_seq)
|
2008-07-25 22:24:52 +08:00
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
2021-02-22 10:45:22 +08:00
|
|
|
|
2022-08-16 20:53:22 +08:00
|
|
|
static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
|
|
|
|
unsigned long mmu_seq,
|
|
|
|
unsigned long hva)
|
2021-02-22 10:45:22 +08:00
|
|
|
{
|
|
|
|
lockdep_assert_held(&kvm->mmu_lock);
|
|
|
|
/*
|
2022-08-16 20:53:22 +08:00
|
|
|
* If mmu_invalidate_in_progress is non-zero, then the range maintained
|
|
|
|
* by kvm_mmu_notifier_invalidate_range_start contains all addresses
|
|
|
|
* that might be being invalidated. Note that it may include some false
|
2021-02-22 10:45:22 +08:00
|
|
|
* positives, due to shortcuts when handling concurrent invalidations.
|
|
|
|
*/
|
2022-08-16 20:53:22 +08:00
|
|
|
if (unlikely(kvm->mmu_invalidate_in_progress) &&
|
|
|
|
hva >= kvm->mmu_invalidate_range_start &&
|
|
|
|
hva < kvm->mmu_invalidate_range_end)
|
2021-02-22 10:45:22 +08:00
|
|
|
return 1;
|
2022-08-16 20:53:22 +08:00
|
|
|
if (kvm->mmu_invalidate_seq != mmu_seq)
|
2021-02-22 10:45:22 +08:00
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
2008-07-25 22:24:52 +08:00
|
|
|
#endif
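The retry helpers above implement a sequence-count check: a fault handler snapshots the sequence number, does its slow work, and then asks whether an invalidation is in flight or has completed since the snapshot. The sketch below models that protocol in userspace with C11 atomics. The `sketch_*` names are hypothetical, the fences only approximate `smp_rmb()`/`smp_wmb()`, and the writer-side ordering (bump the sequence, then drop the in-progress count) is inferred from the comment in `mmu_invalidate_retry()` rather than copied from the notifier code.

```c
#include <assert.h>
#include <stdatomic.h>

struct sketch_kvm {
	atomic_long in_progress;	/* invalidations currently under way */
	atomic_ulong seq;		/* bumped each time one finishes */
};

/* Reader: snapshot the sequence before doing the slow lookup. */
static unsigned long sketch_read_seq(struct sketch_kvm *kvm)
{
	return atomic_load_explicit(&kvm->seq, memory_order_acquire);
}

static void sketch_invalidate_begin(struct sketch_kvm *kvm)
{
	atomic_fetch_add_explicit(&kvm->in_progress, 1, memory_order_relaxed);
}

static void sketch_invalidate_end(struct sketch_kvm *kvm)
{
	atomic_fetch_add_explicit(&kvm->seq, 1, memory_order_relaxed);
	/* Stands in for smp_wmb(): publish seq before dropping the count. */
	atomic_thread_fence(memory_order_release);
	atomic_fetch_sub_explicit(&kvm->in_progress, 1, memory_order_relaxed);
}

/* Returns 1 when the caller must discard its work and retry. */
static int sketch_invalidate_retry(struct sketch_kvm *kvm, unsigned long seq)
{
	if (atomic_load_explicit(&kvm->in_progress, memory_order_relaxed))
		return 1;
	/* Stands in for smp_rmb(): read in_progress before seq. */
	atomic_thread_fence(memory_order_acquire);
	if (atomic_load_explicit(&kvm->seq, memory_order_relaxed) != seq)
		return 1;
	return 0;
}
```

The point of the ordering is that a reader either still observes a non-zero in-progress count or already observes the incremented sequence number, so a completed invalidation cannot be missed by both checks.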
|
|
|
|
|
2013-04-17 19:29:30 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
|
2008-11-19 19:58:46 +08:00
|
|
|
|
2018-04-27 08:55:03 +08:00
|
|
|
#define KVM_MAX_IRQ_ROUTES 4096 /* might need extension/rework in the future */
|
2008-11-19 19:58:46 +08:00
|
|
|
|
2017-04-28 23:06:20 +08:00
|
|
|
bool kvm_arch_can_set_irq_routing(struct kvm *kvm);
|
2008-11-19 19:58:46 +08:00
|
|
|
int kvm_set_irq_routing(struct kvm *kvm,
|
|
|
|
const struct kvm_irq_routing_entry *entries,
|
|
|
|
unsigned nr,
|
|
|
|
unsigned flags);
|
2016-07-13 04:09:26 +08:00
|
|
|
int kvm_set_routing_entry(struct kvm *kvm,
|
|
|
|
struct kvm_kernel_irq_routing_entry *e,
|
2013-04-16 05:23:21 +08:00
|
|
|
const struct kvm_irq_routing_entry *ue);
|
2008-11-19 19:58:46 +08:00
|
|
|
void kvm_free_irq_routing(struct kvm *kvm);
|
|
|
|
|
|
|
|
#else
|
|
|
|
|
|
|
|
static inline void kvm_free_irq_routing(struct kvm *kvm) {}
|
|
|
|
|
|
|
|
#endif
|
|
|
|
|
2014-06-30 18:51:13 +08:00
|
|
|
int kvm_send_userspace_msi(struct kvm *kvm, struct kvm_msi *msi);
|
|
|
|
|
2009-05-20 22:30:49 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_EVENTFD
|
|
|
|
|
KVM: add ioeventfd support
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.
Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.
However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to simply get our notification transmitted asynchronously and
return as quickly as possible. All the synchronous infrastructure to ensure
proper data-dependencies are met in the normal IO case is just unnecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.
Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and peripheral components.
To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().
We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl(). The other is direct via
ioeventfd.
You can download this test harness here:
ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
The measured results are as follows:
qemu-mmio: 110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio: 367300 iops, 2.72us rtt
I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
and -350ns for HC, we get:
qemu-pio: 153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt
these are just for fun, for now, until I can gather more data.
Here is a graph for your convenience:
http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
The conclusion to draw is that we save about 4us by skipping the userspace
hop.
--------------------
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-07-08 05:08:49 +08:00
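The commit message above describes signalling an eventfd as the lightweight notification primitive. For readers unfamiliar with it, here is a small Linux userspace sketch of the eventfd counter semantics using eventfd(2), write(2) and read(2). The `sketch_ring()` and `sketch_drain()` helpers are hypothetical names, and the sketch does not register anything with KVM; it only shows how signals accumulate in the counter and how a reader collects them.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Signal the eventfd once by adding 1 to its counter. */
static int sketch_ring(int efd)
{
	uint64_t one = 1;

	if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
		return -1;
	return 0;
}

/*
 * Collect all pending signals. In the default (non-semaphore) mode a
 * read returns the accumulated count and resets the counter to zero.
 * Returns -1 when nothing is pending on a non-blocking descriptor.
 */
static int64_t sketch_drain(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (int64_t)count;
}
```

A caller would create the descriptor with `eventfd(0, EFD_NONBLOCK)`, ring it any number of times, and drain it once to learn how many signals arrived in between.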
|
|
|
void kvm_eventfd_init(struct kvm *kvm);
|
2012-10-09 06:22:59 +08:00
|
|
|
int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
|
|
|
|
|
2014-06-30 18:51:13 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQFD
|
2012-06-29 23:56:08 +08:00
|
|
|
int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
|
2009-05-20 22:30:49 +08:00
|
|
|
void kvm_irqfd_release(struct kvm *kvm);
|
2014-06-30 18:51:11 +08:00
|
|
|
void kvm_irq_routing_update(struct kvm *);
|
2012-10-09 06:22:59 +08:00
|
|
|
#else
|
|
|
|
static inline int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
|
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_irqfd_release(struct kvm *kvm) {}
|
|
|
|
#endif
|
2009-05-20 22:30:49 +08:00
|
|
|
|
|
|
|
#else
|
|
|
|
|
2009-07-08 05:08:49 +08:00
|
|
|
static inline void kvm_eventfd_init(struct kvm *kvm) {}
|
2010-11-19 01:09:08 +08:00
|
|
|
|
2012-06-29 23:56:08 +08:00
|
|
|
static inline int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
|
2009-05-20 22:30:49 +08:00
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_irqfd_release(struct kvm *kvm) {}
|
2010-11-19 01:09:08 +08:00
|
|
|
|
2010-11-25 17:25:44 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQCHIP
|
2014-06-30 18:51:11 +08:00
|
|
|
static inline void kvm_irq_routing_update(struct kvm *kvm)
|
2010-11-19 01:09:08 +08:00
|
|
|
{
|
|
|
|
}
|
2010-11-25 17:25:44 +08:00
|
|
|
#endif
|
2010-11-19 01:09:08 +08:00
|
|
|
|
2009-07-08 05:08:49 +08:00
|
|
|
static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
|
|
|
|
{
|
|
|
|
return -ENOSYS;
|
|
|
|
}
|
2009-05-20 22:30:49 +08:00
|
|
|
|
|
|
|
#endif /* CONFIG_HAVE_KVM_EVENTFD */
|
|
|
|
|
2018-02-22 20:05:41 +08:00
|
|
|
void kvm_arch_irq_routing_update(struct kvm *kvm);
|
|
|
|
|
2022-02-24 00:53:02 +08:00
|
|
|
static inline void __kvm_make_request(int req, struct kvm_vcpu *vcpu)
|
2010-05-10 17:34:53 +08:00
|
|
|
{
|
2016-03-10 23:30:22 +08:00
|
|
|
/*
|
|
|
|
* Ensure the rest of the request is published to kvm_check_request's
|
|
|
|
* caller. Paired with the smp_mb__after_atomic in kvm_check_request.
|
|
|
|
*/
|
|
|
|
smp_wmb();
|
2018-07-10 17:27:19 +08:00
|
|
|
set_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
|
2010-05-10 17:34:53 +08:00
|
|
|
}
|
|
|
|
|
2022-02-24 00:53:02 +08:00
|
|
|
static __always_inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Requests that don't require vCPU action should never be logged in
|
|
|
|
* vcpu->requests. The vCPU won't clear the request, so it will stay
|
|
|
|
* logged indefinitely and prevent the vCPU from entering the guest.
|
|
|
|
*/
|
|
|
|
BUILD_BUG_ON(!__builtin_constant_p(req) ||
|
|
|
|
(req & KVM_REQUEST_NO_ACTION));
|
|
|
|
|
|
|
|
__kvm_make_request(req, vcpu);
|
|
|
|
}
|
|
|
|
|
2017-06-04 20:43:52 +08:00
|
|
|
static inline bool kvm_request_pending(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return READ_ONCE(vcpu->requests);
|
|
|
|
}
|
|
|
|
|
2017-04-27 04:32:19 +08:00
|
|
|
static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2018-07-10 17:27:19 +08:00
|
|
|
return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
|
2017-04-27 04:32:19 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_clear_request(int req, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2018-07-10 17:27:19 +08:00
|
|
|
clear_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
|
2017-04-27 04:32:19 +08:00
|
|
|
}
|
|
|
|
|
2010-05-10 17:34:53 +08:00
|
|
|
static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2017-04-27 04:32:19 +08:00
|
|
|
if (kvm_test_request(req, vcpu)) {
|
|
|
|
kvm_clear_request(req, vcpu);
|
2016-03-10 23:30:22 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure the rest of the request is visible to kvm_check_request's
|
|
|
|
* caller. Paired with the smp_wmb in kvm_make_request.
|
|
|
|
*/
|
|
|
|
smp_mb__after_atomic();
|
2010-05-10 18:08:26 +08:00
|
|
|
return true;
|
|
|
|
} else {
|
|
|
|
return false;
|
|
|
|
}
|
2010-05-10 17:34:53 +08:00
|
|
|
}
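The make/check pair above is a publish/consume handshake on a bitmask: the requester writes any associated data, issues a write barrier, then sets the bit; the vCPU tests and clears the bit, then issues a full barrier before acting. The sketch below approximates that pairing in userspace with C11 release/acquire atomics. The `sketch_*` names and the `payload` field are hypothetical, the `KVM_REQUEST_MASK` masking is omitted, and C11 orderings are an approximation of the kernel barriers, not an exact equivalent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_vcpu {
	atomic_ulong requests;	/* one bit per pending request */
	int payload;		/* data published along with a request */
};

static void sketch_make_request(int req, struct sketch_vcpu *vcpu)
{
	/*
	 * Release ordering plays the role of smp_wmb() + set_bit():
	 * earlier stores (such as payload) are visible to whoever
	 * observes the bit with acquire ordering.
	 */
	atomic_fetch_or_explicit(&vcpu->requests, 1UL << req,
				 memory_order_release);
}

static bool sketch_check_request(int req, struct sketch_vcpu *vcpu)
{
	unsigned long mask = 1UL << req;

	if (!(atomic_load_explicit(&vcpu->requests,
				   memory_order_relaxed) & mask))
		return false;

	/*
	 * Clear the bit; acquire-release ordering approximates
	 * clear_bit() followed by smp_mb__after_atomic().
	 */
	atomic_fetch_and_explicit(&vcpu->requests, ~mask,
				  memory_order_acq_rel);
	return true;
}
```

A request is consumed at most once per set: the first check after a make returns true and clears the bit, and later checks return false until the bit is set again.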
|
|
|
|
|
2013-04-06 03:20:30 +08:00
|
|
|
extern bool kvm_rebooting;
|
|
|
|
|
2016-10-14 08:53:19 +08:00
|
|
|
extern unsigned int halt_poll_ns;
|
|
|
|
extern unsigned int halt_poll_ns_grow;
|
2019-01-27 18:17:15 +08:00
|
|
|
extern unsigned int halt_poll_ns_grow_start;
|
2016-10-14 08:53:19 +08:00
|
|
|
extern unsigned int halt_poll_ns_shrink;
|
|
|
|
|
2013-04-12 22:08:42 +08:00
|
|
|
struct kvm_device {
|
2019-10-21 23:28:19 +08:00
|
|
|
const struct kvm_device_ops *ops;
|
2013-04-12 22:08:42 +08:00
|
|
|
struct kvm *kvm;
|
|
|
|
void *private;
|
2013-04-25 22:11:23 +08:00
|
|
|
struct list_head vm_node;
|
2013-04-12 22:08:42 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
/* create, destroy, and name are mandatory */
|
|
|
|
struct kvm_device_ops {
|
|
|
|
const char *name;
|
2016-08-10 01:13:01 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* create is called holding kvm->lock and any operations not suitable
|
|
|
|
* to do while holding the lock should be deferred to init (see
|
|
|
|
* below).
|
|
|
|
*/
|
2013-04-12 22:08:42 +08:00
|
|
|
int (*create)(struct kvm_device *dev, u32 type);
|
|
|
|
|
2016-08-10 01:13:00 +08:00
|
|
|
/*
|
|
|
|
* init is called after create if create is successful and is called
|
|
|
|
* outside of holding kvm->lock.
|
|
|
|
*/
|
|
|
|
void (*init)(struct kvm_device *dev);
|
|
|
|
|
2013-04-12 22:08:42 +08:00
|
|
|
/*
|
|
|
|
* Destroy is responsible for freeing dev.
|
|
|
|
*
|
|
|
|
* Destroy may be called before or after destructors are called
|
|
|
|
* on emulated I/O regions, depending on whether a reference is
|
|
|
|
* held by a vcpu or other kvm component that gets destroyed
|
|
|
|
* after the emulated I/O.
|
|
|
|
*/
|
|
|
|
void (*destroy)(struct kvm_device *dev);
|
|
|
|
|
2019-04-18 18:39:41 +08:00
|
|
|
/*
|
|
|
|
* Release is an alternative method to free the device. It is
|
|
|
|
* called when the device file descriptor is closed. Once
|
|
|
|
* release is called, the destroy method will not be called
|
|
|
|
* anymore as the device is removed from the device list of
|
|
|
|
* the VM. kvm->lock is held.
|
|
|
|
*/
|
|
|
|
void (*release)(struct kvm_device *dev);
|
|
|
|
|
2013-04-12 22:08:42 +08:00
|
|
|
int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
|
|
|
|
int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
|
|
|
|
int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
|
|
|
|
long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
|
|
|
|
unsigned long arg);
|
2019-04-18 18:39:36 +08:00
|
|
|
int (*mmap)(struct kvm_device *dev, struct vm_area_struct *vma);
|
2013-04-12 22:08:42 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
void kvm_device_get(struct kvm_device *dev);
|
|
|
|
void kvm_device_put(struct kvm_device *dev);
|
|
|
|
struct kvm_device *kvm_device_from_filp(struct file *filp);
|
2019-10-21 23:28:19 +08:00
|
|
|
int kvm_register_device_ops(const struct kvm_device_ops *ops, u32 type);
|
2014-10-09 18:30:08 +08:00
|
|
|
void kvm_unregister_device_ops(u32 type);
|
2013-04-12 22:08:42 +08:00
|
|
|
|
2013-04-12 22:08:46 +08:00
|
|
|
extern struct kvm_device_ops kvm_mpic_ops;
|
2014-10-27 07:17:00 +08:00
|
|
|
extern struct kvm_device_ops kvm_arm_vgic_v2_ops;
|
2014-06-07 06:54:51 +08:00
|
|
|
extern struct kvm_device_ops kvm_arm_vgic_v3_ops;
|
2013-04-12 22:08:46 +08:00
|
|
|
|
2012-07-18 21:37:46 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
|
|
|
|
|
|
|
|
static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
|
|
|
|
{
|
|
|
|
vcpu->spin_loop.in_spin_loop = val;
|
|
|
|
}
|
|
|
|
static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
|
|
|
|
{
|
|
|
|
vcpu->spin_loop.dy_eligible = val;
|
|
|
|
}
|
|
|
|
|
|
|
|
#else /* !CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
|
|
|
|
|
|
|
|
static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_vcpu_set_dy_eligible(struct kvm_vcpu *vcpu, bool val)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
|
2015-09-18 22:29:43 +08:00
|
|
|
|
2020-04-16 21:48:07 +08:00
|
|
|
static inline bool kvm_is_visible_memslot(struct kvm_memory_slot *memslot)
|
|
|
|
{
|
|
|
|
return (memslot && memslot->id < KVM_USER_MEM_SLOTS &&
|
|
|
|
!(memslot->flags & KVM_MEMSLOT_INVALID));
|
|
|
|
}
|
|
|
|
|
2020-01-09 22:57:19 +08:00
|
|
|
struct kvm_vcpu *kvm_get_running_vcpu(void);
|
2020-02-28 16:49:41 +08:00
|
|
|
struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
|
2020-01-09 22:57:19 +08:00
|
|
|
|
2015-09-18 22:29:43 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
|
2016-05-06 01:58:35 +08:00
|
|
|
bool kvm_arch_has_irq_bypass(void);
|
2015-09-18 22:29:43 +08:00
|
|
|
int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
|
|
|
|
struct irq_bypass_producer *);
|
|
|
|
void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *,
|
|
|
|
struct irq_bypass_producer *);
|
|
|
|
void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *);
|
|
|
|
void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *);
|
2015-09-18 22:29:53 +08:00
|
|
|
int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
|
|
|
|
uint32_t guest_irq, bool set);
|
2021-08-27 16:00:03 +08:00
|
|
|
bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *,
|
|
|
|
struct kvm_kernel_irq_routing_entry *);
|
2015-09-18 22:29:43 +08:00
|
|
|
#endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
|
2015-10-20 15:39:03 +08:00
|
|
|
|
2016-05-13 18:16:35 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_INVALID_WAKEUPS
|
|
|
|
/* If we wake up during the poll time, was it a successful poll? */
|
|
|
|
static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return vcpu->valid_wakeup;
|
|
|
|
}
|
|
|
|
|
|
|
|
#else
|
|
|
|
static inline bool vcpu_valid_wakeup(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_INVALID_WAKEUPS */
|
|
|
|
|
2019-03-05 18:30:01 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_NO_POLL
|
|
|
|
/* Callback that tells if we must not poll */
|
|
|
|
bool kvm_arch_no_poll(struct kvm_vcpu *vcpu);
|
|
|
|
#else
|
|
|
|
static inline bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_NO_POLL */
|
|
|
|
|
2017-12-13 00:41:34 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL
|
|
|
|
long kvm_arch_vcpu_async_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl, unsigned long arg);
|
|
|
|
#else
|
|
|
|
static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
|
|
|
|
unsigned int ioctl,
|
|
|
|
unsigned long arg)
|
|
|
|
{
|
|
|
|
return -ENOIOCTLCMD;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
|
|
|
|
|
2020-06-06 12:26:27 +08:00
|
|
|
void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
|
|
|
|
unsigned long start, unsigned long end);
|
2018-02-22 20:04:39 +08:00
|
|
|
|
2022-04-21 11:14:07 +08:00
|
|
|
void kvm_arch_guest_memory_reclaimed(struct kvm *kvm);
|
|
|
|
|
2018-02-24 00:23:57 +08:00
|
|
|
#ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
|
|
|
|
int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
|
|
|
|
#else
|
|
|
|
static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE */
|
|
|
|
|
2019-11-04 19:22:02 +08:00
|
|
|
typedef int (*kvm_vm_thread_fn_t)(struct kvm *kvm, uintptr_t data);
|
|
|
|
|
|
|
|
int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
|
|
|
|
uintptr_t data, const char *name,
|
|
|
|
struct task_struct **thread_ptr);
|
|
|
|
|
2020-07-23 05:59:59 +08:00
|
|
|
#ifdef CONFIG_KVM_XFER_TO_GUEST_WORK
|
|
|
|
static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
vcpu->run->exit_reason = KVM_EXIT_INTR;
|
|
|
|
vcpu->stat.signal_exits++;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
|
|
|
|
|
2022-08-23 08:46:37 +08:00
|
|
|
/*
|
|
|
|
* If more than one page is being (un)accounted, @virt must be the address of
|
|
|
|
* the first page of a block of pages that were allocated together (i.e.
|
|
|
|
* accounted together).
|
|
|
|
*
|
|
|
|
* kvm_account_pgtable_pages() is thread-safe because mod_lruvec_page_state()
|
|
|
|
* is thread-safe.
|
|
|
|
*/
|
|
|
|
static inline void kvm_account_pgtable_pages(void *virt, int nr)
|
|
|
|
{
|
|
|
|
mod_lruvec_page_state(virt_to_page(virt), NR_SECONDARY_PAGETABLE, nr);
|
|
|
|
}
|
|
|
|
|
2020-10-01 09:22:22 +08:00
|
|
|
/*
|
|
|
|
* This defines how many reserved entries we want to keep before we
|
|
|
|
* kick the vcpu out to userspace to avoid a full dirty ring. This
|
|
|
|
* value can be tuned higher if, e.g., PML is enabled on the host.
|
|
|
|
*/
|
|
|
|
#define KVM_DIRTY_RING_RSVD_ENTRIES 64
|
|
|
|
|
|
|
|
/* Max number of entries allowed for each kvm dirty ring */
|
|
|
|
#define KVM_DIRTY_RING_MAX_ENTRIES 65536
|
|
|
|
|
2009-08-26 19:57:50 +08:00
|
|
|
#endif
|