.. SPDX-License-Identifier: GPL-2.0
|
|
|
|
=================
KVM-specific MSRs
=================
|
|
|
|
:Author: Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
|
|
|
|
KVM makes use of some custom MSRs to service some requests.
|
|
|
|
Custom MSRs have a range reserved for them, that goes from
|
|
0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
|
|
but they are deprecated and their use is discouraged.
|
|
|
|
Custom MSR list
---------------
|
|
|
|
The currently supported custom MSRs are:
|
|
|
|
MSR_KVM_WALL_CLOCK_NEW:
|
|
0x4b564d00
|
|
|
|
data:
|
|
4-byte aligned physical address of a memory area which must be
in guest RAM. This memory is expected to hold a copy of the following
structure::
|
|
|
|
    struct pvclock_wall_clock {
        u32   version;
        u32   sec;
        u32   nsec;
    } __attribute__((__packed__));
|
|
|
|
whose data will be filled in by the hypervisor. The hypervisor is only
|
|
guaranteed to update this data at the moment of MSR write.
|
|
Users that want to reliably query this information more than once have
|
|
to write more than once to this MSR. Fields have the following meanings:
|
|
|
|
version:
|
|
guest has to check version before and after grabbing
|
|
time information and check that they are both equal and even.
|
|
An odd version indicates an in-progress update.
|
|
|
|
sec:
|
|
number of seconds for wallclock at time of boot.
|
|
|
|
nsec:
|
|
number of nanoseconds for wallclock at time of boot.
|
|
|
|
In order to get the current wallclock time, the system_time from
|
|
MSR_KVM_SYSTEM_TIME_NEW needs to be added.
|
|
|
|
Note that although MSRs are per-CPU entities, the effect of this
|
|
particular MSR is global.
|
|
|
|
Availability of this MSR must be checked via bit 3 in the 0x40000001
cpuid leaf prior to usage.
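
The version check described above can be sketched as a retry loop. This
is only an illustration: the helper name ``read_wall_clock`` is
hypothetical, fixed-width C types stand in for ``u32``, and the guest is
assumed to have already written the structure's address to the MSR.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Layout as documented above, with uint32_t standing in for u32. */
struct pvclock_wall_clock {
	uint32_t version;
	uint32_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

/*
 * Hypothetical helper: copy sec/nsec out of the shared area, retrying
 * while the version is odd (update in progress) or changed mid-read.
 */
static void read_wall_clock(const volatile struct pvclock_wall_clock *wc,
			    uint32_t *sec, uint32_t *nsec)
{
	uint32_t before, after;

	do {
		before = wc->version;
		atomic_thread_fence(memory_order_acquire);
		*sec = wc->sec;
		*nsec = wc->nsec;
		atomic_thread_fence(memory_order_acquire);
		after = wc->version;
	} while ((before & 1) || before != after);
}
```

The returned values are the wallclock at boot; the system_time from
MSR_KVM_SYSTEM_TIME_NEW still has to be added to obtain the current
wallclock time.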
|
|
|
|
MSR_KVM_SYSTEM_TIME_NEW:
|
|
0x4b564d01
|
|
|
|
data:
|
|
4-byte aligned physical address of a memory area which must be in
|
|
guest RAM, plus an enable bit in bit 0. This memory is expected to hold
|
|
a copy of the following structure::
|
|
|
|
    struct pvclock_vcpu_time_info {
        u32   version;
        u32   pad0;
        u64   tsc_timestamp;
        u64   system_time;
        u32   tsc_to_system_mul;
        s8    tsc_shift;
        u8    flags;
        u8    pad[2];
    } __attribute__((__packed__)); /* 32 bytes */
|
|
|
|
whose data will be filled in by the hypervisor periodically. Only one
|
|
write, or registration, is needed for each VCPU. The interval between
|
|
updates of this structure is arbitrary and implementation-dependent.
|
|
The hypervisor may update this structure at any time it sees fit until
|
|
anything with bit0 == 0 is written to it.
|
|
|
|
Fields have the following meanings:
|
|
|
|
version:
|
|
guest has to check version before and after grabbing
|
|
time information and check that they are both equal and even.
|
|
An odd version indicates an in-progress update.
|
|
|
|
tsc_timestamp:
|
|
the tsc value at the current VCPU at the time
|
|
of the update of this structure. Guests can subtract this value
|
|
from current tsc to derive a notion of elapsed time since the
|
|
structure update.
|
|
|
|
system_time:
|
|
a host notion of monotonic time, including sleep
|
|
time at the time this structure was last updated. Unit is
|
|
nanoseconds.
|
|
|
|
tsc_to_system_mul:
|
|
multiplier to be used when converting
|
|
tsc-related quantity to nanoseconds
|
|
|
|
tsc_shift:
|
|
shift to be used when converting tsc-related
|
|
quantity to nanoseconds. This shift will ensure that
|
|
multiplication with tsc_to_system_mul does not overflow.
|
|
A positive value denotes a left shift, a negative value
|
|
a right shift.
|
|
|
|
The conversion from tsc to nanoseconds involves an additional
|
|
right shift by 32 bits. With this information, guests can
|
|
derive per-CPU time by doing::
|
|
|
|
    time = (current_tsc - tsc_timestamp);
    if (tsc_shift >= 0)
        time <<= tsc_shift;
    else
        time >>= -tsc_shift;
    time = (time * tsc_to_system_mul) >> 32;
    time = time + system_time;
|
|
|
|
flags:
|
|
bits in this field indicate extended capabilities
|
|
coordinated between the guest and the hypervisor. Availability
|
|
of specific flags has to be checked in 0x40000001 cpuid leaf.
|
|
Current flags are:
|
|
|
|
|
|
+-----------+--------------+----------------------------------+
| flag bit  | cpuid bit    | meaning                          |
+-----------+--------------+----------------------------------+
|           |              | time measures taken across       |
|    0      |      24      | multiple cpus are guaranteed to  |
|           |              | be monotonic                     |
+-----------+--------------+----------------------------------+
|           |              | guest vcpu has been paused by    |
|    1      |     N/A      | the host                         |
|           |              | See 4.70 in api.txt              |
+-----------+--------------+----------------------------------+
|
|
|
|
Availability of this MSR must be checked via bit 3 in the 0x40000001
cpuid leaf prior to usage.
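
Putting the field descriptions together, a guest-side read could look
roughly like the sketch below. The helper names ``pvclock_scale_delta``
and ``pvclock_read_ns`` are hypothetical, the TSC value is passed in as a
parameter only to keep the example testable (a real guest would sample
the TSC inside the retry loop), and ``unsigned __int128`` is a GCC/Clang
extension used here so the 64x32-bit multiply cannot overflow.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Layout as documented above, with fixed-width C types. */
struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  flags;
	uint8_t  pad[2];
} __attribute__((__packed__));

/* Shift, multiply, then drop the low 32 bits, as described above. */
static uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Hypothetical reader: retries while the version is odd or changes. */
static uint64_t pvclock_read_ns(const volatile struct pvclock_vcpu_time_info *ti,
				uint64_t current_tsc)
{
	uint32_t version;
	uint64_t ns;

	do {
		version = ti->version;
		atomic_thread_fence(memory_order_acquire);
		ns = ti->system_time +
		     pvclock_scale_delta(current_tsc - ti->tsc_timestamp,
					 ti->tsc_to_system_mul, ti->tsc_shift);
		atomic_thread_fence(memory_order_acquire);
	} while ((version & 1) || version != ti->version);

	return ns;
}
```

For instance, with tsc_to_system_mul = 0x80000000 (a factor of 0.5) and
tsc_shift = 1, a TSC delta of 100 is doubled to 200 and then halved,
giving 100 nanoseconds on top of system_time.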
|
|
|
|
|
|
MSR_KVM_WALL_CLOCK:
|
|
0x11
|
|
|
|
data and functioning:
|
|
same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.
|
|
|
|
This MSR falls outside the reserved KVM range and may be removed in the
|
|
future. Its usage is deprecated.
|
|
|
|
Availability of this MSR must be checked via bit 0 in the 0x40000001
cpuid leaf prior to usage.
|
|
|
|
MSR_KVM_SYSTEM_TIME:
|
|
0x12
|
|
|
|
data and functioning:
|
|
same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.
|
|
|
|
This MSR falls outside the reserved KVM range and may be removed in the
|
|
future. Its usage is deprecated.
|
|
|
|
Availability of this MSR must be checked via bit 0 in the 0x40000001
cpuid leaf prior to usage.
|
|
|
|
The suggested algorithm for detecting kvmclock presence is then::
|
|
|
|
    if (!kvm_para_available())    /* refer to cpuid.txt */
        return NON_PRESENT;

    flags = cpuid_eax(0x40000001);
    if (flags & (1 << 3)) {           /* bit 3: the *_NEW MSRs */
        msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
        msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
        return PRESENT;
    } else if (flags & (1 << 0)) {    /* bit 0: the deprecated MSRs */
        msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
        msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
        return PRESENT;
    } else {
        return NON_PRESENT;
    }
|
|
|
|
MSR_KVM_ASYNC_PF_EN:
|
|
0x4b564d02
|
|
|
|
data:
|
|
Asynchronous page fault (APF) control MSR.
|
|
|
|
Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
|
|
which must be in guest RAM and must be zeroed. This memory is expected
|
|
to hold a copy of the following structure::
|
|
|
|
    struct kvm_vcpu_pv_apf_data {
        /* Used for 'page not present' events delivered via #PF */
        __u32 flags;

        /* Used for 'page ready' events delivered via interrupt notification */
        __u32 token;

        __u8 pad[56];
        __u32 enabled;
    };
|
|
|
|
Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
|
|
when asynchronous page faults are enabled on the vcpu, 0 when disabled.
|
|
Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
|
|
cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
|
|
#PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
|
|
present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
|
|
events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
|
|
CPUID.
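
As an illustration of the bit layout just described, the value written
to the MSR could be assembled as follows. The ``APF_*`` macro names and
the helper ``apf_en_value`` are hypothetical and chosen for this sketch
only; checking the corresponding CPUID feature bits is left to the
caller.

```c
#include <stdint.h>

/* Hypothetical names for the control bits documented above. */
#define APF_ENABLED		(1ULL << 0)	/* enable APF on this vcpu */
#define APF_SEND_ALWAYS		(1ULL << 1)	/* inject also at cpl == 0 */
#define APF_DELIVER_AS_VMEXIT	(1ULL << 2)	/* deliver to L1 as #PF vmexit */
#define APF_DELIVER_AS_INT	(1ULL << 3)	/* interrupt-based 'page ready' */

/*
 * Combine a 64-byte aligned guest physical address with control bits.
 * Returns 0 on success, -1 if the address is misaligned or if bits
 * outside 3-0 (reserved bits 5-4 or address bits) are set in ctrl.
 */
static int apf_en_value(uint64_t gpa, uint64_t ctrl, uint64_t *out)
{
	if (gpa & 0x3fULL)
		return -1;
	if (ctrl & ~0xfULL)
		return -1;
	*out = gpa | ctrl;
	return 0;
}
```

For a zeroed structure at guest physical address 0x1000 with bits 0 and
3 set, the resulting MSR value is 0x1009.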
|
|
|
|
'Page not present' events are currently always delivered as a synthetic
#PF exception. During delivery of these events, the APF CR2 register
contains a token that will be used to notify the guest when the missing
page becomes available. Also, to make it possible to distinguish between
a real #PF and an APF, the first 4 bytes of the 64 byte memory location
('flags') will be written to by the hypervisor at the time of injection.
Only the first bit of 'flags' is currently supported; when set, it
indicates that the guest is dealing with an asynchronous 'page not
present' event. If 'flags' is '0' during a page fault, it is a regular
page fault. The guest is supposed to clear 'flags' when it is done
handling the #PF exception so that the next event can be delivered.
|
|
|
|
Note, since APF 'page not present' events use the same exception vector
|
|
as regular page fault, guest must reset 'flags' to '0' before it does
|
|
something that can generate normal page fault.
|
|
|
|
Bytes 4-7 of the 64 byte memory location ('token') will be written to by
the hypervisor at the time of APF 'page ready' event injection. The
content of these bytes is a token which was previously delivered as a
'page not present' event. The event indicates that the page is now
available. The guest is supposed to write '0' to 'token' when it is done
handling the 'page ready' event and to write '1' to MSR_KVM_ASYNC_PF_ACK
after clearing the location; writing to the MSR forces KVM to re-scan
its queue and deliver the next pending notification.
|
|
|
|
Note, MSR_KVM_ASYNC_PF_INT MSR specifying the interrupt vector for 'page
|
|
ready' APF delivery needs to be written to before enabling APF mechanism
|
|
in MSR_KVM_ASYNC_PF_EN or interrupt #0 can get injected. The MSR is
|
|
available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
|
|
|
|
Note, previously, 'page ready' events were delivered via the same #PF
exception as 'page not present' events, but this is now deprecated. If
bit 3 (interrupt based delivery) is not set, APF events are not
delivered.
|
|
|
|
If APF is disabled while there are outstanding APFs, they will
|
|
not be delivered.
|
|
|
|
Currently, 'page ready' APF events are always delivered on the same
vcpu as the corresponding 'page not present' event was, but guests
should not rely on that.
|
|
|
|
MSR_KVM_STEAL_TIME:
|
|
0x4b564d03
|
|
|
|
data:
|
|
64-byte aligned physical address of a memory area which must be
in guest RAM, plus an enable bit in bit 0. This memory is expected to
hold a copy of the following structure::
|
|
|
|
    struct kvm_steal_time {
        __u64 steal;
        __u32 version;
        __u32 flags;
        __u8  preempted;
        __u8  u8_pad[3];
        __u32 pad[11];
    };
|
|
|
|
whose data will be filled in by the hypervisor periodically. Only one
|
|
write, or registration, is needed for each VCPU. The interval between
|
|
updates of this structure is arbitrary and implementation-dependent.
|
|
The hypervisor may update this structure at any time it sees fit until
|
|
anything with bit0 == 0 is written to it. Guest is required to make sure
|
|
this structure is initialized to zero.
|
|
|
|
Fields have the following meanings:
|
|
|
|
version:
|
|
a sequence counter. In other words, guest has to check
|
|
this field before and after grabbing time information and make
|
|
sure they are both equal and even. An odd version indicates an
|
|
in-progress update.
|
|
|
|
flags:
|
|
At this point, always zero. May be used to indicate
|
|
changes in this structure in the future.
|
|
|
|
steal:
|
|
the amount of time in which this vCPU did not run, in
|
|
nanoseconds. Time during which the vcpu is idle will not be
|
|
reported as steal time.
|
|
|
|
preempted:
|
|
indicates whether the vCPU that owns this struct is running or
not. Non-zero values mean the vCPU has been preempted. Zero
means the vCPU is not preempted. NOTE, it is always zero if
the hypervisor doesn't support this field.
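
The sequence-counter protocol on 'version' can be sketched as below.
The helper name ``read_steal_ns`` is hypothetical, and the structure is
assumed to have been zeroed and registered through the MSR beforehand.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Layout as documented above, with fixed-width C types. */
struct kvm_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* The registered area is 64 bytes, matching the required alignment. */
_Static_assert(sizeof(struct kvm_steal_time) == 64, "unexpected layout");

/*
 * Hypothetical helper: return the steal time in nanoseconds, retrying
 * while the version is odd (update in progress) or changes mid-read.
 */
static uint64_t read_steal_ns(const volatile struct kvm_steal_time *st)
{
	uint32_t version;
	uint64_t steal;

	do {
		version = st->version;
		atomic_thread_fence(memory_order_acquire);
		steal = st->steal;
		atomic_thread_fence(memory_order_acquire);
	} while ((version & 1) || version != st->version);

	return steal;
}
```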
|
|
|
|
MSR_KVM_EOI_EN:
|
|
0x4b564d04
|
|
|
|
data:
|
|
Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
|
|
when disabled. Bit 1 is reserved and must be zero. When PV end of
|
|
interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
|
|
physical address of a 4 byte memory area which must be in guest RAM and
|
|
must be zeroed.
|
|
|
|
The first, least significant bit of 4 byte memory location will be
|
|
written to by the hypervisor, typically at the time of interrupt
|
|
injection. Value of 1 means that guest can skip writing EOI to the apic
|
|
(using MSR or MMIO write); instead, it is sufficient to signal
|
|
EOI by clearing the bit in guest memory - this location will
|
|
later be polled by the hypervisor.
|
|
Value of 0 means that the EOI write is required.
|
|
|
|
It is always safe for the guest to ignore the optimization and perform
|
|
the APIC EOI write anyway.
|
|
|
|
The hypervisor is guaranteed to only modify this least significant bit
while in the current VCPU context. This means that the guest does not
need to use either a lock prefix or memory ordering primitives to
synchronise with the hypervisor.
|
|
|
|
However, the hypervisor can set and clear this memory bit at any time.
To make sure the hypervisor does not interrupt the guest and clear the
least significant bit in the window between the guest testing it (to
detect whether it can skip the APIC EOI write) and the guest clearing it
(to signal EOI to the hypervisor), the guest must both read the least
significant bit in the memory area and clear it using a single CPU
instruction, such as test and clear, or compare and exchange.
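
A portable approximation of the read-and-clear step is sketched below
using a C11 atomic exchange. This is a stand-in, not the instruction
sequence a real guest has to use: the exchange is stronger than the
unlocked test-and-clear the text permits, and the helper name
``pv_eoi_try_skip`` is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Read the 4-byte PV EOI word and clear it in one atomic operation.
 * Returns true if bit 0 was set, i.e. the APIC EOI write can be
 * skipped; false means the guest still has to write EOI to the APIC.
 */
static bool pv_eoi_try_skip(_Atomic uint32_t *pv_eoi_word)
{
	uint32_t old = atomic_exchange_explicit(pv_eoi_word, 0,
						memory_order_relaxed);

	return (old & 1u) != 0;
}
```

A caller would fall back to the regular APIC EOI write whenever the
helper returns false.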
|
|
|
|
MSR_KVM_POLL_CONTROL:
|
|
0x4b564d05
|
|
|
|
Control host-side polling.
|
|
|
|
data:
|
|
Bit 0 enables (1) or disables (0) host-side HLT polling logic.
|
|
|
|
KVM guests can request the host not to poll on HLT, for example if
|
|
they are performing polling themselves.
|
|
|
|
MSR_KVM_ASYNC_PF_INT:
|
|
0x4b564d06
|
|
|
|
data:
|
|
Second asynchronous page fault (APF) control MSR.
|
|
|
|
Bits 0-7: APIC vector for delivery of 'page ready' APF events.
|
|
Bits 8-63: Reserved
|
|
|
|
Interrupt vector for asynchronous 'page ready' notification delivery.
|
|
The vector has to be set up before asynchronous page fault mechanism
|
|
is enabled in MSR_KVM_ASYNC_PF_EN. The MSR is only available if
|
|
KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
|
|
|
|
MSR_KVM_ASYNC_PF_ACK:
|
|
0x4b564d07
|
|
|
|
data:
|
|
Asynchronous page fault (APF) acknowledgment.
|
|
|
|
When the guest is done processing a 'page ready' APF event and the
'token' field in 'struct kvm_vcpu_pv_apf_data' is cleared, it is
supposed to write '1' to bit 0 of the MSR. This causes the host to
re-scan its queue and check if there are more notifications pending.
The MSR is available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
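
The order of operations in a 'page ready' handler (read the token,
clear it, then acknowledge) can be sketched as follows. The names
``apf_page_ready``, ``wrmsr_fn`` and ``fake_wrmsr`` are hypothetical;
``fake_wrmsr`` merely records the write so the sketch can run in user
space, where the privileged wrmsr instruction is not available. Waking
whatever was waiting on the token is left to the caller.

```c
#include <stdint.h>

#define MSR_KVM_ASYNC_PF_ACK	0x4b564d07

/* Layout as documented for MSR_KVM_ASYNC_PF_EN, with fixed-width types. */
struct kvm_vcpu_pv_apf_data {
	uint32_t flags;
	uint32_t token;
	uint8_t  pad[56];
	uint32_t enabled;
};

/* Hypothetical MSR write hook; a real guest would execute wrmsr. */
typedef void (*wrmsr_fn)(uint32_t msr, uint64_t val);

/* User-space stand-in that only remembers the last MSR write. */
static uint32_t last_msr;
static uint64_t last_val;

static void fake_wrmsr(uint32_t msr, uint64_t val)
{
	last_msr = msr;
	last_val = val;
}

/*
 * Consume one 'page ready' event: fetch the token, clear the slot,
 * then write 1 to the ACK MSR so the host re-scans its queue.
 */
static uint32_t apf_page_ready(volatile struct kvm_vcpu_pv_apf_data *apf,
			       wrmsr_fn wrmsr)
{
	uint32_t token = apf->token;

	apf->token = 0;
	wrmsr(MSR_KVM_ASYNC_PF_ACK, 1);
	return token;
}
```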
|