613 lines
30 KiB
Plaintext
613 lines
30 KiB
Plaintext
|
|
||
|
Timekeeping Virtualization for X86-Based Architectures
|
||
|
|
||
|
Zachary Amsden <zamsden@redhat.com>
|
||
|
Copyright (c) 2010, Red Hat. All rights reserved.
|
||
|
|
||
|
1) Overview
|
||
|
2) Timing Devices
|
||
|
3) TSC Hardware
|
||
|
4) Virtualization Problems
|
||
|
|
||
|
=========================================================================
|
||
|
|
||
|
1) Overview
|
||
|
|
||
|
One of the most complicated parts of the X86 platform, and specifically,
|
||
|
the virtualization of this platform is the plethora of timing devices available
|
||
|
and the complexity of emulating those devices. In addition, virtualization of
|
||
|
time introduces a new set of challenges because it introduces a multiplexed
|
||
|
division of time beyond the control of the guest CPU.
|
||
|
|
||
|
First, we will describe the various timekeeping hardware available, then
|
||
|
present some of the problems which arise and solutions available, giving
|
||
|
specific recommendations for certain classes of KVM guests.
|
||
|
|
||
|
The purpose of this document is to collect data and information relevant to
|
||
|
timekeeping which may be difficult to find elsewhere, specifically,
|
||
|
information relevant to KVM and hardware-based virtualization.
|
||
|
|
||
|
=========================================================================
|
||
|
|
||
|
2) Timing Devices
|
||
|
|
||
|
First we discuss the basic hardware devices available. TSC and the related
|
||
|
KVM clock are special enough to warrant a full exposition and are described in
|
||
|
the following section.
|
||
|
|
||
|
2.1) i8254 - PIT
|
||
|
|
||
|
One of the first timer devices available is the programmable interrupt timer,
|
||
|
or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three
|
||
|
channels which can be programmed to deliver periodic or one-shot interrupts.
|
||
|
These three channels can be configured in different modes and have individual
|
||
|
counters. Channel 1 and 2 were not available for general use in the original
|
||
|
IBM PC, and historically were connected to control RAM refresh and the PC
|
||
|
speaker. Now the PIT is typically integrated as part of an emulated chipset
|
||
|
and a separate physical PIT is not used.
|
||
|
|
||
|
The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done
|
||
|
using single or multiple byte access to the I/O ports. There are 6 modes
|
||
|
available, but not all modes are available to all timers, as only timer 2
|
||
|
has a connected gate input, required for modes 1 and 5. The gate line is
|
||
|
controlled by port 61h, bit 0, as illustrated in the following diagram.
|
||
|
|
||
|
-------------- ----------------
|
||
|
| | | |
|
||
|
| 1.1932 MHz |---------->| CLOCK OUT | ---------> IRQ 0
|
||
|
| Clock | | | |
|
||
|
-------------- | +->| GATE TIMER 0 |
|
||
|
| ----------------
|
||
|
|
|
||
|
| ----------------
|
||
|
| | |
|
||
|
|------>| CLOCK OUT | ---------> 66.3 KHZ DRAM
|
||
|
| | | (aka /dev/null)
|
||
|
| +->| GATE TIMER 1 |
|
||
|
| ----------------
|
||
|
|
|
||
|
| ----------------
|
||
|
| | |
|
||
|
|------>| CLOCK OUT | ---------> Port 61h, bit 5
|
||
|
| | |
|
||
|
Port 61h, bit 0 ---------->| GATE TIMER 2 | \_.---- ____
|
||
|
---------------- _| )--|LPF|---Speaker
|
||
|
/ *---- \___/
|
||
|
Port 61h, bit 1 -----------------------------------/
|
||
|
|
||
|
The timer modes are now described.
|
||
|
|
||
|
Mode 0: Single Timeout. This is a one-shot software timeout that counts down
|
||
|
when the gate is high (always true for timers 0 and 1). When the count
|
||
|
reaches zero, the output goes high.
|
||
|
|
||
|
Mode 1: Triggered One-shot. The output is intially set high. When the gate
|
||
|
line is set high, a countdown is initiated (which does not stop if the gate is
|
||
|
lowered), during which the output is set low. When the count reaches zero,
|
||
|
the output goes high.
|
||
|
|
||
|
Mode 2: Rate Generator. The output is initially set high. When the countdown
|
||
|
reaches 1, the output goes low for one count and then returns high. The value
|
||
|
is reloaded and the countdown automatically resumes. If the gate line goes
|
||
|
low, the count is halted. If the output is low when the gate is lowered, the
|
||
|
output automatically goes high (this only affects timer 2).
|
||
|
|
||
|
Mode 3: Square Wave. This generates a high / low square wave. The count
|
||
|
determines the length of the pulse, which alternates between high and low
|
||
|
when zero is reached. The count only proceeds when gate is high and is
|
||
|
automatically reloaded on reaching zero. The count is decremented twice at
|
||
|
each clock to generate a full high / low cycle at the full periodic rate.
|
||
|
If the count is even, the clock remains high for N/2 counts and low for N/2
|
||
|
counts; if the clock is odd, the clock is high for (N+1)/2 counts and low
|
||
|
for (N-1)/2 counts. Only even values are latched by the counter, so odd
|
||
|
values are not observed when reading. This is the intended mode for timer 2,
|
||
|
which generates sine-like tones by low-pass filtering the square wave output.
|
||
|
|
||
|
Mode 4: Software Strobe. After programming this mode and loading the counter,
|
||
|
the output remains high until the counter reaches zero. Then the output
|
||
|
goes low for 1 clock cycle and returns high. The counter is not reloaded.
|
||
|
Counting only occurs when gate is high.
|
||
|
|
||
|
Mode 5: Hardware Strobe. After programming and loading the counter, the
|
||
|
output remains high. When the gate is raised, a countdown is initiated
|
||
|
(which does not stop if the gate is lowered). When the counter reaches zero,
|
||
|
the output goes low for 1 clock cycle and then returns high. The counter is
|
||
|
not reloaded.
|
||
|
|
||
|
In addition to normal binary counting, the PIT supports BCD counting. The
|
||
|
command port, 0x43 is used to set the counter and mode for each of the three
|
||
|
timers.
|
||
|
|
||
|
PIT commands, issued to port 0x43, using the following bit encoding:
|
||
|
|
||
|
Bit 7-4: Command (See table below)
|
||
|
Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
|
||
|
Bit 0 : Binary (0) / BCD (1)
|
||
|
|
||
|
Command table:
|
||
|
|
||
|
0000 - Latch Timer 0 count for port 0x40
|
||
|
sample and hold the count to be read in port 0x40;
|
||
|
additional commands ignored until counter is read;
|
||
|
mode bits ignored.
|
||
|
|
||
|
0001 - Set Timer 0 LSB mode for port 0x40
|
||
|
set timer to read LSB only and force MSB to zero;
|
||
|
mode bits set timer mode
|
||
|
|
||
|
0010 - Set Timer 0 MSB mode for port 0x40
|
||
|
set timer to read MSB only and force LSB to zero;
|
||
|
mode bits set timer mode
|
||
|
|
||
|
0011 - Set Timer 0 16-bit mode for port 0x40
|
||
|
set timer to read / write LSB first, then MSB;
|
||
|
mode bits set timer mode
|
||
|
|
||
|
0100 - Latch Timer 1 count for port 0x41 - as described above
|
||
|
0101 - Set Timer 1 LSB mode for port 0x41 - as described above
|
||
|
0110 - Set Timer 1 MSB mode for port 0x41 - as described above
|
||
|
0111 - Set Timer 1 16-bit mode for port 0x41 - as described above
|
||
|
|
||
|
1000 - Latch Timer 2 count for port 0x42 - as described above
|
||
|
1001 - Set Timer 2 LSB mode for port 0x42 - as described above
|
||
|
1010 - Set Timer 2 MSB mode for port 0x42 - as described above
|
||
|
1011 - Set Timer 2 16-bit mode for port 0x42 as described above
|
||
|
|
||
|
1101 - General counter latch
|
||
|
Latch combination of counters into corresponding ports
|
||
|
Bit 3 = Counter 2
|
||
|
Bit 2 = Counter 1
|
||
|
Bit 1 = Counter 0
|
||
|
Bit 0 = Unused
|
||
|
|
||
|
1110 - Latch timer status
|
||
|
Latch combination of counter mode into corresponding ports
|
||
|
Bit 3 = Counter 2
|
||
|
Bit 2 = Counter 1
|
||
|
Bit 1 = Counter 0
|
||
|
|
||
|
The output of ports 0x40-0x42 following this command will be:
|
||
|
|
||
|
Bit 7 = Output pin
|
||
|
Bit 6 = Count loaded (0 if timer has expired)
|
||
|
Bit 5-4 = Read / Write mode
|
||
|
01 = MSB only
|
||
|
10 = LSB only
|
||
|
11 = LSB / MSB (16-bit)
|
||
|
Bit 3-1 = Mode
|
||
|
Bit 0 = Binary (0) / BCD mode (1)
|
||
|
|
||
|
2.2) RTC
|
||
|
|
||
|
The second device which was available in the original PC was the MC146818 real
|
||
|
time clock. The original device is now obsolete, and usually emulated by the
|
||
|
system chipset, sometimes by an HPET and some frankenstein IRQ routing.
|
||
|
|
||
|
The RTC is accessed through CMOS variables, which uses an index register to
|
||
|
control which bytes are read. Since there is only one index register, read
|
||
|
of the CMOS and read of the RTC require lock protection (in addition, it is
|
||
|
dangerous to allow userspace utilities such as hwclock to have direct RTC
|
||
|
access, as they could corrupt kernel reads and writes of CMOS memory).
|
||
|
|
||
|
The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt
|
||
|
can function as a periodic timer, an additional once a day alarm, and can issue
|
||
|
interrupts after an update of the CMOS registers by the MC146818 is complete.
|
||
|
The type of interrupt is signalled in the RTC status registers.
|
||
|
|
||
|
The RTC will update the current time fields by battery power even while the
|
||
|
system is off. The current time fields should not be read while an update is
|
||
|
in progress, as indicated in the status register.
|
||
|
|
||
|
The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
|
||
|
programmed to a 32kHz divider if the RTC is to count seconds.
|
||
|
|
||
|
This is the RAM map originally used for the RTC/CMOS:
|
||
|
|
||
|
Location Size Description
|
||
|
------------------------------------------
|
||
|
00h byte Current second (BCD)
|
||
|
01h byte Seconds alarm (BCD)
|
||
|
02h byte Current minute (BCD)
|
||
|
03h byte Minutes alarm (BCD)
|
||
|
04h byte Current hour (BCD)
|
||
|
05h byte Hours alarm (BCD)
|
||
|
06h byte Current day of week (BCD)
|
||
|
07h byte Current day of month (BCD)
|
||
|
08h byte Current month (BCD)
|
||
|
09h byte Current year (BCD)
|
||
|
0Ah byte Register A
|
||
|
bit 7 = Update in progress
|
||
|
bit 6-4 = Divider for clock
|
||
|
000 = 4.194 MHz
|
||
|
001 = 1.049 MHz
|
||
|
010 = 32 kHz
|
||
|
10X = test modes
|
||
|
110 = reset / disable
|
||
|
111 = reset / disable
|
||
|
bit 3-0 = Rate selection for periodic interrupt
|
||
|
000 = periodic timer disabled
|
||
|
001 = 3.90625 uS
|
||
|
010 = 7.8125 uS
|
||
|
011 = .122070 mS
|
||
|
100 = .244141 mS
|
||
|
...
|
||
|
1101 = 125 mS
|
||
|
1110 = 250 mS
|
||
|
1111 = 500 mS
|
||
|
0Bh byte Register B
|
||
|
bit 7 = Run (0) / Halt (1)
|
||
|
bit 6 = Periodic interrupt enable
|
||
|
bit 5 = Alarm interrupt enable
|
||
|
bit 4 = Update-ended interrupt enable
|
||
|
bit 3 = Square wave interrupt enable
|
||
|
bit 2 = BCD calendar (0) / Binary (1)
|
||
|
bit 1 = 12-hour mode (0) / 24-hour mode (1)
|
||
|
bit 0 = 0 (DST off) / 1 (DST enabled)
|
||
|
OCh byte Register C (read only)
|
||
|
bit 7 = interrupt request flag (IRQF)
|
||
|
bit 6 = periodic interrupt flag (PF)
|
||
|
bit 5 = alarm interrupt flag (AF)
|
||
|
bit 4 = update interrupt flag (UF)
|
||
|
bit 3-0 = reserved
|
||
|
ODh byte Register D (read only)
|
||
|
bit 7 = RTC has power
|
||
|
bit 6-0 = reserved
|
||
|
32h byte Current century BCD (*)
|
||
|
(*) location vendor specific and now determined from ACPI global tables
|
||
|
|
||
|
2.3) APIC
|
||
|
|
||
|
On Pentium and later processors, an on-board timer is available to each CPU
|
||
|
as part of the Advanced Programmable Interrupt Controller. The APIC is
|
||
|
accessed through memory-mapped registers and provides interrupt service to each
|
||
|
CPU, used for IPIs and local timer interrupts.
|
||
|
|
||
|
Although in theory the APIC is a safe and stable source for local interrupts,
|
||
|
in practice, many bugs and glitches have occurred due to the special nature of
|
||
|
the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect
|
||
|
the use of the APIC and that workarounds may be required. In addition, some of
|
||
|
these workarounds pose unique constraints for virtualization - requiring either
|
||
|
extra overhead incurred from extra reads of memory-mapped I/O or additional
|
||
|
functionality that may be more computationally expensive to implement.
|
||
|
|
||
|
Since the APIC is documented quite well in the Intel and AMD manuals, we will
|
||
|
avoid repetition of the detail here. It should be pointed out that the APIC
|
||
|
timer is programmed through the LVT (local vector timer) register, is capable
|
||
|
of one-shot or periodic operation, and is based on the bus clock divided down
|
||
|
by the programmable divider register.
|
||
|
|
||
|
2.4) HPET
|
||
|
|
||
|
HPET is quite complex, and was originally intended to replace the PIT / RTC
|
||
|
support of the X86 PC. It remains to be seen whether that will be the case, as
|
||
|
the de facto standard of PC hardware is to emulate these older devices. Some
|
||
|
systems designated as legacy free may support only the HPET as a hardware timer
|
||
|
device.
|
||
|
|
||
|
The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
|
||
|
but allowing implementation freedom to support many more. It also imposes no
|
||
|
fixed rate on the timer frequency, but does impose some extremal values on
|
||
|
frequency, error and slew.
|
||
|
|
||
|
In general, the HPET is recommended as a high precision (compared to PIT /RTC)
|
||
|
time source which is independent of local variation (as there is only one HPET
|
||
|
in any given system). The HPET is also memory-mapped, and its presence is
|
||
|
indicated through ACPI tables by the BIOS.
|
||
|
|
||
|
Detailed specification of the HPET is beyond the current scope of this
|
||
|
document, as it is also very well documented elsewhere.
|
||
|
|
||
|
2.5) Offboard Timers
|
||
|
|
||
|
Several cards, both proprietary (watchdog boards) and commonplace (e1000) have
|
||
|
timing chips built into the cards which may have registers which are accessible
|
||
|
to kernel or user drivers. To the author's knowledge, using these to generate
|
||
|
a clocksource for a Linux or other kernel has not yet been attempted and is in
|
||
|
general frowned upon as not playing by the agreed rules of the game. Such a
|
||
|
timer device would require additional support to be virtualized properly and is
|
||
|
not considered important at this time as no known operating system does this.
|
||
|
|
||
|
=========================================================================
|
||
|
|
||
|
3) TSC Hardware
|
||
|
|
||
|
The TSC or time stamp counter is relatively simple in theory; it counts
|
||
|
instruction cycles issued by the processor, which can be used as a measure of
|
||
|
time. In practice, due to a number of problems, it is the most complicated
|
||
|
timekeeping device to use.
|
||
|
|
||
|
The TSC is represented internally as a 64-bit MSR which can be read with the
|
||
|
RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware
|
||
|
limitations made it possible to write the TSC, but generally on old hardware it
|
||
|
was only possible to write the low 32-bits of the 64-bit counter, and the upper
|
||
|
32-bits of the counter were cleared. Now, however, on Intel processors family
|
||
|
0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction
|
||
|
has been lifted and all 64-bits are writable. On AMD systems, the ability to
|
||
|
write the TSC MSR is not an architectural guarantee.
|
||
|
|
||
|
The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by
|
||
|
means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access.
|
||
|
|
||
|
Some vendors have implemented an additional instruction, RDTSCP, which returns
|
||
|
atomically not just the TSC, but an indicator which corresponds to the
|
||
|
processor number. This can be used to index into an array of TSC variables to
|
||
|
determine offset information in SMP systems where TSCs are not synchronized.
|
||
|
The presence of this instruction must be determined by consulting CPUID feature
|
||
|
bits.
|
||
|
|
||
|
Both VMX and SVM provide extension fields in the virtualization hardware which
|
||
|
allows the guest visible TSC to be offset by a constant. Newer implementations
|
||
|
promise to allow the TSC to additionally be scaled, but this hardware is not
|
||
|
yet widely available.
|
||
|
|
||
|
3.1) TSC synchronization
|
||
|
|
||
|
The TSC is a CPU-local clock in most implementations. This means, on SMP
|
||
|
platforms, the TSCs of different CPUs may start at different times depending
|
||
|
on when the CPUs are powered on. Generally, CPUs on the same die will share
|
||
|
the same clock, however, this is not always the case.
|
||
|
|
||
|
The BIOS may attempt to resynchronize the TSCs during the poweron process and
|
||
|
the operating system or other system software may attempt to do this as well.
|
||
|
Several hardware limitations make the problem worse - if it is not possible to
|
||
|
write the full 64-bits of the TSC, it may be impossible to match the TSC in
|
||
|
newly arriving CPUs to that of the rest of the system, resulting in
|
||
|
unsynchronized TSCs. This may be done by BIOS or system software, but in
|
||
|
practice, getting a perfectly synchronized TSC will not be possible unless all
|
||
|
values are read from the same clock, which generally only is possible on single
|
||
|
socket systems or those with special hardware support.
|
||
|
|
||
|
3.2) TSC and CPU hotplug
|
||
|
|
||
|
As touched on already, CPUs which arrive later than the boot time of the system
|
||
|
may not have a TSC value that is synchronized with the rest of the system.
|
||
|
Either system software, BIOS, or SMM code may actually try to establish the TSC
|
||
|
to a value matching the rest of the system, but a perfect match is usually not
|
||
|
a guarantee. This can have the effect of bringing a system from a state where
|
||
|
TSC is synchronized back to a state where TSC synchronization flaws, however
|
||
|
small, may be exposed to the OS and any virtualization environment.
|
||
|
|
||
|
3.3) TSC and multi-socket / NUMA
|
||
|
|
||
|
Multi-socket systems, especially large multi-socket systems are likely to have
|
||
|
individual clocksources rather than a single, universally distributed clock.
|
||
|
Since these clocks are driven by different crystals, they will not have
|
||
|
perfectly matched frequency, and temperature and electrical variations will
|
||
|
cause the CPU clocks, and thus the TSCs to drift over time. Depending on the
|
||
|
exact clock and bus design, the drift may or may not be fixed in absolute
|
||
|
error, and may accumulate over time.
|
||
|
|
||
|
In addition, very large systems may deliberately slew the clocks of individual
|
||
|
cores. This technique, known as spread-spectrum clocking, reduces EMI at the
|
||
|
clock frequency and harmonics of it, which may be required to pass FCC
|
||
|
standards for telecommunications and computer equipment.
|
||
|
|
||
|
It is recommended not to trust the TSCs to remain synchronized on NUMA or
|
||
|
multiple socket systems for these reasons.
|
||
|
|
||
|
3.4) TSC and C-states
|
||
|
|
||
|
C-states, or idling states of the processor, especially C1E and deeper sleep
|
||
|
states may be problematic for TSC as well. The TSC may stop advancing in such
|
||
|
a state, resulting in a TSC which is behind that of other CPUs when execution
|
||
|
is resumed. Such CPUs must be detected and flagged by the operating system
|
||
|
based on CPU and chipset identifications.
|
||
|
|
||
|
The TSC in such a case may be corrected by catching it up to a known external
|
||
|
clocksource.
|
||
|
|
||
|
3.5) TSC frequency change / P-states
|
||
|
|
||
|
To make things slightly more interesting, some CPUs may change frequency. They
|
||
|
may or may not run the TSC at the same rate, and because the frequency change
|
||
|
may be staggered or slewed, at some points in time, the TSC rate may not be
|
||
|
known other than falling within a range of values. In this case, the TSC will
|
||
|
not be a stable time source, and must be calibrated against a known, stable,
|
||
|
external clock to be a usable source of time.
|
||
|
|
||
|
Whether the TSC runs at a constant rate or scales with the P-state is model
|
||
|
dependent and must be determined by inspecting CPUID, chipset or vendor
|
||
|
specific MSR fields.
|
||
|
|
||
|
In addition, some vendors have known bugs where the P-state is actually
|
||
|
compensated for properly during normal operation, but when the processor is
|
||
|
inactive, the P-state may be raised temporarily to service cache misses from
|
||
|
other processors. In such cases, the TSC on halted CPUs could advance faster
|
||
|
than that of non-halted processors. AMD Turion processors are known to have
|
||
|
this problem.
|
||
|
|
||
|
3.6) TSC and STPCLK / T-states
|
||
|
|
||
|
External signals given to the processor may also have the effect of stopping
|
||
|
the TSC. This is typically done for thermal emergency power control to prevent
|
||
|
an overheating condition, and typically, there is no way to detect that this
|
||
|
condition has happened.
|
||
|
|
||
|
3.7) TSC virtualization - VMX
|
||
|
|
||
|
VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
|
||
|
instructions, which is enough for full virtualization of TSC in any manner. In
|
||
|
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
|
||
|
field specified in the VMCS. Special instructions must be used to read and
|
||
|
write the VMCS field.
|
||
|
|
||
|
3.8) TSC virtualization - SVM
|
||
|
|
||
|
SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
|
||
|
instructions, which is enough for full virtualization of TSC in any manner. In
|
||
|
addition, SVM allows passing through the host TSC plus an additional offset
|
||
|
field specified in the SVM control block.
|
||
|
|
||
|
3.9) TSC feature bits in Linux
|
||
|
|
||
|
In summary, there is no way to guarantee the TSC remains in perfect
|
||
|
synchronization unless it is explicitly guaranteed by the architecture. Even
|
||
|
if so, the TSCs in multi-sockets or NUMA systems may still run independently
|
||
|
despite being locally consistent.
|
||
|
|
||
|
The following feature bits are used by Linux to signal various TSC attributes,
|
||
|
but they can only be taken to be meaningful for UP or single node systems.
|
||
|
|
||
|
X86_FEATURE_TSC : The TSC is available in hardware
|
||
|
X86_FEATURE_RDTSCP : The RDTSCP instruction is available
|
||
|
X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states
|
||
|
X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states
|
||
|
X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware)
|
||
|
|
||
|
4) Virtualization Problems
|
||
|
|
||
|
Timekeeping is especially problematic for virtualization because a number of
|
||
|
challenges arise. The most obvious problem is that time is now shared between
|
||
|
the host and, potentially, a number of virtual machines. Thus the virtual
|
||
|
operating system does not run with 100% usage of the CPU, despite the fact that
|
||
|
it may very well make that assumption. It may expect it to remain true to very
|
||
|
exacting bounds when interrupt sources are disabled, but in reality only its
|
||
|
virtual interrupt sources are disabled, and the machine may still be preempted
|
||
|
at any time. This causes problems as the passage of real time, the injection
|
||
|
of machine interrupts and the associated clock sources are no longer completely
|
||
|
synchronized with real time.
|
||
|
|
||
|
This same problem can occur on native harware to a degree, as SMM mode may
|
||
|
steal cycles from the naturally on X86 systems when SMM mode is used by the
|
||
|
BIOS, but not in such an extreme fashion. However, the fact that SMM mode may
|
||
|
cause similar problems to virtualization makes it a good justification for
|
||
|
solving many of these problems on bare metal.
|
||
|
|
||
|
4.1) Interrupt clocking
|
||
|
|
||
|
One of the most immediate problems that occurs with legacy operating systems
|
||
|
is that the system timekeeping routines are often designed to keep track of
|
||
|
time by counting periodic interrupts. These interrupts may come from the PIT
|
||
|
or the RTC, but the problem is the same: the host virtualization engine may not
|
||
|
be able to deliver the proper number of interrupts per second, and so guest
|
||
|
time may fall behind. This is especially problematic if a high interrupt rate
|
||
|
is selected, such as 1000 HZ, which is unfortunately the default for many Linux
|
||
|
guests.
|
||
|
|
||
|
There are three approaches to solving this problem; first, it may be possible
|
||
|
to simply ignore it. Guests which have a separate time source for tracking
|
||
|
'wall clock' or 'real time' may not need any adjustment of their interrupts to
|
||
|
maintain proper time. If this is not sufficient, it may be necessary to inject
|
||
|
additional interrupts into the guest in order to increase the effective
|
||
|
interrupt rate. This approach leads to complications in extreme conditions,
|
||
|
where host load or guest lag is too much to compensate for, and thus another
|
||
|
solution to the problem has risen: the guest may need to become aware of lost
|
||
|
ticks and compensate for them internally. Although promising in theory, the
|
||
|
implementation of this policy in Linux has been extremely error prone, and a
|
||
|
number of buggy variants of lost tick compensation are distributed across
|
||
|
commonly used Linux systems.
|
||
|
|
||
|
Windows uses periodic RTC clocking as a means of keeping time internally, and
|
||
|
thus requires interrupt slewing to keep proper time. It does use a low enough
|
||
|
rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in
|
||
|
practice.
|
||
|
|
||
|
4.2) TSC sampling and serialization
|
||
|
|
||
|
As the highest precision time source available, the cycle counter of the CPU
|
||
|
has aroused much interest from developers. As explained above, this timer has
|
||
|
many problems unique to its nature as a local, potentially unstable and
|
||
|
potentially unsynchronized source. One issue which is not unique to the TSC,
|
||
|
but is highlighted because of its very precise nature is sampling delay. By
|
||
|
definition, the counter, once read is already old. However, it is also
|
||
|
possible for the counter to be read ahead of the actual use of the result.
|
||
|
This is a consequence of the superscalar execution of the instruction stream,
|
||
|
which may execute instructions out of order. Such execution is called
|
||
|
non-serialized. Forcing serialized execution is necessary for precise
|
||
|
measurement with the TSC, and requires a serializing instruction, such as CPUID
|
||
|
or an MSR read.
|
||
|
|
||
|
Since CPUID may actually be virtualized by a trap and emulate mechanism, this
|
||
|
serialization can pose a performance issue for hardware virtualization. An
|
||
|
accurate time stamp counter reading may therefore not always be available, and
|
||
|
it may be necessary for an implementation to guard against "backwards" reads of
|
||
|
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
|
||
|
system.
|
||
|
|
||
|
4.3) Timespec aliasing
|
||
|
|
||
|
Additionally, this lack of serialization from the TSC poses another challenge
|
||
|
when using results of the TSC when measured against another time source. As
|
||
|
the TSC is much higher precision, many possible values of the TSC may be read
|
||
|
while another clock is still expressing the same value.
|
||
|
|
||
|
That is, you may read (T,T+10) while external clock C maintains the same value.
|
||
|
Due to non-serialized reads, you may actually end up with a range which
|
||
|
fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but
|
||
|
calibrated against an external value may have a range of valid values.
|
||
|
Re-calibrating this computation may actually cause time, as computed after the
|
||
|
calibration, to go backwards, compared with time computed before the
|
||
|
calibration.
|
||
|
|
||
|
This problem is particularly pronounced with an internal time source in Linux,
|
||
|
the kernel time, which is expressed in the theoretically high resolution
|
||
|
timespec - but which advances in much larger granularity intervals, sometimes
|
||
|
at the rate of jiffies, and possibly in catchup modes, at a much larger step.
|
||
|
|
||
|
This aliasing requires care in the computation and recalibration of kvmclock
|
||
|
and any other values derived from TSC computation (such as TSC virtualization
|
||
|
itself).
|
||
|
|
||
|
4.4) Migration
|
||
|
|
||
|
Migration of a virtual machine raises problems for timekeeping in two ways.
|
||
|
First, the migration itself may take time, during which interrupts cannot be
|
||
|
delivered, and after which, the guest time may need to be caught up. NTP may
|
||
|
be able to help to some degree here, as the clock correction required is
|
||
|
typically small enough to fall in the NTP-correctable window.
|
||
|
|
||
|
An additional concern is that timers based off the TSC (or HPET, if the raw bus
|
||
|
clock is exposed) may now be running at different rates, requiring compensation
|
||
|
in some way in the hypervisor by virtualizing these timers. In addition,
|
||
|
migrating to a faster machine may preclude the use of a passthrough TSC, as a
|
||
|
faster clock cannot be made visible to a guest without the potential of time
|
||
|
advancing faster than usual. A slower clock is less of a problem, as it can
|
||
|
always be caught up to the original rate. KVM clock avoids these problems by
|
||
|
simply storing multipliers and offsets against the TSC for the guest to convert
|
||
|
back into nanosecond resolution values.
|
||
|
|
||
|
4.5) Scheduling
|
||
|
|
||
|
Since scheduling may be based on precise timing and firing of interrupts, the
|
||
|
scheduling algorithms of an operating system may be adversely affected by
|
||
|
virtualization. In theory, the effect is random and should be universally
|
||
|
distributed, but in contrived as well as real scenarios (guest device access,
|
||
|
causes of virtualization exits, possible context switch), this may not always
|
||
|
be the case. The effect of this has not been well studied.
|
||
|
|
||
|
In an attempt to work around this, several implementations have provided a
|
||
|
paravirtualized scheduler clock, which reveals the true amount of CPU time for
|
||
|
which a virtual machine has been running.
|
||
|
|
||
|
4.6) Watchdogs
|
||
|
|
||
|
Watchdog timers, such as the lock detector in Linux may fire accidentally when
|
||
|
running under hardware virtualization due to timer interrupts being delayed or
|
||
|
misinterpretation of the passage of real time. Usually, these warnings are
|
||
|
spurious and can be ignored, but in some circumstances it may be necessary to
|
||
|
disable such detection.
|
||
|
|
||
|
4.7) Delays and precision timing
|
||
|
|
||
|
Precise timing and delays may not be possible in a virtualized system. This
|
||
|
can happen if the system is controlling physical hardware, or issues delays to
|
||
|
compensate for slower I/O to and from devices. The first issue is not solvable
|
||
|
in general for a virtualized system; hardware control software can't be
|
||
|
adequately virtualized without a full real-time operating system, which would
|
||
|
require an RT aware virtualization platform.
|
||
|
|
||
|
The second issue may cause performance problems, but this is unlikely to be a
|
||
|
significant issue. In many cases these delays may be eliminated through
|
||
|
configuration or paravirtualization.
|
||
|
|
||
|
4.8) Covert channels and leaks
|
||
|
|
||
|
In addition to the above problems, time information will inevitably leak to the
|
||
|
guest about the host in anything but a perfect implementation of virtualized
|
||
|
time. This may allow the guest to infer the presence of a hypervisor (as in a
|
||
|
red-pill type detection), and it may allow information to leak between guests
|
||
|
by using CPU utilization itself as a signalling channel. Preventing such
|
||
|
problems would require completely isolated virtual time which may not track
|
||
|
real time any longer. This may be useful in certain security or QA contexts,
|
||
|
but in general isn't recommended for real-world deployment scenarios.
|