// SPDX-License-Identifier: GPL-2.0-only
/*
 *
 * Copyright 2010-2011 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
 */

#include <linux/types.h>
#include <linux/string.h>
#include <linux/kvm.h>
#include <linux/kvm_host.h>
#include <linux/hugetlb.h>
KVM: PPC: Only get pages when actually needed, not in prepare_memory_region()
This removes the code from kvmppc_core_prepare_memory_region() that
looked up the VMA for the region being added and called hva_to_page
to get the pfns for the memory. We have no guarantee that there will
be anything mapped there at the time of the KVM_SET_USER_MEMORY_REGION
ioctl call; userspace can do that ioctl and then map memory into the
region later.
Instead we defer looking up the pfn for each memory page until it is
needed, which generally means when the guest does an H_ENTER hcall on
the page. Since we can't call get_user_pages in real mode, if we don't
already have the pfn for the page, kvmppc_h_enter() will return
H_TOO_HARD and we then call kvmppc_virtmode_h_enter() once we get back
to kernel context. That calls kvmppc_get_guest_page() to get the pfn
for the page, and then calls back to kvmppc_h_enter() to redo the HPTE
insertion.
When the first vcpu starts executing, we need to have the RMO or VRMA
region mapped so that the guest's real mode accesses will work. Thus
we now have a check in kvmppc_vcpu_run() to see if the RMO/VRMA is set
up and if not, call kvmppc_hv_setup_rma(). It checks if the memslot
starting at guest physical 0 now has RMO memory mapped there; if so it
sets it up for the guest, otherwise on POWER7 it sets up the VRMA.
The function that does that, kvmppc_map_vrma, is now a bit simpler,
as it calls kvmppc_virtmode_h_enter instead of creating the HPTE itself.
Since we are now potentially updating entries in the slot_phys[]
arrays from multiple vcpu threads, we now have a spinlock protecting
those updates to ensure that we don't lose track of any references
to pages.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:31:00 +08:00
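The deferred lookup described in this commit message can be sketched in plain userspace C. This is only an illustration under stated assumptions: the `sk_` names, the return codes, and the fixed-size `pfn[]` cache are hypothetical stand-ins for the real-mode handler, the virtual-mode retry, and the `slot_phys[]` array, and the pfn values are faked rather than obtained from `get_user_pages`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative return codes; these are not the kernel's hcall values. */
enum { SK_SUCCESS = 0, SK_TOO_HARD = 1, SK_PARAMETER = 2 };

#define SK_NPAGES 8

/* Hypothetical stand-in for one memslot's cached page frame numbers. */
struct sk_slot {
	unsigned long pfn[SK_NPAGES];
	bool have_pfn[SK_NPAGES];
	int lookups;		/* number of slow-path lookups performed */
};

/* Fast path: may only use a pfn that has already been cached. */
static int sk_h_enter_realmode(struct sk_slot *slot, size_t gfn,
			       unsigned long *pfn_out)
{
	if (gfn >= SK_NPAGES)
		return SK_PARAMETER;
	if (!slot->have_pfn[gfn])
		return SK_TOO_HARD;	/* caller must retry from the slow path */
	*pfn_out = slot->pfn[gfn];
	return SK_SUCCESS;
}

/* Slow path: look up the pfn (faked here) and cache it for later. */
static void sk_get_guest_page(struct sk_slot *slot, size_t gfn)
{
	slot->pfn[gfn] = 0x1000 + gfn;
	slot->have_pfn[gfn] = true;
	slot->lookups++;
}

/* Try the fast path, and redo the insertion once if it was "too hard". */
static int sk_h_enter(struct sk_slot *slot, size_t gfn, unsigned long *pfn_out)
{
	int ret = sk_h_enter_realmode(slot, gfn, pfn_out);

	if (ret != SK_TOO_HARD)
		return ret;
	sk_get_guest_page(slot, gfn);
	return sk_h_enter_realmode(slot, gfn, pfn_out);
}
```

The first `sk_h_enter()` call for a page takes the slow path once; later calls for the same page are served from the cache, which mirrors the "only get pages when actually needed" idea.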
#include <linux/module.h>
KVM: PPC: Book3S HV: Fix bug in dirty page tracking
This fixes a bug in the tracking of pages that get modified by the
guest. If the guest creates a large-page HPTE, writes to memory
somewhere within the large page, and then removes the HPTE, we only
record the modified state for the first normal page within the large
page, when in fact the guest might have modified some other normal
page within the large page.
To fix this we use some unused bits in the rmap entry to record the
order (log base 2) of the size of the page that was modified, when
removing an HPTE. Then in kvm_test_clear_dirty_npages() we use that
order to return the correct number of modified pages.
The same thing could in principle happen when removing a HPTE at the
host's request, i.e. when paging out a page, except that we never
page out large pages, and the guest can only create large-page HPTEs
if the guest RAM is backed by large pages. However, we also fix
this case for the sake of future-proofing.
The reference bit is also subject to the same loss of information. We
don't make the same fix here for the reference bit because there isn't
an interface for userspace to find out which pages the guest has
referenced, whereas there is one for userspace to find out which pages
the guest has modified. Because of this loss of information, the
kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
say that a page has not been referenced when it has, but that doesn't
matter greatly because we never page or swap out large pages.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-06-24 19:18:06 +08:00
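As a rough sketch of the fix described in this commit message, the following userspace C records the order of the modified page next to a "changed" flag and converts it back into a page count. The bit positions, the 4 KiB base page size, and the `sk_` names are assumptions made for illustration; they are not the kernel's actual rmap layout.

```c
#include <stdint.h>

/* Hypothetical rmap layout: bit 63 is the changed flag, bits 56-61 the order. */
#define SK_RMAP_CHANGED		(UINT64_C(1) << 63)
#define SK_RMAP_ORDER_SHIFT	56
#define SK_RMAP_ORDER_MASK	(UINT64_C(0x3f) << SK_RMAP_ORDER_SHIFT)
#define SK_PAGE_SHIFT		12	/* assumed 4 KiB normal pages */

/* Record that a page of size 2^order bytes was modified (order < 64). */
static void sk_note_modified(uint64_t *rmap, unsigned int order)
{
	unsigned int old = (unsigned int)((*rmap & SK_RMAP_ORDER_MASK) >>
					  SK_RMAP_ORDER_SHIFT);

	/* Keep the largest order seen since the last clear. */
	if (order > old) {
		*rmap &= ~SK_RMAP_ORDER_MASK;
		*rmap |= (uint64_t)order << SK_RMAP_ORDER_SHIFT;
	}
	*rmap |= SK_RMAP_CHANGED;
}

/* Return the number of normal pages to report as dirty, and clear the state. */
static unsigned long sk_test_clear_dirty_npages(uint64_t *rmap)
{
	unsigned int order;

	if (!(*rmap & SK_RMAP_CHANGED))
		return 0;
	order = (unsigned int)((*rmap & SK_RMAP_ORDER_MASK) >>
			       SK_RMAP_ORDER_SHIFT);
	*rmap &= ~(SK_RMAP_CHANGED | SK_RMAP_ORDER_MASK);
	if (order <= SK_PAGE_SHIFT)
		return 1;
	return 1UL << (order - SK_PAGE_SHIFT);
}
```

With these assumed sizes, a modified 16 MiB page (order 24) is reported as 4096 dirty normal pages instead of one.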
#include <linux/log2.h>
#include <linux/sizes.h>

#include <asm/trace.h>
#include <asm/kvm_ppc.h>
#include <asm/kvm_book3s.h>
#include <asm/book3s/64/mmu-hash.h>
#include <asm/hvcall.h>
#include <asm/synch.h>
#include <asm/ppc-opcode.h>
#include <asm/pte-walk.h>

/* Translate address of a vmalloc'd thing to a linear map address */
static void *real_vmalloc_addr(void *x)
{
	unsigned long addr = (unsigned long) x;
	pte_t *p;
	/*
	 * Assume there are no huge pages in vmalloc space, so there is
	 * no need to worry about THP collapse/split. This is called
	 * only in real mode with MSR_EE = 0, hence there is no need
	 * for irq_save/restore.
	 */
	p = find_init_mm_pte(addr, NULL);
	if (!p || !pte_present(*p))
		return NULL;
	/* Combine the page frame number with the offset within the page. */
	addr = (pte_pfn(*p) << PAGE_SHIFT) | (addr & ~PAGE_MASK);
	return __va(addr);
}
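
The address arithmetic in real_vmalloc_addr() can be tried in isolation. The sketch below is a userspace illustration only: it assumes 64 KiB pages (`SK_PAGE_SHIFT` of 16), and the `sk_` names are hypothetical. It shows just the step that combines a page frame number with the offset inside the page, without the page table walk or the `__va()` conversion.

```c
#include <stdint.h>

#define SK_PAGE_SHIFT	16	/* assumed 64 KiB pages */
#define SK_PAGE_SIZE	(UINT64_C(1) << SK_PAGE_SHIFT)
#define SK_PAGE_MASK	(~(SK_PAGE_SIZE - 1))

/* Build a physical address from a pfn and the in-page offset of vaddr. */
static uint64_t sk_phys_addr(uint64_t pfn, uint64_t vaddr)
{
	return (pfn << SK_PAGE_SHIFT) | (vaddr & ~SK_PAGE_MASK);
}
```

For example, a pfn of 0x1234 and a virtual address whose low 16 bits are 0xabcd combine to 0x1234abcd; the upper bits of the virtual address play no part in the result.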
KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
When we change or remove a HPT (hashed page table) entry, we can do
either a global TLB invalidation (tlbie) that works across the whole
machine, or a local invalidation (tlbiel) that only affects this core.
Currently we do local invalidations if the VM has only one vcpu or if
the guest requests it with the H_LOCAL flag, though the guest Linux
kernel currently doesn't ever use H_LOCAL. Then, to cope with the
possibility that vcpus moving around to different physical cores might
expose stale TLB entries, there is some code in kvmppc_hv_entry to
flush the whole TLB of entries for this VM if either this vcpu is now
running on a different physical core from where it last ran, or if this
physical core last ran a different vcpu.
There are a number of problems on POWER7 with this as it stands:
- The TLB invalidation is done per thread, whereas it only needs to be
  done per core, since the TLB is shared between the threads.
- With the possibility of the host paging out guest pages, the use of
  H_LOCAL by an SMP guest is dangerous since the guest could possibly
  retain and use a stale TLB entry pointing to a page that had been
  removed from the guest.
- The TLB invalidations that we do when a vcpu moves from one physical
  core to another are unnecessary in the case of an SMP guest that isn't
  using H_LOCAL.
- The optimization of using local invalidations rather than global should
  apply to guests with one virtual core, not just one vcpu.
(None of this applies on PPC970, since there we always have to
invalidate the whole TLB when entering and leaving the guest, and we
can't support paging out guest memory.)
To fix these problems and simplify the code, we now maintain a simple
cpumask of which cpus need to flush the TLB on entry to the guest.
(This is indexed by cpu, though we only ever use the bits for thread
0 of each core.) Whenever we do a local TLB invalidation, we set the
bits for every cpu except the bit for thread 0 of the core that we're
currently running on. Whenever we enter a guest, we test and clear the
bit for our core, and flush the TLB if it was set.
On initial startup of the VM, and when resetting the HPT, we set all the
bits in the need_tlb_flush cpumask, since any core could potentially have
stale TLB entries from the previous VM to use the same LPID, or the
previous contents of the HPT.
Then, we maintain a count of the number of online virtual cores, and use
that when deciding whether to use a local invalidation rather than the
number of online vcpus. The code to make that decision is extracted out
into a new function, global_invalidates(). For multi-core guests on
POWER7 (i.e. when we are using mmu notifiers), we now never do local
invalidations regardless of the H_LOCAL flag.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-11-22 07:28:08 +08:00
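The need_tlb_flush bookkeeping described in this commit message can be modelled with a plain bitmask. This is a single-threaded sketch under assumptions: 16 cpus with 4 threads per core, a `uint32_t` in place of a cpumask, a counter in place of the real TLB flush, and hypothetical `sk_` names. It omits the memory barriers and atomic bit operations that the kernel code needs.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_THREADS_PER_CORE	4	/* assumed topology */
#define SK_NR_CPUS		16
#define SK_ALL_CPUS		((UINT32_C(1) << SK_NR_CPUS) - 1)

struct sk_vm {
	uint32_t need_tlb_flush;	/* one bit per cpu; only thread 0 bits are used */
	int flushes;			/* counts simulated TLB flushes */
};

/* Thread 0 of the core that contains cpu. */
static int sk_first_thread(int cpu)
{
	return cpu - (cpu % SK_THREADS_PER_CORE);
}

/* At VM startup any core may hold stale entries, so set every bit. */
static void sk_vm_init(struct sk_vm *vm)
{
	vm->need_tlb_flush = SK_ALL_CPUS;
	vm->flushes = 0;
}

/* After a local invalidation, every core except ours must flush later. */
static void sk_local_invalidate(struct sk_vm *vm, int cpu)
{
	vm->need_tlb_flush = SK_ALL_CPUS;
	vm->need_tlb_flush &= ~(UINT32_C(1) << sk_first_thread(cpu));
}

/* On guest entry, test and clear our core's bit; flush if it was set. */
static bool sk_enter_guest(struct sk_vm *vm, int cpu)
{
	uint32_t bit = UINT32_C(1) << sk_first_thread(cpu);

	if (!(vm->need_tlb_flush & bit))
		return false;
	vm->need_tlb_flush &= ~bit;
	vm->flushes++;		/* stands in for the real TLB flush */
	return true;
}
```

Because the bit is tracked per core, a second thread entering the guest on the same core right after a flush does not flush again.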
/* Return 1 if we need to do a global tlbie, 0 if we can use tlbiel */
static int global_invalidates(struct kvm *kvm)
{
	int global;
KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement
With radix, the guest can do TLB invalidations itself using the tlbie
(global) and tlbiel (local) TLB invalidation instructions. Linux guests
use local TLB invalidations for translations that have only ever been
accessed on one vcpu. However, that doesn't mean that the translations
have only been accessed on one physical cpu (pcpu) since vcpus can move
around from one pcpu to another. Thus a tlbiel might leave behind stale
TLB entries on a pcpu where the vcpu previously ran, and if that task
then moves back to that previous pcpu, it could see those stale TLB
entries and thus access memory incorrectly. The usual symptom of this
is random segfaults in userspace programs in the guest.
To cope with this, we detect when a vcpu is about to start executing on
a thread in a core that is a different core from the last time it
executed. If that is the case, then we mark the core as needing a
TLB flush and then send an interrupt to any thread in the core that is
currently running a vcpu from the same guest. This will get those vcpus
out of the guest, and the first one to re-enter the guest will do the
TLB flush. The reason for interrupting the vcpus executing on the old
core is to cope with the following scenario:
CPU 0                   CPU 1                   CPU 4
(core 0)                (core 0)                (core 1)

VCPU 0 runs task X      VCPU 1 runs
core 0 TLB gets
entries from task X
VCPU 0 moves to CPU 4
                                                VCPU 0 runs task X
                                                Unmap pages of task X
                                                tlbiel

                        (still VCPU 1)          task X moves to VCPU 1
                        task X runs
                        task X sees stale TLB
                        entries
That is, as soon as the VCPU starts executing on the new core, it
could unmap and tlbiel some page table entries, and then the task
could migrate to one of the VCPUs running on the old core and
potentially see stale TLB entries.
Since the TLB is shared between all the threads in a core, we only
use the bit of kvm->arch.need_tlb_flush corresponding to the first
thread in the core. To ensure that we don't have a window where we
can miss a flush, this moves the clearing of the bit from before the
actual flush to after it. This way, two threads might both do the
flush, but we prevent the situation where one thread can enter the
guest before the flush is finished.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-30 18:21:50 +08:00
	int cpu;

	/*
	 * If there is only one vcore, and it's currently running,
	 * as indicated by local_paca->kvm_hstate.kvm_vcpu being set,
	 * we can use tlbiel as long as we mark all other physical
	 * cores as potentially having stale TLB entries for this lpid.
	 * Otherwise, don't use tlbiel.
	 */
	if (kvm->arch.online_vcores == 1 && local_paca->kvm_hstate.kvm_vcpu)
		global = 0;
	else
		global = 1;

	if (!global) {
		/* any other core might now have stale TLB entries... */
		smp_wmb();
		cpumask_setall(&kvm->arch.need_tlb_flush);
		cpu = local_paca->kvm_hstate.kvm_vcore->pcpu;
		/*
		 * On POWER9, threads are independent but the TLB is shared,
		 * so use the bit for the first thread to represent the core.
		 */
		if (cpu_has_feature(CPU_FTR_ARCH_300))
			cpu = cpu_first_tlb_thread_sibling(cpu);
		cpumask_clear_cpu(cpu, &kvm->arch.need_tlb_flush);
KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
When we change or remove a HPT (hashed page table) entry, we can do
either a global TLB invalidation (tlbie) that works across the whole
machine, or a local invalidation (tlbiel) that only affects this core.
Currently we do local invalidations if the VM has only one vcpu or if
the guest requests it with the H_LOCAL flag, though the guest Linux
kernel currently doesn't ever use H_LOCAL. Then, to cope with the
possibility that vcpus moving around to different physical cores might
expose stale TLB entries, there is some code in kvmppc_hv_entry to
flush the whole TLB of entries for this VM if either this vcpu is now
running on a different physical core from where it last ran, or if this
physical core last ran a different vcpu.
There are a number of problems on POWER7 with this as it stands:
- The TLB invalidation is done per thread, whereas it only needs to be
  done per core, since the TLB is shared between the threads.
- With the possibility of the host paging out guest pages, the use of
  H_LOCAL by an SMP guest is dangerous since the guest could possibly
  retain and use a stale TLB entry pointing to a page that had been
  removed from the guest.
- The TLB invalidations that we do when a vcpu moves from one physical
  core to another are unnecessary in the case of an SMP guest that isn't
  using H_LOCAL.
- The optimization of using local invalidations rather than global should
  apply to guests with one virtual core, not just one vcpu.
(None of this applies on PPC970, since there we always have to
invalidate the whole TLB when entering and leaving the guest, and we
can't support paging out guest memory.)
To fix these problems and simplify the code, we now maintain a simple
cpumask of which cpus need to flush the TLB on entry to the guest.
(This is indexed by cpu, though we only ever use the bits for thread
0 of each core.) Whenever we do a local TLB invalidation, we set the
bits for every cpu except the bit for thread 0 of the core that we're
currently running on. Whenever we enter a guest, we test and clear the
bit for our core, and flush the TLB if it was set.
On initial startup of the VM, and when resetting the HPT, we set all the
bits in the need_tlb_flush cpumask, since any core could potentially have
stale TLB entries from the previous VM to use the same LPID, or the
previous contents of the HPT.
Then, we maintain a count of the number of online virtual cores, and use
that when deciding whether to use a local invalidation rather than the
number of online vcpus. The code to make that decision is extracted out
into a new function, global_invalidates(). For multi-core guests on
POWER7 (i.e. when we are using mmu notifiers), we now never do local
invalidations regardless of the H_LOCAL flag.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
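The need_tlb_flush bookkeeping described in this commit message can be sketched in isolation. This is a minimal illustration, not the kernel's code: it assumes at most 64 cpus, 8 threads per core, and a plain 64-bit word in place of a cpumask, and the names flush_mask_t, core_first_thread(), note_local_invalidation() and test_and_clear_flush() are hypothetical. The updates here are not atomic, and the sketch leaves the current core's own bit unchanged on a local invalidation, which is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define THREADS_PER_CORE 8u	/* assumption: 8 hardware threads per core */

/* Hypothetical stand-in for kvm->arch.need_tlb_flush: one bit per cpu. */
typedef uint64_t flush_mask_t;

/* Bit index of thread 0 of the core that contains 'cpu'. */
static unsigned int core_first_thread(unsigned int cpu)
{
	return cpu - (cpu % THREADS_PER_CORE);
}

/*
 * After a local invalidation on 'cpu', set the bits for every cpu
 * except thread 0 of the current core, whose TLB is already clean.
 * That bit is left as it was (an assumption of this sketch).
 */
static void note_local_invalidation(flush_mask_t *mask, unsigned int cpu)
{
	assert(cpu < 64);
	*mask |= ~(UINT64_C(1) << core_first_thread(cpu));
}

/*
 * On guest entry on 'cpu', test and clear the bit for this core.
 * Returns true if the caller should flush the TLB before entering.
 */
static bool test_and_clear_flush(flush_mask_t *mask, unsigned int cpu)
{
	flush_mask_t bit;
	bool need;

	assert(cpu < 64);
	bit = UINT64_C(1) << core_first_thread(cpu);
	need = (*mask & bit) != 0;
	*mask &= ~bit;
	return need;
}
```

Because only the bit for thread 0 of each core is ever consulted, two threads of the same core share one flag, which matches the observation that the TLB is shared between the threads of a core.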
2012-11-22 07:28:08 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
return global;
|
|
|
|
}
|
|
|
|
|
2011-12-12 20:33:07 +08:00
|
|
|
/*
|
|
|
|
* Add this HPTE into the chain for the real page.
|
|
|
|
* Must be called with the chain locked; it unlocks the chain.
|
|
|
|
*/
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:38:05 +08:00
|
|
|
void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
|
2011-12-12 20:33:07 +08:00
|
|
|
unsigned long *rmap, long pte_index, int realmode)
|
|
|
|
{
|
|
|
|
struct revmap_entry *head, *tail;
|
|
|
|
unsigned long i;
|
|
|
|
|
|
|
|
if (*rmap & KVMPPC_RMAP_PRESENT) {
|
|
|
|
i = *rmap & KVMPPC_RMAP_INDEX;
|
2016-12-20 13:49:00 +08:00
|
|
|
head = &kvm->arch.hpt.rev[i];
|
2011-12-12 20:33:07 +08:00
|
|
|
if (realmode)
|
|
|
|
head = real_vmalloc_addr(head);
|
2016-12-20 13:49:00 +08:00
|
|
|
tail = &kvm->arch.hpt.rev[head->back];
|
2011-12-12 20:33:07 +08:00
|
|
|
if (realmode)
|
|
|
|
tail = real_vmalloc_addr(tail);
|
|
|
|
rev->forw = i;
|
|
|
|
rev->back = head->back;
|
|
|
|
tail->forw = pte_index;
|
|
|
|
head->back = pte_index;
|
|
|
|
} else {
|
|
|
|
rev->forw = rev->back = pte_index;
|
2012-11-20 07:01:34 +08:00
|
|
|
*rmap = (*rmap & ~KVMPPC_RMAP_INDEX) |
|
2019-08-20 14:13:49 +08:00
|
|
|
pte_index | KVMPPC_RMAP_PRESENT | KVMPPC_RMAP_HPT;
|
2011-12-12 20:33:07 +08:00
|
|
|
}
|
2012-11-20 07:01:34 +08:00
|
|
|
unlock_rmap(rmap);
|
2011-12-12 20:33:07 +08:00
|
|
|
}
|
2011-12-12 20:38:05 +08:00
|
|
|
EXPORT_SYMBOL_GPL(kvmppc_add_revmap_chain);
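The chain that kvmppc_add_revmap_chain() maintains is a circular doubly linked list whose links are array indices rather than pointers. The following is a simplified sketch of that insertion step only, assuming a small array and omitting locking, real-mode address translation and the rmap flag bits. The types rev_entry and chain and the function chain_add() are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct revmap_entry: links are indices. */
struct rev_entry {
	uint32_t forw;
	uint32_t back;
};

/* Hypothetical stand-in for the rmap word: present flag plus head index. */
struct chain {
	bool present;
	uint32_t head;
};

/* Insert entry 'idx' at the tail of the circular chain described by 'c'. */
static void chain_add(struct rev_entry *rev, struct chain *c, uint32_t idx)
{
	if (c->present) {
		uint32_t head = c->head;
		uint32_t tail = rev[head].back;

		/* New entry sits between the old tail and the head. */
		rev[idx].forw = head;
		rev[idx].back = tail;
		rev[tail].forw = idx;
		rev[head].back = idx;
	} else {
		/* First entry: it is its own successor and predecessor. */
		rev[idx].forw = rev[idx].back = idx;
		c->head = idx;
		c->present = true;
	}
}
```

Storing indices instead of pointers keeps each entry small and means the links stay valid whether the array is reached through a virtual-mode or a real-mode address.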
|
2011-12-12 20:33:07 +08:00
|
|
|
|
KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix
Currently, the HPT code in HV KVM maintains a dirty bit per guest page
in the rmap array, whether or not dirty page tracking has been enabled
for the memory slot. In contrast, the radix code maintains a dirty
bit per guest page in memslot->dirty_bitmap, and only does so when
dirty page tracking has been enabled.
This changes the HPT code to maintain the dirty bits in the memslot
dirty_bitmap like radix does. This results in slightly less code
overall, and will mean that we do not lose the dirty bits when
transitioning between HPT and radix mode in future.
There is one minor change to behaviour as a result. With HPT, when
dirty tracking was enabled for a memslot, we would previously clear
all the dirty bits at that point (both in the HPT entries and in the
rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately
following would show no pages as dirty (assuming no vcpus have run
in the meantime). With this change, the dirty bits on HPT entries
are not cleared at the point where dirty tracking is enabled, so
KVM_GET_DIRTY_LOG would show as dirty any guest pages that are
resident in the HPT and dirty. This is consistent with what happens
on radix.
This also fixes a bug in the mark_pages_dirty() function for radix
(in the sense that the function no longer exists). In the case where
a large page of 64 normal pages or more is marked dirty, the
addressing of the dirty bitmap was incorrect and could write past
the end of the bitmap. Fortunately this case was never hit in
practice because a 2MB large page is only 32 x 64kB pages, and we
don't support backing the guest with 1GB huge pages at this point.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-26 13:39:19 +08:00
|
|
|
/* Update the dirty bitmap of a memslot */
|
2018-12-12 12:16:48 +08:00
|
|
|
void kvmppc_update_dirty_map(const struct kvm_memory_slot *memslot,
|
2017-10-26 13:39:19 +08:00
|
|
|
unsigned long gfn, unsigned long psize)
|
KVM: PPC: Book3S HV: Fix bug in dirty page tracking
This fixes a bug in the tracking of pages that get modified by the
guest. If the guest creates a large-page HPTE, writes to memory
somewhere within the large page, and then removes the HPTE, we only
record the modified state for the first normal page within the large
page, when in fact the guest might have modified some other normal
page within the large page.
To fix this we use some unused bits in the rmap entry to record the
order (log base 2) of the size of the page that was modified, when
removing an HPTE. Then in kvm_test_clear_dirty_npages() we use that
order to return the correct number of modified pages.
The same thing could in principle happen when removing a HPTE at the
host's request, i.e. when paging out a page, except that we never
page out large pages, and the guest can only create large-page HPTEs
if the guest RAM is backed by large pages. However, we also fix
this case for the sake of future-proofing.
The reference bit is also subject to the same loss of information. We
don't make the same fix here for the reference bit because there isn't
an interface for userspace to find out which pages the guest has
referenced, whereas there is one for userspace to find out which pages
the guest has modified. Because of this loss of information, the
kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
say that a page has not been referenced when it has, but that doesn't
matter greatly because we never page or swap out large pages.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-06-24 19:18:06 +08:00
|
|
|
{
|
2017-10-26 13:39:19 +08:00
|
|
|
unsigned long npages;
|
2015-06-24 19:18:06 +08:00
|
|
|
|
2017-10-26 13:39:19 +08:00
|
|
|
if (!psize || !memslot->dirty_bitmap)
|
2015-06-24 19:18:06 +08:00
|
|
|
return;
|
2017-10-26 13:39:19 +08:00
|
|
|
npages = (psize + PAGE_SIZE - 1) / PAGE_SIZE;
|
|
|
|
gfn -= memslot->base_gfn;
|
|
|
|
set_dirty_bits_atomic(memslot->dirty_bitmap, gfn, npages);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvmppc_update_dirty_map);
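The arithmetic in kvmppc_update_dirty_map() rounds the HPTE page size up to a whole number of host pages and marks that many consecutive bits, starting at the slot-relative frame number. The sketch below shows the same calculation on a plain bitmap. It assumes a 64 kB host page size, uses non-atomic bit updates, and the names SKETCH_PAGE_SIZE, pages_covered(), mark_dirty() and is_dirty() are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 65536UL	/* assumption: 64 kB host pages */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Number of host pages spanned by a mapping of 'psize' bytes, rounded up. */
static unsigned long pages_covered(unsigned long psize)
{
	return (psize + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Mark every host page covered by the mapping at 'gfn' as dirty. */
static void mark_dirty(unsigned long *bitmap, unsigned long base_gfn,
		       unsigned long gfn, unsigned long psize)
{
	unsigned long i, first, npages;

	/* Nothing to do for an invalid size or a slot without a bitmap. */
	if (!psize || !bitmap)
		return;
	npages = pages_covered(psize);
	first = gfn - base_gfn;		/* slot-relative frame number */
	for (i = first; i < first + npages; i++)
		bitmap[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
}

/* Test whether slot-relative page 'i' is marked dirty. */
static int is_dirty(const unsigned long *bitmap, unsigned long i)
{
	return (bitmap[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL;
}
```

With these assumptions a 2 MB large page covers 32 host pages, so a single call marks 32 consecutive bits, while a mapping smaller than one host page still marks exactly one bit.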
|
|
|
|
|
|
|
|
static void kvmppc_set_dirty_from_hpte(struct kvm *kvm,
|
|
|
|
unsigned long hpte_v, unsigned long hpte_gr)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *memslot;
|
|
|
|
unsigned long gfn;
|
|
|
|
unsigned long psize;
|
|
|
|
|
|
|
|
psize = kvmppc_actual_pgsz(hpte_v, hpte_gr);
|
|
|
|
gfn = hpte_rpn(hpte_gr, psize);
|
|
|
|
memslot = __gfn_to_memslot(kvm_memslots_raw(kvm), gfn);
|
|
|
|
if (memslot && memslot->dirty_bitmap)
|
|
|
|
kvmppc_update_dirty_map(memslot, gfn, psize);
|
2015-06-24 19:18:06 +08:00
|
|
|
}
|
|
|
|
|
2015-06-24 19:18:07 +08:00
|
|
|
/* Returns a pointer to the revmap entry for the page mapped by a HPTE */
|
|
|
|
static unsigned long *revmap_for_hpte(struct kvm *kvm, unsigned long hpte_v,
|
2017-10-26 13:39:19 +08:00
|
|
|
unsigned long hpte_gr,
|
|
|
|
struct kvm_memory_slot **memslotp,
|
|
|
|
unsigned long *gfnp)
|
2015-06-24 19:18:07 +08:00
|
|
|
{
|
|
|
|
struct kvm_memory_slot *memslot;
|
|
|
|
unsigned long *rmap;
|
|
|
|
unsigned long gfn;
|
|
|
|
|
2017-09-11 13:29:45 +08:00
|
|
|
gfn = hpte_rpn(hpte_gr, kvmppc_actual_pgsz(hpte_v, hpte_gr));
|
2015-06-24 19:18:07 +08:00
|
|
|
memslot = __gfn_to_memslot(kvm_memslots_raw(kvm), gfn);
|
2017-10-26 13:39:19 +08:00
|
|
|
if (memslotp)
|
|
|
|
*memslotp = memslot;
|
|
|
|
if (gfnp)
|
|
|
|
*gfnp = gfn;
|
2015-06-24 19:18:07 +08:00
|
|
|
if (!memslot)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
rmap = real_vmalloc_addr(&memslot->arch.rmap[gfn - memslot->base_gfn]);
|
|
|
|
return rmap;
|
|
|
|
}
|
|
|
|
|
2011-12-12 20:33:07 +08:00
|
|
|
/* Remove this HPTE from the chain for a real page */
|
|
|
|
static void remove_revmap_chain(struct kvm *kvm, long pte_index,
|
2011-12-15 10:02:02 +08:00
|
|
|
struct revmap_entry *rev,
|
|
|
|
unsigned long hpte_v, unsigned long hpte_r)
|
2011-12-12 20:33:07 +08:00
|
|
|
{
|
2011-12-15 10:02:02 +08:00
|
|
|
struct revmap_entry *next, *prev;
|
2015-06-24 19:18:07 +08:00
|
|
|
unsigned long ptel, head;
|
2011-12-12 20:33:07 +08:00
|
|
|
unsigned long *rmap;
|
2011-12-15 10:02:02 +08:00
|
|
|
unsigned long rcbits;
|
2017-10-26 13:39:19 +08:00
|
|
|
struct kvm_memory_slot *memslot;
|
|
|
|
unsigned long gfn;
|
2011-12-12 20:33:07 +08:00
|
|
|
|
2011-12-15 10:02:02 +08:00
|
|
|
rcbits = hpte_r & (HPTE_R_R | HPTE_R_C);
|
|
|
|
ptel = rev->guest_rpte |= rcbits;
|
2017-10-26 13:39:19 +08:00
|
|
|
rmap = revmap_for_hpte(kvm, hpte_v, ptel, &memslot, &gfn);
|
2015-06-24 19:18:07 +08:00
|
|
|
if (!rmap)
|
2011-12-12 20:33:07 +08:00
|
|
|
return;
|
|
|
|
lock_rmap(rmap);
|
|
|
|
|
|
|
|
head = *rmap & KVMPPC_RMAP_INDEX;
|
2016-12-20 13:49:00 +08:00
|
|
|
next = real_vmalloc_addr(&kvm->arch.hpt.rev[rev->forw]);
|
|
|
|
prev = real_vmalloc_addr(&kvm->arch.hpt.rev[rev->back]);
|
2011-12-12 20:33:07 +08:00
|
|
|
next->back = rev->back;
|
|
|
|
prev->forw = rev->forw;
|
|
|
|
if (head == pte_index) {
|
|
|
|
head = rev->forw;
|
|
|
|
if (head == pte_index)
|
|
|
|
*rmap &= ~(KVMPPC_RMAP_PRESENT | KVMPPC_RMAP_INDEX);
|
|
|
|
else
|
|
|
|
*rmap = (*rmap & ~KVMPPC_RMAP_INDEX) | head;
|
|
|
|
}
|
2011-12-15 10:02:02 +08:00
|
|
|
*rmap |= rcbits << KVMPPC_RMAP_RC_SHIFT;
|
KVM: PPC: Book3S HV: Fix bug in dirty page tracking
This fixes a bug in the tracking of pages that get modified by the
guest. If the guest creates a large-page HPTE, writes to memory
somewhere within the large page, and then removes the HPTE, we only
record the modified state for the first normal page within the large
page, when in fact the guest might have modified some other normal
page within the large page.
To fix this we use some unused bits in the rmap entry to record the
order (log base 2) of the size of the page that was modified, when
removing an HPTE. Then in kvm_test_clear_dirty_npages() we use that
order to return the correct number of modified pages.
The same thing could in principle happen when removing a HPTE at the
host's request, i.e. when paging out a page, except that we never
page out large pages, and the guest can only create large-page HPTEs
if the guest RAM is backed by large pages. However, we also fix
this case for the sake of future-proofing.
The reference bit is also subject to the same loss of information. We
don't make the same fix here for the reference bit because there isn't
an interface for userspace to find out which pages the guest has
referenced, whereas there is one for userspace to find out which pages
the guest has modified. Because of this loss of information, the
kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
say that a page has not been referenced when it has, but that doesn't
matter greatly because we never page or swap out large pages.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2015-06-24 19:18:06 +08:00
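The idea of recording the page-size order in spare bits of an rmap entry, as described in the commit message above, might look roughly like the following. This is a sketch under stated assumptions: the field position (`ORDER_SHIFT`, `ORDER_MASK`), the 64kB normal page size, and both function names are hypothetical and do not reproduce the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical layout: a 6-bit order field at bit 48 of the entry. */
#define ORDER_SHIFT       48
#define ORDER_MASK        (UINT64_C(0x3f) << ORDER_SHIFT)
#define NORMAL_PAGE_SHIFT 16 /* assumes 64kB normal pages */

/* Record the order (log2 of size) of a modified page, keeping the largest. */
static void record_dirty_order(uint64_t *entry, unsigned int order)
{
    unsigned int old = (unsigned int)((*entry & ORDER_MASK) >> ORDER_SHIFT);

    if (order > old)
        *entry = (*entry & ~ORDER_MASK) | ((uint64_t)order << ORDER_SHIFT);
}

/* Return the number of dirty normal pages and clear the order field. */
static uint64_t test_clear_dirty_npages(uint64_t *entry)
{
    unsigned int order = (unsigned int)((*entry & ORDER_MASK) >> ORDER_SHIFT);

    *entry &= ~ORDER_MASK;
    if (order == 0)
        return 0;                /* nothing was recorded */
    if (order <= NORMAL_PAGE_SHIFT)
        return 1;                /* a single normal page */
    return UINT64_C(1) << (order - NORMAL_PAGE_SHIFT);
}
```

Under these assumptions, removing a 16MB HPTE (order 24) that was written to reports 256 dirty normal pages rather than just the first one.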
|
|
|
if (rcbits & HPTE_R_C)
|
2017-10-26 13:39:19 +08:00
|
|
|
kvmppc_update_dirty_map(memslot, gfn,
|
|
|
|
kvmppc_actual_pgsz(hpte_v, hpte_r));
|
2011-12-12 20:33:07 +08:00
|
|
|
unlock_rmap(rmap);
|
|
|
|
}
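The unlinking logic in the function above can be sketched in isolation as follows. The names `struct link`, `chain_remove`, `CHAIN_PRESENT` and `CHAIN_INDEX` are hypothetical stand-ins for the revmap entry, the function itself, and the KVMPPC_RMAP_PRESENT and KVMPPC_RMAP_INDEX masks; their bit values are assumptions chosen for the example. Locking, real-mode address translation, and the reference/change bit handling are deliberately left out.

```c
#include <stdint.h>

/* Hypothetical stand-ins for KVMPPC_RMAP_PRESENT and KVMPPC_RMAP_INDEX. */
#define CHAIN_PRESENT (UINT64_C(1) << 63)
#define CHAIN_INDEX   UINT64_C(0xffffffff)

struct link {
    uint32_t forw; /* index of the next entry in the circular chain */
    uint32_t back; /* index of the previous entry */
};

/* Unlink entry idx from the circular chain whose head is stored in *chain. */
static void chain_remove(uint64_t *chain, struct link *links, uint32_t idx)
{
    struct link *cur = &links[idx];
    uint64_t head = *chain & CHAIN_INDEX;

    /* Bypass the entry in both directions. */
    links[cur->forw].back = cur->back;
    links[cur->back].forw = cur->forw;

    if (head == idx) {
        head = cur->forw;
        if (head == idx)
            /* idx pointed at itself: it was the only entry. */
            *chain &= ~(CHAIN_PRESENT | CHAIN_INDEX);
        else
            *chain = (*chain & ~CHAIN_INDEX) | head;
    }
}
```

Because the chain is circular and indexed rather than pointer-based, an entry whose `forw` equals its own index is the last one left, which is how the empty case is detected without a separate counter.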
|
|
|
|
|
KVM: PPC: Book3S HV: Restructure HPT entry creation code
This restructures the code that creates HPT (hashed page table)
entries so that it can be called in situations where we don't have a
struct vcpu pointer, only a struct kvm pointer. It also fixes a bug
where kvmppc_map_vrma() would corrupt the guest R4 value.
Most of the work of kvmppc_virtmode_h_enter is now done by a new
function, kvmppc_virtmode_do_h_enter, which itself calls another new
function, kvmppc_do_h_enter, which contains most of the old
kvmppc_h_enter. The new kvmppc_do_h_enter takes explicit arguments
for the place to return the HPTE index, the Linux page tables to use,
and whether it is being called in real mode, thus removing the need
for it to have the vcpu as an argument.
Currently kvmppc_map_vrma creates the VRMA (virtual real mode area)
HPTEs by calling kvmppc_virtmode_h_enter, which is designed primarily
to handle H_ENTER hcalls from the guest that need to pin a page of
memory. Since H_ENTER returns the index of the created HPTE in R4,
kvmppc_virtmode_h_enter updates the guest R4, corrupting the guest R4
in the case when it gets called from kvmppc_map_vrma on the first
VCPU_RUN ioctl. With this, kvmppc_map_vrma instead calls
kvmppc_virtmode_do_h_enter with the address of a dummy word as the
place to store the HPTE index, thus avoiding corrupting the guest R4.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-11-14 02:31:32 +08:00
|
|
|
long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
|
|
|
|
long pte_index, unsigned long pteh, unsigned long ptel,
|
|
|
|
pgd_t *pgdir, bool realmode, unsigned long *pte_idx_ret)
|
2011-06-29 08:22:05 +08:00
|
|
|
{
|
2011-12-12 20:31:00 +08:00
|
|
|
unsigned long i, pa, gpa, gfn, psize;
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:38:05 +08:00
|
|
|
unsigned long slot_fn, hva;
|
2014-06-11 16:16:06 +08:00
|
|
|
__be64 *hpte;
|
2011-12-12 20:27:39 +08:00
|
|
|
struct revmap_entry *rev;
|
2012-11-20 06:52:49 +08:00
|
|
|
unsigned long g_ptel;
|
2011-12-12 20:28:21 +08:00
|
|
|
struct kvm_memory_slot *memslot;
|
2015-03-30 13:09:13 +08:00
|
|
|
unsigned hpage_shift;
|
2016-04-29 21:25:38 +08:00
|
|
|
bool is_ci;
|
2011-12-12 20:33:07 +08:00
|
|
|
unsigned long *rmap;
|
2015-03-30 13:09:13 +08:00
|
|
|
pte_t *ptep;
|
2011-12-12 20:38:51 +08:00
|
|
|
unsigned int writing;
|
2011-12-12 20:38:05 +08:00
|
|
|
unsigned long mmu_seq;
|
2015-03-30 13:11:03 +08:00
|
|
|
unsigned long rcbits, irq_flags = 0;
|
2011-12-12 20:31:00 +08:00
|
|
|
|
2017-01-30 18:21:49 +08:00
|
|
|
if (kvm_is_radix(kvm))
|
|
|
|
return H_FUNCTION;
|
2017-09-11 13:29:45 +08:00
|
|
|
psize = kvmppc_actual_pgsz(pteh, ptel);
|
2011-12-12 20:31:00 +08:00
|
|
|
if (!psize)
|
2011-06-29 08:22:05 +08:00
|
|
|
return H_PARAMETER;
|
2011-12-12 20:38:51 +08:00
|
|
|
writing = hpte_is_writable(ptel);
|
KVM: PPC: Implement MMIO emulation support for Book3S HV guests
This provides the low-level support for MMIO emulation in Book3S HV
guests. When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page. Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page. An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.
When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:36:37 +08:00
|
|
|
pteh &= ~(HPTE_V_HVLOCK | HPTE_V_ABSENT | HPTE_V_VALID);
|
2012-11-20 06:52:49 +08:00
|
|
|
ptel &= ~HPTE_GR_RESERVED;
|
|
|
|
g_ptel = ptel;
|
2011-12-12 20:28:21 +08:00
|
|
|
|
2011-12-12 20:38:05 +08:00
|
|
|
/* used later to detect if we might have been invalidated */
|
|
|
|
mmu_seq = kvm->mmu_notifier_seq;
|
|
|
|
smp_rmb();
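The snapshot-then-recheck pattern that the comment above alludes to can be sketched with C11 atomics. This is a simplified illustration, not the kernel's mmu_notifier_retry() logic: the type `struct mapping_state` and the three functions are hypothetical, and the acquire load is used here as a stand-in that plays a role comparable to the plain read followed by smp_rmb().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state: a counter bumped by every invalidation. */
struct mapping_state {
    atomic_ulong invalidate_seq;
};

/* Take a snapshot before looking anything up in the page tables. */
static unsigned long snapshot_seq(struct mapping_state *s)
{
    /* Acquire ordering keeps later reads from moving before this load. */
    return atomic_load_explicit(&s->invalidate_seq, memory_order_acquire);
}

/* Called by whoever invalidates a mapping. */
static void invalidate(struct mapping_state *s)
{
    atomic_fetch_add_explicit(&s->invalidate_seq, 1, memory_order_release);
}

/* True if an invalidation may have happened since the snapshot. */
static bool needs_retry(struct mapping_state *s, unsigned long snap)
{
    return atomic_load_explicit(&s->invalidate_seq,
                                memory_order_acquire) != snap;
}
```

A caller would take the snapshot, perform the lookup, and then check `needs_retry()` before committing the result, discarding and redoing the lookup if the counter has moved.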
|
|
|
|
|
2011-12-12 20:31:00 +08:00
|
|
|
/* Find the memslot (if any) for this address */
|
|
|
|
gpa = (ptel & HPTE_R_RPN) & ~(psize - 1);
|
|
|
|
gfn = gpa >> PAGE_SHIFT;
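The two lines above derive the guest physical address and guest frame number from the second HPTE doubleword. A standalone sketch of the same arithmetic follows; `EX_HPTE_R_RPN` and `EX_PAGE_SHIFT` are assumed values chosen for illustration (a 64kB base page size is assumed), and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed values for illustration; the real ones come from kernel headers. */
#define EX_HPTE_R_RPN UINT64_C(0x0ffffffffffff000)
#define EX_PAGE_SHIFT 16 /* assumes a 64kB base page size */

/* Extract the guest physical address, aligned down to the HPTE page size. */
static uint64_t hpte_gpa(uint64_t ptel, uint64_t psize)
{
    /* psize is a power of two, so psize - 1 masks the in-page offset. */
    return (ptel & EX_HPTE_R_RPN) & ~(psize - 1);
}

/* Convert a guest physical address to a guest frame number. */
static uint64_t gpa_to_gfn(uint64_t gpa)
{
    return gpa >> EX_PAGE_SHIFT;
}
```

For a 16MB HPTE the low 24 bits of the address are cleared, so every access within the large page resolves to the same starting gfn.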
|
KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode
With HV KVM, some high-frequency hypercalls such as H_ENTER are handled
in real mode, and need to access the memslots array for the guest.
Accessing the memslots array is safe, because we hold the SRCU read
lock for the whole time that a guest vcpu is running. However, the
checks that kvm_memslots() does when lockdep is enabled are potentially
unsafe in real mode, when only the linear mapping is available.
Furthermore, kvm_memslots() can be called from a secondary CPU thread,
which is an offline CPU from the point of view of the host kernel,
and is not running the task which holds the SRCU read lock.
To avoid false positives in the checks in kvm_memslots(), and to avoid
possible side effects from doing the checks in real mode, this replaces
kvm_memslots() with kvm_memslots_raw() in all the places that execute
in real mode. kvm_memslots_raw() is a new function that is like
kvm_memslots() but uses rcu_dereference_raw_notrace() instead of
kvm_dereference_check().
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
2014-03-25 07:47:06 +08:00
|
|
|
memslot = __gfn_to_memslot(kvm_memslots_raw(kvm), gfn);
|
2011-12-12 20:36:37 +08:00
|
|
|
pa = 0;
|
2016-04-29 21:25:38 +08:00
|
|
|
is_ci = false;
|
2011-12-12 20:36:37 +08:00
|
|
|
rmap = NULL;
|
|
|
|
if (!(memslot && !(memslot->flags & KVM_MEMSLOT_INVALID))) {
|
|
|
|
/* Emulated MMIO - mark this with key=31 */
|
|
|
|
pteh |= HPTE_V_ABSENT;
|
|
|
|
ptel |= HPTE_R_KEY_HI | HPTE_R_KEY_LO;
|
|
|
|
goto do_insert;
|
|
|
|
}
|
2011-12-12 20:31:41 +08:00
|
|
|
|
|
|
|
/* Check if the requested page fits entirely in the memslot. */
|
|
|
|
if (!slot_is_aligned(memslot, psize))
|
|
|
|
return H_PARAMETER;
|
2011-12-12 20:31:00 +08:00
|
|
|
slot_fn = gfn - memslot->base_gfn;
|
2012-08-01 17:03:28 +08:00
|
|
|
rmap = &memslot->arch.rmap[slot_fn];
|
2011-12-12 20:31:00 +08:00
|
|
|
|
2014-12-03 10:30:38 +08:00
|
|
|
/* Translate to host virtual address */
|
|
|
|
hva = __gfn_to_hva_memslot(memslot, gfn);
|
2015-03-30 13:11:03 +08:00
|
|
|
/*
|
|
|
|
* If we had a page table change after lookup, we would
|
|
|
|
* retry via mmu_notifier_retry.
|
|
|
|
*/
|
2017-07-27 14:24:53 +08:00
|
|
|
if (!realmode)
|
2015-03-30 13:11:03 +08:00
|
|
|
local_irq_save(irq_flags);
|
2017-07-27 14:24:53 +08:00
|
|
|
/*
|
|
|
|
* If called in real mode we have MSR_EE = 0. Otherwise
|
|
|
|
* we disable irq above.
|
|
|
|
*/
|
|
|
|
ptep = __find_linux_pte(pgdir, hva, NULL, &hpage_shift);
|
2015-03-30 13:09:13 +08:00
|
|
|
if (ptep) {
|
|
|
|
pte_t pte;
|
|
|
|
unsigned int host_pte_size;
|
2012-11-14 02:31:32 +08:00
|
|
|
|
2015-03-30 13:09:13 +08:00
|
|
|
if (hpage_shift)
|
|
|
|
host_pte_size = 1ul << hpage_shift;
|
|
|
|
else
|
|
|
|
host_pte_size = PAGE_SIZE;
|
|
|
|
/*
|
|
|
|
* The guest page size should always be <= the host
|
|
|
|
* page size when the host is using hugepages.
|
|
|
|
*/
|
2015-03-30 13:11:03 +08:00
|
|
|
if (host_pte_size < psize) {
|
|
|
|
if (!realmode)
|
|
|
|
local_irq_restore(irq_flags);
|
2015-03-30 13:09:13 +08:00
|
|
|
return H_PARAMETER;
|
2015-03-30 13:11:03 +08:00
|
|
|
}
|
2015-03-30 13:11:04 +08:00
|
|
|
pte = kvmppc_read_update_linux_pte(ptep, writing);
|
2015-03-30 13:09:13 +08:00
|
|
|
if (pte_present(pte) && !pte_protnone(pte)) {
|
2017-03-10 08:16:39 +08:00
|
|
|
if (writing && !__pte_write(pte))
|
2015-03-30 13:09:13 +08:00
|
|
|
/* make the actual HPTE be read-only */
|
|
|
|
ptel = hpte_make_readonly(ptel);
|
2016-04-29 21:25:38 +08:00
|
|
|
is_ci = pte_ci(pte);
|
2015-03-30 13:09:13 +08:00
|
|
|
pa = pte_pfn(pte) << PAGE_SHIFT;
|
|
|
|
pa |= hva & (host_pte_size - 1);
|
|
|
|
pa |= gpa & ~PAGE_MASK;
|
|
|
|
}
|
|
|
|
}
|
2015-03-30 13:11:03 +08:00
|
|
|
if (!realmode)
|
|
|
|
local_irq_restore(irq_flags);
|
2011-12-12 20:31:00 +08:00
|
|
|
|
2017-08-01 05:39:59 +08:00
|
|
|
ptel &= HPTE_R_KEY | HPTE_R_PP0 | (psize-1);
|
2011-12-12 20:31:00 +08:00
|
|
|
ptel |= pa;
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:38:05 +08:00
|
|
|
|
|
|
|
if (pa)
|
|
|
|
pteh |= HPTE_V_VALID;
|
2016-11-04 13:55:11 +08:00
|
|
|
else {
|
2011-12-12 20:38:05 +08:00
|
|
|
pteh |= HPTE_V_ABSENT;
|
2016-11-04 13:55:11 +08:00
|
|
|
ptel &= ~(HPTE_R_KEY_HI | HPTE_R_KEY_LO);
|
|
|
|
}
|
2011-12-12 20:31:00 +08:00
|
|
|
|
2016-04-29 21:25:38 +08:00
|
|
|
/* If we had a host pte mapping then check WIMG */
|
|
|
|
if (ptep && !hpte_cache_flags_ok(ptel, is_ci)) {
|
|
|
|
if (is_ci)
|
2011-12-12 20:32:27 +08:00
|
|
|
return H_PARAMETER;
|
|
|
|
/*
|
|
|
|
* Allow guest to map emulated device memory as
|
|
|
|
* uncacheable, but actually make it cacheable.
|
|
|
|
*/
|
|
|
|
ptel &= ~(HPTE_R_W|HPTE_R_I|HPTE_R_G);
|
|
|
|
ptel |= HPTE_R_M;
|
|
|
|
}
|
2011-12-12 20:30:16 +08:00
|
|
|
|
2011-12-12 20:38:05 +08:00
|
|
|
/* Find and lock the HPTEG slot to use */
|
KVM: PPC: Implement MMIO emulation support for Book3S HV guests
This provides the low-level support for MMIO emulation in Book3S HV
guests. When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page. Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page. An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.
When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:36:37 +08:00
|
|
|
do_insert:
|
2016-12-20 13:49:01 +08:00
|
|
|
if (pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt))
|
2011-06-29 08:22:05 +08:00
|
|
|
return H_PARAMETER;
|
|
|
|
if (likely((flags & H_EXACT) == 0)) {
|
|
|
|
pte_index &= ~7UL;
|
2016-12-20 13:49:00 +08:00
|
|
|
hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
|
2011-12-12 20:30:16 +08:00
|
|
|
for (i = 0; i < 8; ++i) {
|
2014-06-11 16:16:06 +08:00
|
|
|
if ((be64_to_cpu(*hpte) & HPTE_V_VALID) == 0 &&
|
2011-12-12 20:36:37 +08:00
|
|
|
try_lock_hpte(hpte, HPTE_V_HVLOCK | HPTE_V_VALID |
|
|
|
|
HPTE_V_ABSENT))
|
2011-06-29 08:22:05 +08:00
|
|
|
break;
|
|
|
|
hpte += 2;
|
|
|
|
}
|
2011-12-12 20:30:16 +08:00
|
|
|
if (i == 8) {
|
|
|
|
/*
|
|
|
|
* Since try_lock_hpte doesn't retry (not even stdcx.
|
|
|
|
* failures), it could be that there is a free slot
|
|
|
|
* but we transiently failed to lock it. Try again,
|
|
|
|
* actually locking each slot and checking it.
|
|
|
|
*/
|
|
|
|
hpte -= 16;
|
|
|
|
for (i = 0; i < 8; ++i) {
|
2014-06-11 16:16:06 +08:00
|
|
|
u64 pte;
|
2011-12-12 20:30:16 +08:00
|
|
|
while (!try_lock_hpte(hpte, HPTE_V_HVLOCK))
|
|
|
|
cpu_relax();
|
2015-03-20 17:39:43 +08:00
|
|
|
pte = be64_to_cpu(hpte[0]);
|
2014-06-11 16:16:06 +08:00
|
|
|
if (!(pte & (HPTE_V_VALID | HPTE_V_ABSENT)))
|
2011-12-12 20:30:16 +08:00
|
|
|
break;
|
2015-03-20 17:39:43 +08:00
|
|
|
__unlock_hpte(hpte, pte);
|
2011-12-12 20:30:16 +08:00
|
|
|
hpte += 2;
|
|
|
|
}
|
|
|
|
if (i == 8)
|
|
|
|
return H_PTEG_FULL;
|
|
|
|
}
|
2011-12-12 20:27:39 +08:00
|
|
|
pte_index += i;
|
2011-06-29 08:22:05 +08:00
|
|
|
} else {
|
2016-12-20 13:49:00 +08:00
|
|
|
hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
|
2011-12-12 20:36:37 +08:00
|
|
|
if (!try_lock_hpte(hpte, HPTE_V_HVLOCK | HPTE_V_VALID |
|
|
|
|
HPTE_V_ABSENT)) {
|
2011-12-12 20:30:16 +08:00
|
|
|
/* Lock the slot and check again */
|
2014-06-11 16:16:06 +08:00
|
|
|
u64 pte;
|
|
|
|
|
2011-12-12 20:30:16 +08:00
|
|
|
while (!try_lock_hpte(hpte, HPTE_V_HVLOCK))
|
|
|
|
cpu_relax();
|
2015-03-20 17:39:43 +08:00
|
|
|
pte = be64_to_cpu(hpte[0]);
|
2014-06-11 16:16:06 +08:00
|
|
|
if (pte & (HPTE_V_VALID | HPTE_V_ABSENT)) {
|
2015-03-20 17:39:43 +08:00
|
|
|
__unlock_hpte(hpte, pte);
|
2011-12-12 20:30:16 +08:00
|
|
|
return H_PTEG_FULL;
|
|
|
|
}
|
|
|
|
}
|
2011-06-29 08:22:05 +08:00
|
|
|
}
|
2011-12-12 20:27:39 +08:00
|
|
|
|
|
|
|
/* Save away the guest's idea of the second HPTE dword */
|
2016-12-20 13:49:00 +08:00
|
|
|
rev = &kvm->arch.hpt.rev[pte_index];
|
2011-12-12 20:33:07 +08:00
|
|
|
if (realmode)
|
|
|
|
rev = real_vmalloc_addr(rev);
|
2012-11-20 06:52:49 +08:00
|
|
|
if (rev) {
|
2011-12-12 20:27:39 +08:00
|
|
|
rev->guest_rpte = g_ptel;
|
2012-11-20 06:52:49 +08:00
|
|
|
note_hpte_modification(kvm, rev);
|
|
|
|
}
|
2011-12-12 20:33:07 +08:00
|
|
|
|
|
|
|
/* Link HPTE into reverse-map chain */
|
2011-12-12 20:36:37 +08:00
|
|
|
if (pteh & HPTE_V_VALID) {
|
|
|
|
if (realmode)
|
|
|
|
rmap = real_vmalloc_addr(rmap);
|
|
|
|
lock_rmap(rmap);
|
2011-12-12 20:38:05 +08:00
|
|
|
/* Check for pending invalidations under the rmap chain lock */
|
2014-12-03 10:30:38 +08:00
|
|
|
if (mmu_notifier_retry(kvm, mmu_seq)) {
|
2011-12-12 20:38:05 +08:00
|
|
|
/* inval in progress, write a non-present HPTE */
|
|
|
|
pteh |= HPTE_V_ABSENT;
|
|
|
|
pteh &= ~HPTE_V_VALID;
|
2016-11-04 13:55:11 +08:00
|
|
|
ptel &= ~(HPTE_R_KEY_HI | HPTE_R_KEY_LO);
|
2011-12-12 20:38:05 +08:00
|
|
|
unlock_rmap(rmap);
|
|
|
|
} else {
|
|
|
|
kvmppc_add_revmap_chain(kvm, rev, rmap, pte_index,
|
|
|
|
realmode);
|
2011-12-15 10:02:02 +08:00
|
|
|
/* Only set R/C in real HPTE if already set in *rmap */
|
|
|
|
rcbits = *rmap >> KVMPPC_RMAP_RC_SHIFT;
|
|
|
|
ptel &= rcbits | ~(HPTE_R_R | HPTE_R_C);
|
2011-12-12 20:38:05 +08:00
|
|
|
}
|
2011-12-12 20:36:37 +08:00
|
|
|
}
|
2011-12-12 20:33:07 +08:00
|
|
|
|
2016-11-16 13:57:24 +08:00
|
|
|
/* Convert to new format on P9 */
|
|
|
|
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
|
|
|
|
ptel = hpte_old_to_new_r(pteh, ptel);
|
|
|
|
pteh = hpte_old_to_new_v(pteh);
|
|
|
|
}
|
2014-06-11 16:16:06 +08:00
|
|
|
hpte[1] = cpu_to_be64(ptel);
|
2011-12-12 20:33:07 +08:00
|
|
|
|
|
|
|
/* Write the first HPTE dword, unlocking the HPTE and making it valid */
|
2011-06-29 08:22:05 +08:00
|
|
|
eieio();
|
2015-03-20 17:39:43 +08:00
|
|
|
__unlock_hpte(hpte, pteh);
|
2011-06-29 08:22:05 +08:00
|
|
|
asm volatile("ptesync" : : : "memory");
|
2011-12-12 20:33:07 +08:00
|
|
|
|
KVM: PPC: Book3S HV: Restructure HPT entry creation code
This restructures the code that creates HPT (hashed page table)
entries so that it can be called in situations where we don't have a
struct vcpu pointer, only a struct kvm pointer. It also fixes a bug
where kvmppc_map_vrma() would corrupt the guest R4 value.
Most of the work of kvmppc_virtmode_h_enter is now done by a new
function, kvmppc_virtmode_do_h_enter, which itself calls another new
function, kvmppc_do_h_enter, which contains most of the old
kvmppc_h_enter. The new kvmppc_do_h_enter takes explicit arguments
for the place to return the HPTE index, the Linux page tables to use,
and whether it is being called in real mode, thus removing the need
for it to have the vcpu as an argument.
Currently kvmppc_map_vrma creates the VRMA (virtual real mode area)
HPTEs by calling kvmppc_virtmode_h_enter, which is designed primarily
to handle H_ENTER hcalls from the guest that need to pin a page of
memory. Since H_ENTER returns the index of the created HPTE in R4,
kvmppc_virtmode_h_enter updates the guest R4, corrupting the guest R4
in the case when it gets called from kvmppc_map_vrma on the first
VCPU_RUN ioctl. With this, kvmppc_map_vrma instead calls
kvmppc_virtmode_do_h_enter with the address of a dummy word as the
place to store the HPTE index, thus avoiding corrupting the guest R4.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-11-14 02:31:32 +08:00
|
|
|
*pte_idx_ret = pte_index;
|
2011-06-29 08:22:05 +08:00
|
|
|
return H_SUCCESS;
|
|
|
|
}
|
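A minimal user-space sketch of the slot search that kvmppc_do_h_enter performs when H_EXACT is not set: the index is rounded down to the start of its 8-entry group, and the first entry with neither the valid nor the absent bit set is taken. The bit values, the sketch_find_free_slot name and the flat array are assumptions for illustration only; the HPTE_V_HVLOCK locking and the big-endian conversion done by the real code are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values; the real definitions live in the kernel headers. */
#define SK_V_VALID   0x1ULL
#define SK_V_ABSENT  0x20ULL

/*
 * Each HPTE is two 64-bit doublewords, so entry n starts at hpt[2 * n].
 * Returns the index of the first free entry in the group containing
 * pte_index, or -1 if all eight entries are in use (the case where the
 * kernel code returns H_PTEG_FULL).
 */
static long sketch_find_free_slot(const uint64_t *hpt, unsigned long pte_index)
{
	unsigned long base = pte_index & ~7UL;	/* start of the 8-entry group */
	int i;

	assert(hpt != NULL);
	for (i = 0; i < 8; ++i) {
		uint64_t v = hpt[2 * (base + i)];

		/* An entry is free only if it is neither valid nor absent. */
		if (!(v & (SK_V_VALID | SK_V_ABSENT)))
			return (long)(base + i);
	}
	return -1;
}
```

The real function makes two passes over the group: a first pass with a non-blocking try_lock_hpte, and, if that finds nothing, a second pass that spins until each entry is locked before checking it. The sketch collapses both into one unlocked scan.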
2012-11-14 02:31:32 +08:00
|
|
|
EXPORT_SYMBOL_GPL(kvmppc_do_h_enter);
|
|
|
|
|
|
|
|
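The physical address that kvmppc_do_h_enter ORs into ptel is built from three pieces: the pfn from the Linux PTE, the offset of the base page inside a host huge page, and the byte offset inside the base page. The following is a worked sketch of that arithmetic, assuming 64 KiB base pages and assuming that the pfn of a huge-page PTE is the first pfn of the huge page; the SK_ names and sketch_compose_pa are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 16			/* assumed 64 KiB base pages */
#define SK_PAGE_SIZE  (1ULL << SK_PAGE_SHIFT)
#define SK_PAGE_MASK  (~(SK_PAGE_SIZE - 1))

/* hpage_shift == 0 means the host mapping is a normal base page. */
static uint64_t sketch_compose_pa(uint64_t pfn, uint64_t hva, uint64_t gpa,
				  unsigned int hpage_shift)
{
	uint64_t host_pte_size = hpage_shift ? 1ULL << hpage_shift : SK_PAGE_SIZE;
	uint64_t pa;

	assert(hpage_shift == 0 || hpage_shift >= SK_PAGE_SHIFT);
	pa = pfn << SK_PAGE_SHIFT;		/* start of the host mapping */
	pa |= hva & (host_pte_size - 1);	/* page offset inside a huge page */
	pa |= gpa & ~SK_PAGE_MASK;		/* byte offset inside the base page */
	return pa;
}
```

With a page-aligned hva the second term is zero for base pages, so it only contributes when the host mapping is a huge page.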
long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
|
|
|
|
long pte_index, unsigned long pteh, unsigned long ptel)
|
|
|
|
{
|
|
|
|
return kvmppc_do_h_enter(vcpu->kvm, flags, pte_index, pteh, ptel,
|
2018-05-07 14:20:07 +08:00
|
|
|
vcpu->arch.pgdir, true,
|
|
|
|
&vcpu->arch.regs.gpr[4]);
|
2012-11-14 02:31:32 +08:00
|
|
|
}
|
2011-06-29 08:22:05 +08:00
|
|
|
|
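kvmppc_h_enter passes &vcpu->arch.regs.gpr[4] as the place to store the HPTE index, while callers that are not servicing a guest hcall can pass the address of a dummy word instead. A minimal sketch of that out-parameter design follows; every sketch_ name and the struct layout are hypothetical stand-ins, not the kernel's types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the vcpu register file; only gpr[4] matters. */
struct sketch_vcpu {
	unsigned long gpr[32];
};

/* Core routine: reports the chosen index through idx_ret, never via a vcpu. */
static long sketch_do_enter(unsigned long pte_index, unsigned long *idx_ret)
{
	assert(idx_ret != NULL);
	*idx_ret = pte_index;
	return 0;	/* stands in for H_SUCCESS */
}

/* Hcall path: the guest expects the index in R4, so point idx_ret at it. */
static long sketch_h_enter(struct sketch_vcpu *vcpu, unsigned long pte_index)
{
	return sketch_do_enter(pte_index, &vcpu->gpr[4]);
}

/* Internal path (e.g. VRMA setup): discard the index, leave R4 alone. */
static long sketch_map_internal(unsigned long pte_index)
{
	unsigned long dummy;

	return sketch_do_enter(pte_index, &dummy);
}
```

Because the core routine never sees the vcpu, an internal caller cannot clobber guest registers by accident; only the hcall wrapper decides where the index lands.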
2013-08-07 00:01:51 +08:00
|
|
|
#ifdef __BIG_ENDIAN__
|
2011-06-29 08:22:05 +08:00
|
|
|
#define LOCK_TOKEN (*(u32 *)(&get_paca()->lock_token))
|
2013-08-07 00:01:51 +08:00
|
|
|
#else
|
|
|
|
#define LOCK_TOKEN (*(u32 *)(&get_paca()->paca_index))
|
|
|
|
#endif
|
2011-06-29 08:22:05 +08:00
|
|
|
|
KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries
This keeps a per-vcpu cache of recently page-faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example looking up the HPT, locking the HPTE
twice and searching for the MMIO gfn in the memslots, and call
kvmppc_hv_emulate_mmio() directly.
In the current implementation, we limit the size of the cache to four
entries. We think that is enough to cover the high-frequency MMIO HPTEs
in most cases. For example, consider virtio devices: for virtio
legacy devices, one HPTE could handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough; for virtio modern devices, we always need one HPTE to handle
notifications for each device, because a modern device uses an 8M MMIO
register to notify the host instead of a Port IO register. Typically the
system's configuration should not exceed four virtio devices per
vcpu, so four cache entries are also enough in this case. Of course, if
needed, we could also turn the macro into a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
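A small fixed-size cache of the kind this commit message describes might look like the sketch below. This is a simplified user-space illustration, not the kernel's data structure: the `toy_` names, the field layout, and the round-robin replacement policy are all assumptions. The `generation` parameter is modelled on the `mmio_update` counter that the code in this file increments when an MMIO HPTE is removed or changed; how the real cache consumes that counter is likewise assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_SIZE 4	/* the commit limits the cache to four entries */

struct toy_cache_entry {
	uint64_t eaddr;		/* faulting effective address (page-aligned) */
	uint64_t gpa;		/* cached guest physical address */
	uint64_t generation;	/* value of the global counter when filled */
	int valid;
};

struct toy_cache {
	struct toy_cache_entry entries[TOY_CACHE_SIZE];
	unsigned int next;	/* round-robin replacement index */
};

/* Return the matching entry, or NULL on a miss or a stale entry. */
static struct toy_cache_entry *toy_cache_lookup(struct toy_cache *c,
						uint64_t eaddr,
						uint64_t generation)
{
	for (size_t i = 0; i < TOY_CACHE_SIZE; i++) {
		struct toy_cache_entry *e = &c->entries[i];

		if (e->valid && e->eaddr == eaddr &&
		    e->generation == generation)
			return e;
	}
	return NULL;
}

/* Record a translation, overwriting the slot at the round-robin index. */
static void toy_cache_insert(struct toy_cache *c, uint64_t eaddr,
			     uint64_t gpa, uint64_t generation)
{
	struct toy_cache_entry *e;

	assert(c->next < TOY_CACHE_SIZE);
	e = &c->entries[c->next];
	e->eaddr = eaddr;
	e->gpa = gpa;
	e->generation = generation;
	e->valid = 1;
	c->next = (c->next + 1) % TOY_CACHE_SIZE;
}
```

Comparing a stored generation against a global counter lets every cached entry be invalidated at once by a single increment, without walking each vcpu's cache.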
2016-11-04 13:55:12 +08:00
|
|
|
static inline int is_mmio_hpte(unsigned long v, unsigned long r)
|
|
|
|
{
|
|
|
|
return ((v & HPTE_V_ABSENT) &&
|
|
|
|
(r & (HPTE_R_KEY_HI | HPTE_R_KEY_LO)) ==
|
|
|
|
(HPTE_R_KEY_HI | HPTE_R_KEY_LO));
|
|
|
|
}
|
|
|
|
|
2019-09-24 11:52:53 +08:00
|
|
|
static inline void fixup_tlbie_lpid(unsigned long rb_value, unsigned long lpid)
|
|
|
|
{
|
|
|
|
|
|
|
|
if (cpu_has_feature(CPU_FTR_P9_TLBIE_ERAT_BUG)) {
|
|
|
|
/* Radix flush for a hash guest */
|
|
|
|
|
|
|
|
unsigned long rb,rs,prs,r,ric;
|
|
|
|
|
|
|
|
rb = PPC_BIT(52); /* IS = 2 */
|
|
|
|
rs = 0; /* lpid = 0 */
|
|
|
|
prs = 0; /* partition scoped */
|
|
|
|
r = 1; /* radix format */
|
|
|
|
ric = 0; /* RIC_FLUSH_TLB */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Need the extra ptesync to make sure we don't
|
|
|
|
* re-order the tlbie
|
|
|
|
*/
|
|
|
|
asm volatile("ptesync": : :"memory");
|
|
|
|
asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
|
|
|
|
: : "r"(rb), "i"(r), "i"(prs),
|
|
|
|
"i"(ric), "r"(rs) : "memory");
|
|
|
|
}
|
|
|
|
|
|
|
|
if (cpu_has_feature(CPU_FTR_P9_TLBIE_STQ_BUG)) {
|
|
|
|
asm volatile("ptesync": : :"memory");
|
|
|
|
asm volatile(PPC_TLBIE_5(%0,%1,0,0,0) : :
|
|
|
|
"r" (rb_value), "r" (lpid));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-07-08 18:08:25 +08:00
|
|
|
static void do_tlbies(struct kvm *kvm, unsigned long *rbvalues,
|
|
|
|
long npages, int global, bool need_sync)
|
|
|
|
{
|
|
|
|
long i;
|
|
|
|
|
2016-11-18 05:28:51 +08:00
|
|
|
/*
|
|
|
|
* We use the POWER9 5-operand versions of tlbie and tlbiel here.
|
|
|
|
* Since we are using RIC=0 PRS=0 R=0, and P7/P8 tlbiel ignores
|
|
|
|
* the RS field, this is backwards-compatible with P7 and P8.
|
|
|
|
*/
|
2013-07-08 18:08:25 +08:00
|
|
|
if (global) {
|
|
|
|
if (need_sync)
|
|
|
|
asm volatile("ptesync" : : : "memory");
|
2017-04-11 13:23:25 +08:00
|
|
|
for (i = 0; i < npages; ++i) {
|
2016-11-18 05:28:51 +08:00
|
|
|
asm volatile(PPC_TLBIE_5(%0,%1,0,0,0) : :
|
2013-07-08 18:08:25 +08:00
|
|
|
"r" (rbvalues[i]), "r" (kvm->arch.lpid));
|
2017-04-11 13:23:25 +08:00
|
|
|
}
|
2018-03-23 12:56:27 +08:00
|
|
|
|
2019-09-24 11:52:53 +08:00
|
|
|
fixup_tlbie_lpid(rbvalues[i - 1], kvm->arch.lpid);
|
2013-07-08 18:08:25 +08:00
|
|
|
asm volatile("eieio; tlbsync; ptesync" : : : "memory");
|
|
|
|
} else {
|
|
|
|
if (need_sync)
|
|
|
|
asm volatile("ptesync" : : : "memory");
|
2017-04-11 13:23:25 +08:00
|
|
|
for (i = 0; i < npages; ++i) {
|
2016-11-18 05:28:51 +08:00
|
|
|
asm volatile(PPC_TLBIEL(%0,%1,0,0,0) : :
|
|
|
|
"r" (rbvalues[i]), "r" (0));
|
2017-04-11 13:23:25 +08:00
|
|
|
}
|
2013-07-08 18:08:25 +08:00
|
|
|
asm volatile("ptesync" : : : "memory");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-11-20 06:55:44 +08:00
|
|
|
long kvmppc_do_h_remove(struct kvm *kvm, unsigned long flags,
|
|
|
|
unsigned long pte_index, unsigned long avpn,
|
|
|
|
unsigned long *hpret)
|
2011-06-29 08:22:05 +08:00
|
|
|
{
|
2014-06-11 16:16:06 +08:00
|
|
|
__be64 *hpte;
|
2011-06-29 08:22:05 +08:00
|
|
|
unsigned long v, r, rb;
|
2011-12-15 10:01:10 +08:00
|
|
|
struct revmap_entry *rev;
|
2016-11-16 13:57:24 +08:00
|
|
|
u64 pte, orig_pte, pte_r;
|
2011-06-29 08:22:05 +08:00
|
|
|
|
2017-01-30 18:21:49 +08:00
|
|
|
if (kvm_is_radix(kvm))
|
|
|
|
return H_FUNCTION;
|
2016-12-20 13:49:01 +08:00
|
|
|
if (pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt))
|
2011-06-29 08:22:05 +08:00
|
|
|
return H_PARAMETER;
|
2016-12-20 13:49:00 +08:00
|
|
|
hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
|
2011-12-12 20:30:16 +08:00
|
|
|
while (!try_lock_hpte(hpte, HPTE_V_HVLOCK))
|
2011-06-29 08:22:05 +08:00
|
|
|
cpu_relax();
|
2016-11-16 13:57:24 +08:00
|
|
|
pte = orig_pte = be64_to_cpu(hpte[0]);
|
|
|
|
pte_r = be64_to_cpu(hpte[1]);
|
|
|
|
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
|
|
|
|
pte = hpte_new_to_old_v(pte, pte_r);
|
|
|
|
pte_r = hpte_new_to_old_r(pte_r);
|
|
|
|
}
|
2014-06-11 16:16:06 +08:00
|
|
|
if ((pte & (HPTE_V_ABSENT | HPTE_V_VALID)) == 0 ||
|
|
|
|
((flags & H_AVPN) && (pte & ~0x7fUL) != avpn) ||
|
|
|
|
((flags & H_ANDCOND) && (pte & avpn) != 0)) {
|
2016-11-16 13:57:24 +08:00
|
|
|
__unlock_hpte(hpte, orig_pte);
|
2011-06-29 08:22:05 +08:00
|
|
|
return H_NOT_FOUND;
|
|
|
|
}
|
2011-12-15 10:01:10 +08:00
|
|
|
|
2016-12-20 13:49:00 +08:00
|
|
|
rev = real_vmalloc_addr(&kvm->arch.hpt.rev[pte_index]);
|
2014-06-11 16:16:06 +08:00
|
|
|
v = pte & ~HPTE_V_HVLOCK;
|
2011-12-15 10:01:10 +08:00
|
|
|
if (v & HPTE_V_VALID) {
|
2014-06-11 16:16:06 +08:00
|
|
|
hpte[0] &= ~cpu_to_be64(HPTE_V_VALID);
|
2016-11-16 13:57:24 +08:00
|
|
|
rb = compute_tlbie_rb(v, pte_r, pte_index);
|
2017-11-06 20:27:44 +08:00
|
|
|
do_tlbies(kvm, &rb, 1, global_invalidates(kvm), true);
|
2015-06-24 19:18:05 +08:00
|
|
|
/*
|
|
|
|
* The reference (R) and change (C) bits in a HPT
|
|
|
|
* entry can be set by hardware at any time up until
|
|
|
|
* the HPTE is invalidated and the TLB invalidation
|
|
|
|
* sequence has completed. This means that when
|
|
|
|
* removing a HPTE, we need to re-read the HPTE after
|
|
|
|
* the invalidation sequence has completed in order to
|
|
|
|
* obtain reliable values of R and C.
|
|
|
|
*/
|
|
|
|
remove_revmap_chain(kvm, pte_index, rev, v,
|
|
|
|
be64_to_cpu(hpte[1]));
|
2011-06-29 08:22:05 +08:00
|
|
|
}
|
2012-11-20 06:52:49 +08:00
|
|
|
r = rev->guest_rpte & ~HPTE_GR_RESERVED;
|
|
|
|
note_hpte_modification(kvm, rev);
|
2011-12-15 10:01:10 +08:00
|
|
|
unlock_hpte(hpte, 0);
|
|
|
|
|
2016-11-16 13:57:24 +08:00
|
|
|
if (is_mmio_hpte(v, pte_r))
|
2016-11-04 13:55:12 +08:00
|
|
|
atomic64_inc(&kvm->arch.mmio_update);
|
|
|
|
|
2015-05-18 12:10:54 +08:00
|
|
|
if (v & HPTE_V_ABSENT)
|
|
|
|
v = (v & ~HPTE_V_ABSENT) | HPTE_V_VALID;
|
2012-11-20 06:55:44 +08:00
|
|
|
hpret[0] = v;
|
|
|
|
hpret[1] = r;
|
2011-06-29 08:22:05 +08:00
|
|
|
return H_SUCCESS;
|
|
|
|
}
|
2012-11-20 06:55:44 +08:00
|
|
|
EXPORT_SYMBOL_GPL(kvmppc_do_h_remove);
|
|
|
|
|
|
|
|
long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
|
|
|
|
unsigned long pte_index, unsigned long avpn)
|
|
|
|
{
|
|
|
|
return kvmppc_do_h_remove(vcpu->kvm, flags, pte_index, avpn,
|
2018-05-07 14:20:07 +08:00
|
|
|
&vcpu->arch.regs.gpr[4]);
|
2012-11-20 06:55:44 +08:00
|
|
|
}
|
2011-06-29 08:22:05 +08:00
|
|
|
|
|
|
|
long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = vcpu->kvm;
|
2018-05-07 14:20:07 +08:00
|
|
|
unsigned long *args = &vcpu->arch.regs.gpr[4];
|
2014-06-11 16:16:06 +08:00
|
|
|
__be64 *hp, *hptes[4];
|
|
|
|
unsigned long tlbrb[4];
|
2011-12-15 10:01:10 +08:00
|
|
|
long int i, j, k, n, found, indexes[4];
|
|
|
|
unsigned long flags, req, pte_index, rcbits;
|
2013-07-08 18:08:25 +08:00
|
|
|
int global;
|
2011-06-29 08:22:05 +08:00
|
|
|
long int ret = H_SUCCESS;
|
2011-12-15 10:01:10 +08:00
|
|
|
struct revmap_entry *rev, *revs[4];
|
2016-11-04 13:55:12 +08:00
|
|
|
u64 hp0, hp1;
|
2011-06-29 08:22:05 +08:00
|
|
|
|
2017-01-30 18:21:49 +08:00
|
|
|
if (kvm_is_radix(kvm))
|
|
|
|
return H_FUNCTION;
|
2017-11-06 20:27:44 +08:00
|
|
|
global = global_invalidates(kvm);
|
2011-12-15 10:01:10 +08:00
|
|
|
for (i = 0; i < 4 && ret == H_SUCCESS; ) {
|
|
|
|
n = 0;
|
|
|
|
for (; i < 4; ++i) {
|
|
|
|
j = i * 2;
|
|
|
|
pte_index = args[j];
|
|
|
|
flags = pte_index >> 56;
|
|
|
|
pte_index &= ((1ul << 56) - 1);
|
|
|
|
req = flags >> 6;
|
|
|
|
flags &= 3;
|
|
|
|
if (req == 3) { /* no more requests */
|
|
|
|
i = 4;
|
2011-06-29 08:22:05 +08:00
|
|
|
break;
|
2011-12-15 10:01:10 +08:00
|
|
|
}
|
KVM: PPC: Book3S HV: Make the guest hash table size configurable
This adds a new ioctl to enable userspace to control the size of the guest
hashed page table (HPT) and to clear it out when resetting the guest.
The KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter
a pointer to a u32 containing the desired order of the HPT (log base 2
of the size in bytes), which is updated on successful return to the
actual order of the HPT which was allocated.
There must be no vcpus running at the time of this ioctl. To enforce
this, we now keep a count of the number of vcpus running in
kvm->arch.vcpus_running.
If the ioctl is called when a HPT has already been allocated, we don't
reallocate the HPT but just clear it out. We first clear the
kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold
the kvm->lock mutex, it will prevent any vcpus from starting to run until
we're done, and (b) it means that the first vcpu to run after we're done
will re-establish the VRMA if necessary.
If userspace doesn't call this ioctl before running the first vcpu, the
kernel will allocate a default-sized HPT at that point. We do it then
rather than when creating the VM, as the code did previously, so that
userspace has a chance to do the ioctl if it wants.
When allocating the HPT, we can allocate either from the kernel page
allocator, or from the preallocated pool. If userspace is asking for
a different size from the preallocated HPTs, we first try to allocate
using the kernel page allocator. Then we try to allocate from the
preallocated pool, and then if that fails, we try allocating decreasing
sizes from the kernel page allocator, down to the minimum size allowed
(256kB). Note that the kernel page allocator limits allocations to
1 << CONFIG_FORCE_MAX_ZONEORDER pages, which by default corresponds to
16MB (on 64-bit powerpc, at least).
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix module compilation]
Signed-off-by: Alexander Graf <agraf@suse.de>
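The size arithmetic in this commit message can be made concrete with a short sketch. It assumes 16-byte HPTEs, which matches the `pte_index << 4` address computation used throughout this file; the `toy_` names and the constant are hypothetical and do not come from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MIN_HPT_ORDER 18	/* 256kB, the minimum size in the text */

/* Number of 16-byte HPTEs in a table of 2^order bytes. */
static uint64_t toy_hpt_npte(unsigned int order)
{
	assert(order >= TOY_MIN_HPT_ORDER && order < 64);
	return UINT64_C(1) << (order - 4);
}

/* Byte offset of an HPTE, matching the (pte_index << 4) arithmetic. */
static uint64_t toy_hpte_offset(uint64_t pte_index)
{
	return pte_index << 4;
}

/* Bounds check in the style used by the hcall handlers in this file. */
static int toy_pte_index_ok(uint64_t pte_index, unsigned int order)
{
	return pte_index < toy_hpt_npte(order);
}
```

Under these assumptions a minimum-size table (order 18) holds 16384 entries and a 16MB table (order 24) holds 1048576.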
2012-05-04 10:32:53 +08:00
|
|
|
if (req != 1 || flags == 3 ||
|
2016-12-20 13:49:01 +08:00
|
|
|
pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt)) {
|
2011-12-15 10:01:10 +08:00
|
|
|
/* parameter error */
|
|
|
|
args[j] = ((0xa0 | flags) << 56) + pte_index;
|
|
|
|
ret = H_PARAMETER;
|
2011-06-29 08:22:05 +08:00
|
|
|
break;
|
2011-12-15 10:01:10 +08:00
|
|
|
}
|
2016-12-20 13:49:00 +08:00
|
|
|
hp = (__be64 *) (kvm->arch.hpt.virt + (pte_index << 4));
|
2011-12-15 10:01:10 +08:00
|
|
|
/* to avoid deadlock, don't spin except for first */
|
|
|
|
if (!try_lock_hpte(hp, HPTE_V_HVLOCK)) {
|
|
|
|
if (n)
|
|
|
|
break;
|
|
|
|
while (!try_lock_hpte(hp, HPTE_V_HVLOCK))
|
|
|
|
cpu_relax();
|
|
|
|
}
|
|
|
|
found = 0;
|
2014-06-11 16:16:06 +08:00
|
|
|
hp0 = be64_to_cpu(hp[0]);
|
2016-11-04 13:55:12 +08:00
|
|
|
hp1 = be64_to_cpu(hp[1]);
|
2016-11-16 13:57:24 +08:00
|
|
|
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
|
|
|
|
hp0 = hpte_new_to_old_v(hp0, hp1);
|
|
|
|
hp1 = hpte_new_to_old_r(hp1);
|
|
|
|
}
|
2014-06-11 16:16:06 +08:00
|
|
|
if (hp0 & (HPTE_V_ABSENT | HPTE_V_VALID)) {
|
2011-12-15 10:01:10 +08:00
|
|
|
switch (flags & 3) {
|
|
|
|
case 0: /* absolute */
|
2011-06-29 08:22:05 +08:00
|
|
|
found = 1;
|
2011-12-15 10:01:10 +08:00
|
|
|
break;
|
|
|
|
case 1: /* andcond */
|
2014-06-11 16:16:06 +08:00
|
|
|
if (!(hp0 & args[j + 1]))
|
2011-12-15 10:01:10 +08:00
|
|
|
found = 1;
|
|
|
|
break;
|
|
|
|
case 2: /* AVPN */
|
2014-06-11 16:16:06 +08:00
|
|
|
if ((hp0 & ~0x7fUL) == args[j + 1])
|
2011-12-15 10:01:10 +08:00
|
|
|
found = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!found) {
|
2014-06-11 16:16:06 +08:00
|
|
|
hp[0] &= ~cpu_to_be64(HPTE_V_HVLOCK);
|
2011-12-15 10:01:10 +08:00
|
|
|
args[j] = ((0x90 | flags) << 56) + pte_index;
|
|
|
|
continue;
|
2011-06-29 08:22:05 +08:00
|
|
|
}
|
2011-12-15 10:01:10 +08:00
|
|
|
|
|
|
|
args[j] = ((0x80 | flags) << 56) + pte_index;
|
2016-12-20 13:49:00 +08:00
|
|
|
rev = real_vmalloc_addr(&kvm->arch.hpt.rev[pte_index]);
|
2012-11-20 06:52:49 +08:00
|
|
|
note_hpte_modification(kvm, rev);
|
2011-12-15 10:01:10 +08:00
|
|
|
|
2014-06-11 16:16:06 +08:00
|
|
|
if (!(hp0 & HPTE_V_VALID)) {
|
2011-12-15 10:02:02 +08:00
|
|
|
/* insert R and C bits from PTE */
|
|
|
|
rcbits = rev->guest_rpte & (HPTE_R_R|HPTE_R_C);
|
|
|
|
args[j] |= rcbits << (56 - 5);
|
2012-05-10 07:49:24 +08:00
|
|
|
hp[0] = 0;
|
2016-11-04 13:55:12 +08:00
|
|
|
if (is_mmio_hpte(hp0, hp1))
|
|
|
|
atomic64_inc(&kvm->arch.mmio_update);
|
2011-12-15 10:01:10 +08:00
|
|
|
continue;
|
2011-12-15 10:02:02 +08:00
|
|
|
}
|
2011-12-15 10:01:10 +08:00
|
|
|
|
2014-06-11 16:16:06 +08:00
|
|
|
/* leave it locked */
|
|
|
|
hp[0] &= ~cpu_to_be64(HPTE_V_VALID);
|
2016-11-16 13:57:24 +08:00
|
|
|
tlbrb[n] = compute_tlbie_rb(hp0, hp1, pte_index);
|
2011-12-15 10:01:10 +08:00
|
|
|
indexes[n] = j;
|
|
|
|
hptes[n] = hp;
|
|
|
|
revs[n] = rev;
|
|
|
|
++n;
|
2011-06-29 08:22:05 +08:00
|
|
|
}
|
2011-12-15 10:01:10 +08:00
|
|
|
|
|
|
|
if (!n)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/* Now that we've collected a batch, do the tlbies */
|
2013-07-08 18:08:25 +08:00
|
|
|
do_tlbies(kvm, tlbrb, n, global, true);
|
2011-12-15 10:01:10 +08:00
|
|
|
|
2011-12-15 10:02:02 +08:00
|
|
|
/* Read PTE low words after tlbie to get final R/C values */
|
2011-12-15 10:01:10 +08:00
|
|
|
for (k = 0; k < n; ++k) {
|
|
|
|
j = indexes[k];
|
|
|
|
pte_index = args[j] & ((1ul << 56) - 1);
|
|
|
|
hp = hptes[k];
|
|
|
|
rev = revs[k];
|
2014-06-11 16:16:06 +08:00
|
|
|
remove_revmap_chain(kvm, pte_index, rev,
|
|
|
|
be64_to_cpu(hp[0]), be64_to_cpu(hp[1]));
|
2011-12-15 10:02:02 +08:00
|
|
|
rcbits = rev->guest_rpte & (HPTE_R_R|HPTE_R_C);
|
|
|
|
args[j] |= rcbits << (56 - 5);
|
2015-03-20 17:39:43 +08:00
|
|
|
__unlock_hpte(hp, 0);
|
KVM: PPC: Implement MMIO emulation support for Book3S HV guests
This provides the low-level support for MMIO emulation in Book3S HV
guests. When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page. Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page. An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.
When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
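The treatment of absent HPTEs described in this commit message can be sketched as follows. The bit values are illustrative only and may not match the kernel's definitions, and the `toy_` helpers are hypothetical. The sketch also assumes an entry is never marked both absent and valid, as the message implies.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only; the kernel's definitions may differ. */
#define TOY_HPTE_V_VALID   UINT64_C(0x1)
#define TOY_HPTE_V_ABSENT  UINT64_C(0x20)

/* Present an HPTE first doubleword the way a guest should see it. */
static uint64_t toy_guest_view(uint64_t v)
{
	/* Assumed invariant: absent entries have the valid bit clear. */
	assert(!((v & TOY_HPTE_V_ABSENT) && (v & TOY_HPTE_V_VALID)));

	if (v & TOY_HPTE_V_ABSENT)
		v = (v & ~TOY_HPTE_V_ABSENT) | TOY_HPTE_V_VALID;
	return v;
}

/* Only entries the hardware could have used need a tlbie. */
static int toy_needs_tlbie(uint64_t v)
{
	return (v & TOY_HPTE_V_VALID) != 0;
}
```

The guest therefore sees an absent entry as an ordinary valid one, while the hypervisor can skip the TLB invalidation because the hardware never treated the entry as valid.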
2011-12-12 20:36:37 +08:00
|
|
|
}
|
2011-06-29 08:22:05 +08:00
|
|
|
}
|
2011-12-15 10:01:10 +08:00
|
|
|
|
2011-06-29 08:22:05 +08:00
|
|
|
return ret;
|
|
|
|
}
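The control-word handling in kvmppc_h_bulk_remove() above can be isolated into a small user-space sketch. The `toy_` names and the struct are hypothetical. The bit layout and the status values are taken from the function body: 0x80 when the entry is found, 0x90 when it is not found, and 0xa0 on a parameter error.

```c
#include <assert.h>
#include <stdint.h>

struct toy_bulk_req {
	unsigned int req;	/* top two bits: 1 = remove, 3 = no more requests */
	unsigned int flags;	/* low two bits of the top byte */
	uint64_t pte_index;	/* low 56 bits */
};

/* Split one control word into request type, flags and PTE index. */
static struct toy_bulk_req toy_decode(uint64_t word)
{
	struct toy_bulk_req r;
	uint64_t top = word >> 56;

	r.req = (unsigned int)(top >> 6);
	r.flags = (unsigned int)(top & 3);
	r.pte_index = word & ((UINT64_C(1) << 56) - 1);
	return r;
}

/* Build the response word that is written back over the request. */
static uint64_t toy_encode_response(unsigned int status, unsigned int flags,
				    uint64_t pte_index)
{
	assert(flags <= 3);
	return ((uint64_t)(status | flags) << 56) + pte_index;
}
```

For example, a request word with top byte 0x41 decodes to request type 1 with flags 1, and a successful removal writes back the same index with top byte 0x81.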
|
|
|
|
|
|
|
|
long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
|
|
|
|
unsigned long pte_index, unsigned long avpn,
|
|
|
|
unsigned long va)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = vcpu->kvm;
|
2014-06-11 16:16:06 +08:00
|
|
|
__be64 *hpte;
|
2011-12-12 20:27:39 +08:00
|
|
|
struct revmap_entry *rev;
|
|
|
|
unsigned long v, r, rb, mask, bits;
|
2016-11-16 13:57:24 +08:00
|
|
|
u64 pte_v, pte_r;
|
2011-06-29 08:22:05 +08:00
|
|
|
|
2017-01-30 18:21:49 +08:00
|
|
|
if (kvm_is_radix(kvm))
|
|
|
|
return H_FUNCTION;
|
2016-12-20 13:49:01 +08:00
|
|
|
if (pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt))
|
2011-06-29 08:22:05 +08:00
|
|
|
return H_PARAMETER;
|
2011-12-12 20:36:37 +08:00
|
|
|
|
2016-12-20 13:49:00 +08:00
|
|
|
hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
|
2011-12-12 20:30:16 +08:00
|
|
|
while (!try_lock_hpte(hpte, HPTE_V_HVLOCK))
|
2011-06-29 08:22:05 +08:00
|
|
|
cpu_relax();
|
2016-11-16 13:57:24 +08:00
|
|
|
v = pte_v = be64_to_cpu(hpte[0]);
|
|
|
|
if (cpu_has_feature(CPU_FTR_ARCH_300))
|
|
|
|
v = hpte_new_to_old_v(v, be64_to_cpu(hpte[1]));
|
|
|
|
if ((v & (HPTE_V_ABSENT | HPTE_V_VALID)) == 0 ||
|
|
|
|
((flags & H_AVPN) && (v & ~0x7fUL) != avpn)) {
|
|
|
|
__unlock_hpte(hpte, pte_v);
|
2011-06-29 08:22:05 +08:00
|
|
|
return H_NOT_FOUND;
|
|
|
|
}
|
2011-12-12 20:36:37 +08:00
|
|
|
|
2016-11-16 13:57:24 +08:00
|
|
|
pte_r = be64_to_cpu(hpte[1]);
|
2011-12-12 20:27:39 +08:00
|
|
|
bits = (flags << 55) & HPTE_R_PP0;
|
|
|
|
bits |= (flags << 48) & HPTE_R_KEY_HI;
|
|
|
|
bits |= flags & (HPTE_R_PP | HPTE_R_N | HPTE_R_KEY_LO);
|
|
|
|
|
|
|
|
/* Update guest view of 2nd HPTE dword */
|
|
|
|
mask = HPTE_R_PP0 | HPTE_R_PP | HPTE_R_N |
|
|
|
|
HPTE_R_KEY_HI | HPTE_R_KEY_LO;
|
2016-12-20 13:49:00 +08:00
|
|
|
rev = real_vmalloc_addr(&kvm->arch.hpt.rev[pte_index]);
|
2011-12-12 20:27:39 +08:00
|
|
|
if (rev) {
|
|
|
|
r = (rev->guest_rpte & ~mask) | bits;
|
|
|
|
rev->guest_rpte = r;
|
2012-11-20 06:52:49 +08:00
|
|
|
note_hpte_modification(kvm, rev);
|
2011-12-12 20:27:39 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Update HPTE */
|
2011-12-12 20:36:37 +08:00
|
|
|
if (v & HPTE_V_VALID) {
|
2012-11-22 07:28:41 +08:00
|
|
|
/*
|
2014-11-03 12:51:58 +08:00
|
|
|
* If the page is valid, don't let it transition from
|
|
|
|
* readonly to writable. If it should be writable, we'll
|
|
|
|
* take a trap and let the page fault code sort it out.
|
2012-11-22 07:28:41 +08:00
|
|
|
*/
|
2016-11-16 13:57:24 +08:00
|
|
|
r = (pte_r & ~mask) | bits;
|
|
|
|
if (hpte_is_writable(r) && !hpte_is_writable(pte_r))
|
2014-11-03 12:51:58 +08:00
|
|
|
r = hpte_make_readonly(r);
|
|
|
|
/* If the PTE is changing, invalidate it first */
|
2016-11-16 13:57:24 +08:00
|
|
|
if (r != pte_r) {
|
2014-11-03 12:51:58 +08:00
|
|
|
rb = compute_tlbie_rb(v, r, pte_index);
|
2016-11-16 13:57:24 +08:00
|
|
|
hpte[0] = cpu_to_be64((pte_v & ~HPTE_V_VALID) |
|
2014-11-03 12:51:58 +08:00
|
|
|
HPTE_V_ABSENT);
|
2017-11-06 20:27:44 +08:00
|
|
|
do_tlbies(kvm, &rb, 1, global_invalidates(kvm), true);
|
2016-11-16 13:43:28 +08:00
|
|
|
/* Don't lose R/C bit updates done by hardware */
|
|
|
|
r |= be64_to_cpu(hpte[1]) & (HPTE_R_R | HPTE_R_C);
|
2014-11-03 12:51:58 +08:00
|
|
|
hpte[1] = cpu_to_be64(r);
|
2012-11-22 07:28:41 +08:00
|
|
|
}
|
2011-06-29 08:22:05 +08:00
|
|
|
}
|
2016-11-16 13:57:24 +08:00
|
|
|
unlock_hpte(hpte, pte_v & ~HPTE_V_HVLOCK);
|
2011-06-29 08:22:05 +08:00
|
|
|
asm volatile("ptesync" : : : "memory");
|
2016-11-16 13:57:24 +08:00
|
|
|
if (is_mmio_hpte(v, pte_r))
|
2016-11-04 13:55:12 +08:00
|
|
|
atomic64_inc(&kvm->arch.mmio_update);
|
|
|
|
|
2011-06-29 08:22:05 +08:00
|
|
|
return H_SUCCESS;
|
|
|
|
}
|
|
|
|
|
|
|
|
long kvmppc_h_read(struct kvm_vcpu *vcpu, unsigned long flags,
|
|
|
|
unsigned long pte_index)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = vcpu->kvm;
|
2014-06-11 16:16:06 +08:00
|
|
|
__be64 *hpte;
|
|
|
|
unsigned long v, r;
|
2011-06-29 08:22:05 +08:00
|
|
|
int i, n = 1;
|
2011-12-12 20:27:39 +08:00
|
|
|
struct revmap_entry *rev = NULL;
|
2011-06-29 08:22:05 +08:00
|
|
|
|
2017-01-30 18:21:49 +08:00
|
|
|
if (kvm_is_radix(kvm))
|
|
|
|
return H_FUNCTION;
|
2016-12-20 13:49:01 +08:00
|
|
|
if (pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt))
|
2011-06-29 08:22:05 +08:00
|
|
|
return H_PARAMETER;
|
|
|
|
if (flags & H_READ_4) {
|
|
|
|
pte_index &= ~3;
|
|
|
|
n = 4;
|
|
|
|
}
|
2016-12-20 13:49:00 +08:00
|
|
|
rev = real_vmalloc_addr(&kvm->arch.hpt.rev[pte_index]);
|
2011-06-29 08:22:05 +08:00
|
|
|
for (i = 0; i < n; ++i, ++pte_index) {
|
2016-12-20 13:49:00 +08:00
|
|
|
hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
|
2014-06-11 16:16:06 +08:00
|
|
|
v = be64_to_cpu(hpte[0]) & ~HPTE_V_HVLOCK;
|
|
|
|
r = be64_to_cpu(hpte[1]);
|
2016-11-16 13:57:24 +08:00
|
|
|
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
|
|
|
|
v = hpte_new_to_old_v(v, r);
|
|
|
|
r = hpte_new_to_old_r(r);
|
|
|
|
}
|
2011-12-12 20:36:37 +08:00
|
|
|
if (v & HPTE_V_ABSENT) {
|
|
|
|
v &= ~HPTE_V_ABSENT;
|
|
|
|
v |= HPTE_V_VALID;
|
|
|
|
}
|
2012-11-20 06:52:49 +08:00
|
|
|
if (v & HPTE_V_VALID) {
|
2011-12-15 10:02:02 +08:00
|
|
|
r = rev[i].guest_rpte | (r & (HPTE_R_R | HPTE_R_C));
|
2012-11-20 06:52:49 +08:00
|
|
|
r &= ~HPTE_GR_RESERVED;
|
|
|
|
}
|
2018-05-07 14:20:07 +08:00
|
|
|
vcpu->arch.regs.gpr[4 + i * 2] = v;
|
|
|
|
vcpu->arch.regs.gpr[5 + i * 2] = r;
|
2011-06-29 08:22:05 +08:00
|
|
|
}
|
|
|
|
return H_SUCCESS;
|
|
|
|
}
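The H_READ handler above does three small pieces of arithmetic: it rounds the index down to a multiple of four when H_READ_4 is set, it locates each 16-byte HPTE by shifting the index left by 4, and it reports an absent entry to the guest as if it were valid. The following is a minimal standalone sketch of those steps, not the kernel code itself. The `DEMO_` names and the bit values are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values; assumptions for this sketch only. */
#define DEMO_HPTE_V_VALID   0x01ULL
#define DEMO_HPTE_V_ABSENT  0x20ULL
#define DEMO_HPTE_V_HVLOCK  0x40ULL

/* Round an HPTE index down to the start of its group of four. */
static inline uint64_t demo_read4_base(uint64_t pte_index)
{
	return pte_index & ~3ULL;
}

/* Byte offset of an HPTE within the table: each entry is 16 bytes. */
static inline uint64_t demo_hpte_offset(uint64_t pte_index)
{
	return pte_index << 4;
}

/*
 * First doubleword as shown to the guest: the hypervisor lock bit is
 * hidden and an absent entry is reported as a valid one.
 */
static inline uint64_t demo_guest_view_v(uint64_t v)
{
	v &= ~DEMO_HPTE_V_HVLOCK;
	if (v & DEMO_HPTE_V_ABSENT) {
		v &= ~DEMO_HPTE_V_ABSENT;
		v |= DEMO_HPTE_V_VALID;
	}
	return v;
}
```

For example, index 7 belongs to the group starting at index 4, and entry 5 starts 80 bytes into the table.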
|
2011-12-12 20:36:37 +08:00
|
|
|
|
2015-06-24 19:18:07 +08:00
|
|
|
long kvmppc_h_clear_ref(struct kvm_vcpu *vcpu, unsigned long flags,
|
|
|
|
unsigned long pte_index)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = vcpu->kvm;
|
|
|
|
__be64 *hpte;
|
|
|
|
unsigned long v, r, gr;
|
|
|
|
struct revmap_entry *rev;
|
|
|
|
unsigned long *rmap;
|
|
|
|
long ret = H_NOT_FOUND;
|
|
|
|
|
2017-01-30 18:21:49 +08:00
|
|
|
if (kvm_is_radix(kvm))
|
|
|
|
return H_FUNCTION;
|
2016-12-20 13:49:01 +08:00
|
|
|
if (pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt))
|
2015-06-24 19:18:07 +08:00
|
|
|
return H_PARAMETER;
|
|
|
|
|
2016-12-20 13:49:00 +08:00
|
|
|
rev = real_vmalloc_addr(&kvm->arch.hpt.rev[pte_index]);
|
|
|
|
hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
|
2015-06-24 19:18:07 +08:00
|
|
|
while (!try_lock_hpte(hpte, HPTE_V_HVLOCK))
|
|
|
|
cpu_relax();
|
|
|
|
v = be64_to_cpu(hpte[0]);
|
|
|
|
r = be64_to_cpu(hpte[1]);
|
|
|
|
if (!(v & (HPTE_V_VALID | HPTE_V_ABSENT)))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
gr = rev->guest_rpte;
|
|
|
|
if (rev->guest_rpte & HPTE_R_R) {
|
|
|
|
rev->guest_rpte &= ~HPTE_R_R;
|
|
|
|
note_hpte_modification(kvm, rev);
|
|
|
|
}
|
|
|
|
if (v & HPTE_V_VALID) {
|
|
|
|
gr |= r & (HPTE_R_R | HPTE_R_C);
|
|
|
|
if (r & HPTE_R_R) {
|
|
|
|
kvmppc_clear_ref_hpte(kvm, hpte, pte_index);
|
KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix
Currently, the HPT code in HV KVM maintains a dirty bit per guest page
in the rmap array, whether or not dirty page tracking has been enabled
for the memory slot. In contrast, the radix code maintains a dirty
bit per guest page in memslot->dirty_bitmap, and only does so when
dirty page tracking has been enabled.
This changes the HPT code to maintain the dirty bits in the memslot
dirty_bitmap like radix does. This results in slightly less code
overall, and will mean that we do not lose the dirty bits when
transitioning between HPT and radix mode in future.
There is one minor change to behaviour as a result. With HPT, when
dirty tracking was enabled for a memslot, we would previously clear
all the dirty bits at that point (both in the HPT entries and in the
rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately
following would show no pages as dirty (assuming no vcpus have run
in the meantime). With this change, the dirty bits on HPT entries
are not cleared at the point where dirty tracking is enabled, so
KVM_GET_DIRTY_LOG would show as dirty any guest pages that are
resident in the HPT and dirty. This is consistent with what happens
on radix.
This also fixes a bug in the mark_pages_dirty() function for radix
(in the sense that the function no longer exists). In the case where
a large page of 64 normal pages or more is marked dirty, the
addressing of the dirty bitmap was incorrect and could write past
the end of the bitmap. Fortunately this case was never hit in
practice because a 2MB large page is only 32 x 64kB pages, and we
don't support backing the guest with 1GB huge pages at this point.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-26 13:39:19 +08:00
|
|
|
rmap = revmap_for_hpte(kvm, v, gr, NULL, NULL);
|
2015-06-24 19:18:07 +08:00
|
|
|
if (rmap) {
|
|
|
|
lock_rmap(rmap);
|
|
|
|
*rmap |= KVMPPC_RMAP_REFERENCED;
|
|
|
|
unlock_rmap(rmap);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2018-05-07 14:20:07 +08:00
|
|
|
vcpu->arch.regs.gpr[4] = gr;
|
2015-06-24 19:18:07 +08:00
|
|
|
ret = H_SUCCESS;
|
|
|
|
out:
|
|
|
|
unlock_hpte(hpte, v & ~HPTE_V_HVLOCK);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
long kvmppc_h_clear_mod(struct kvm_vcpu *vcpu, unsigned long flags,
|
|
|
|
unsigned long pte_index)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = vcpu->kvm;
|
|
|
|
__be64 *hpte;
|
|
|
|
unsigned long v, r, gr;
|
|
|
|
struct revmap_entry *rev;
|
|
|
|
long ret = H_NOT_FOUND;
|
|
|
|
|
2017-01-30 18:21:49 +08:00
|
|
|
if (kvm_is_radix(kvm))
|
|
|
|
return H_FUNCTION;
|
2016-12-20 13:49:01 +08:00
|
|
|
if (pte_index >= kvmppc_hpt_npte(&kvm->arch.hpt))
|
2015-06-24 19:18:07 +08:00
|
|
|
return H_PARAMETER;
|
|
|
|
|
2016-12-20 13:49:00 +08:00
|
|
|
rev = real_vmalloc_addr(&kvm->arch.hpt.rev[pte_index]);
|
|
|
|
hpte = (__be64 *)(kvm->arch.hpt.virt + (pte_index << 4));
|
2015-06-24 19:18:07 +08:00
|
|
|
while (!try_lock_hpte(hpte, HPTE_V_HVLOCK))
|
|
|
|
cpu_relax();
|
|
|
|
v = be64_to_cpu(hpte[0]);
|
|
|
|
r = be64_to_cpu(hpte[1]);
|
|
|
|
if (!(v & (HPTE_V_VALID | HPTE_V_ABSENT)))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
gr = rev->guest_rpte;
|
|
|
|
if (gr & HPTE_R_C) {
|
|
|
|
rev->guest_rpte &= ~HPTE_R_C;
|
|
|
|
note_hpte_modification(kvm, rev);
|
|
|
|
}
|
|
|
|
if (v & HPTE_V_VALID) {
|
|
|
|
/* need to make it temporarily absent so C is stable */
|
|
|
|
hpte[0] |= cpu_to_be64(HPTE_V_ABSENT);
|
|
|
|
kvmppc_invalidate_hpte(kvm, hpte, pte_index);
|
|
|
|
r = be64_to_cpu(hpte[1]);
|
|
|
|
gr |= r & (HPTE_R_R | HPTE_R_C);
|
|
|
|
if (r & HPTE_R_C) {
|
|
|
|
hpte[1] = cpu_to_be64(r & ~HPTE_R_C);
|
|
|
|
eieio();
|
2017-10-26 13:39:19 +08:00
|
|
|
kvmppc_set_dirty_from_hpte(kvm, v, gr);
|
2015-06-24 19:18:07 +08:00
|
|
|
}
|
|
|
|
}
|
2018-05-07 14:20:07 +08:00
|
|
|
vcpu->arch.regs.gpr[4] = gr;
|
2015-06-24 19:18:07 +08:00
|
|
|
ret = H_SUCCESS;
|
|
|
|
out:
|
|
|
|
unlock_hpte(hpte, v & ~HPTE_V_HVLOCK);
|
|
|
|
return ret;
|
|
|
|
}
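Both kvmppc_h_clear_ref() and kvmppc_h_clear_mod() spin on try_lock_hpte() and release the entry by writing back the first doubleword with HPTE_V_HVLOCK cleared. The sketch below shows the same bit-lock idea using portable C11 atomics. It is an analogy under stated assumptions, not the kernel helper: the `demo_` functions and the lock bit value are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_LOCK_BIT 0x40ULL /* lock bit inside the word; assumed value */

/* Try once to set the lock bit; fail if it is already set. */
static inline bool demo_try_lock(_Atomic uint64_t *word)
{
	uint64_t old = atomic_load_explicit(word, memory_order_relaxed);

	if (old & DEMO_LOCK_BIT)
		return false;
	return atomic_compare_exchange_strong_explicit(
		word, &old, old | DEMO_LOCK_BIT,
		memory_order_acquire, memory_order_relaxed);
}

/* Publish the new word contents and drop the lock in a single store. */
static inline void demo_unlock(_Atomic uint64_t *word, uint64_t new_val)
{
	atomic_store_explicit(word, new_val & ~DEMO_LOCK_BIT,
			      memory_order_release);
}
```

Because the lock lives in the same word as the data, unlocking and updating the entry happen in one release store, which is why the unlock takes the new value as an argument.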
|
|
|
|
|
2019-03-22 14:05:45 +08:00
|
|
|
static int kvmppc_get_hpa(struct kvm_vcpu *vcpu, unsigned long gpa,
|
|
|
|
int writing, unsigned long *hpa,
|
|
|
|
struct kvm_memory_slot **memslot_p)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = vcpu->kvm;
|
|
|
|
struct kvm_memory_slot *memslot;
|
|
|
|
unsigned long gfn, hva, pa, psize = PAGE_SIZE;
|
|
|
|
unsigned int shift;
|
|
|
|
pte_t *ptep, pte;
|
|
|
|
|
|
|
|
/* Find the memslot for this address */
|
|
|
|
gfn = gpa >> PAGE_SHIFT;
|
|
|
|
memslot = __gfn_to_memslot(kvm_memslots_raw(kvm), gfn);
|
|
|
|
if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
|
|
|
|
return H_PARAMETER;
|
|
|
|
|
|
|
|
/* Translate to host virtual address */
|
|
|
|
hva = __gfn_to_hva_memslot(memslot, gfn);
|
|
|
|
|
|
|
|
/* Try to find the host pte for that virtual address */
|
|
|
|
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
|
|
|
|
if (!ptep)
|
|
|
|
return H_TOO_HARD;
|
|
|
|
pte = kvmppc_read_update_linux_pte(ptep, writing);
|
|
|
|
if (!pte_present(pte))
|
|
|
|
return H_TOO_HARD;
|
|
|
|
|
|
|
|
/* Convert to a physical address */
|
|
|
|
if (shift)
|
|
|
|
psize = 1UL << shift;
|
|
|
|
pa = pte_pfn(pte) << PAGE_SHIFT;
|
|
|
|
pa |= hva & (psize - 1);
|
|
|
|
pa |= gpa & ~PAGE_MASK;
|
|
|
|
|
|
|
|
if (hpa)
|
|
|
|
*hpa = pa;
|
|
|
|
if (memslot_p)
|
|
|
|
*memslot_p = memslot;
|
|
|
|
|
|
|
|
return H_SUCCESS;
|
|
|
|
}
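The last step of kvmppc_get_hpa() builds the physical address from three parts: the page frame number from the host PTE, the offset of the base page inside a larger host mapping, and the byte offset inside the page. A minimal sketch of that composition follows. The function name and its parameters are hypothetical, and the page sizes are passed in as shifts so the example does not depend on kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * page_shift: log2 of the base page size.
 * map_shift:  log2 of the size of the host mapping, or 0 when the
 *             mapping is a single base page.
 */
static inline uint64_t demo_compose_pa(uint64_t pfn, uint64_t hva, uint64_t gpa,
				       unsigned int page_shift,
				       unsigned int map_shift)
{
	assert(page_shift < 64 && map_shift < 64);

	uint64_t page_size = 1ULL << page_shift;
	uint64_t psize = map_shift ? (1ULL << map_shift) : page_size;
	uint64_t pa = pfn << page_shift;

	pa |= hva & (psize - 1);     /* position inside a large mapping */
	pa |= gpa & (page_size - 1); /* byte offset inside the page */
	return pa;
}
```

With 64K base pages, frame 0x1234 and a guest address ending in 0x0abc give 0x12340abc. With a 16M mapping, the bits of the host virtual address between the base page size and the mapping size are carried over as well.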
|
|
|
|
|
|
|
|
static long kvmppc_do_h_page_init_zero(struct kvm_vcpu *vcpu,
|
|
|
|
unsigned long dest)
|
|
|
|
{
|
|
|
|
struct kvm_memory_slot *memslot;
|
|
|
|
struct kvm *kvm = vcpu->kvm;
|
|
|
|
unsigned long pa, mmu_seq;
|
|
|
|
long ret = H_SUCCESS;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/* Used later to detect if we might have been invalidated */
|
|
|
|
mmu_seq = kvm->mmu_notifier_seq;
|
|
|
|
smp_rmb();
|
|
|
|
|
|
|
|
ret = kvmppc_get_hpa(vcpu, dest, 1, &pa, &memslot);
|
|
|
|
if (ret != H_SUCCESS)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* Check if we've been invalidated */
|
|
|
|
raw_spin_lock(&kvm->mmu_lock.rlock);
|
|
|
|
if (mmu_notifier_retry(kvm, mmu_seq)) {
|
|
|
|
ret = H_TOO_HARD;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Zero the page */
|
|
|
|
for (i = 0; i < SZ_4K; i += L1_CACHE_BYTES, pa += L1_CACHE_BYTES)
|
|
|
|
dcbz((void *)pa);
|
|
|
|
kvmppc_update_dirty_map(memslot, dest >> PAGE_SHIFT, PAGE_SIZE);
|
|
|
|
|
|
|
|
out_unlock:
|
|
|
|
raw_spin_unlock(&kvm->mmu_lock.rlock);
|
|
|
|
return ret;
|
|
|
|
}
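kvmppc_do_h_page_init_zero() snapshots the notifier sequence number before the lockless translation and rechecks it under mmu_lock before touching the page, returning H_TOO_HARD if an invalidation may have happened in between. The single-threaded model below sketches that snapshot-and-recheck pattern only. All names are hypothetical, locking and memory barriers are left out, and the kernel's mmu_notifier_retry() does more than compare one counter.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_mmu {
	unsigned long notifier_seq; /* bumped by every invalidation */
};

/* Hypothetical invalidation: translations made before this are stale. */
static inline void demo_invalidate(struct demo_mmu *mmu)
{
	mmu->notifier_seq++;
}

static inline bool demo_retry_needed(const struct demo_mmu *mmu,
				     unsigned long snap)
{
	return mmu->notifier_seq != snap;
}

/*
 * Returns 0 when the translation may be used and -1 when the caller
 * has to retry. race_with_invalidate simulates an invalidation that
 * arrives between the lookup and the use of its result.
 */
static inline int demo_translate_and_use(struct demo_mmu *mmu,
					 bool race_with_invalidate)
{
	unsigned long snap = mmu->notifier_seq; /* 1. snapshot */

	/* 2. the lockless translation would happen here */
	if (race_with_invalidate)
		demo_invalidate(mmu);

	/* 3. recheck; the handler above does this with mmu_lock held */
	if (demo_retry_needed(mmu, snap))
		return -1;

	/* 4. safe to use the translated address */
	return 0;
}
```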
|
|
|
|
|
|
|
|
static long kvmppc_do_h_page_init_copy(struct kvm_vcpu *vcpu,
|
|
|
|
unsigned long dest, unsigned long src)
|
|
|
|
{
|
|
|
|
unsigned long dest_pa, src_pa, mmu_seq;
|
|
|
|
struct kvm_memory_slot *dest_memslot;
|
|
|
|
struct kvm *kvm = vcpu->kvm;
|
|
|
|
long ret = H_SUCCESS;
|
|
|
|
|
|
|
|
/* Used later to detect if we might have been invalidated */
|
|
|
|
mmu_seq = kvm->mmu_notifier_seq;
|
|
|
|
smp_rmb();
|
|
|
|
|
|
|
|
ret = kvmppc_get_hpa(vcpu, dest, 1, &dest_pa, &dest_memslot);
|
|
|
|
if (ret != H_SUCCESS)
|
|
|
|
return ret;
|
|
|
|
ret = kvmppc_get_hpa(vcpu, src, 0, &src_pa, NULL);
|
|
|
|
if (ret != H_SUCCESS)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* Check if we've been invalidated */
|
|
|
|
raw_spin_lock(&kvm->mmu_lock.rlock);
|
|
|
|
if (mmu_notifier_retry(kvm, mmu_seq)) {
|
|
|
|
ret = H_TOO_HARD;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Copy the page */
|
|
|
|
memcpy((void *)dest_pa, (void *)src_pa, SZ_4K);
|
|
|
|
|
|
|
|
kvmppc_update_dirty_map(dest_memslot, dest >> PAGE_SHIFT, PAGE_SIZE);
|
|
|
|
|
|
|
|
out_unlock:
|
|
|
|
raw_spin_unlock(&kvm->mmu_lock.rlock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
long kvmppc_rm_h_page_init(struct kvm_vcpu *vcpu, unsigned long flags,
|
|
|
|
unsigned long dest, unsigned long src)
|
|
|
|
{
|
|
|
|
struct kvm *kvm = vcpu->kvm;
|
|
|
|
u64 pg_mask = SZ_4K - 1; /* 4K page size */
|
|
|
|
long ret = H_SUCCESS;
|
|
|
|
|
|
|
|
/* Don't handle radix mode here, go up to the virtual mode handler */
|
|
|
|
if (kvm_is_radix(kvm))
|
|
|
|
return H_TOO_HARD;
|
|
|
|
|
|
|
|
/* Check for invalid flags (H_PAGE_SET_LOANED covers all CMO flags) */
|
|
|
|
if (flags & ~(H_ICACHE_INVALIDATE | H_ICACHE_SYNCHRONIZE |
|
|
|
|
H_ZERO_PAGE | H_COPY_PAGE | H_PAGE_SET_LOANED))
|
|
|
|
return H_PARAMETER;
|
|
|
|
|
|
|
|
/* dest (and src if copy_page flag set) must be page aligned */
|
|
|
|
if ((dest & pg_mask) || ((flags & H_COPY_PAGE) && (src & pg_mask)))
|
|
|
|
return H_PARAMETER;
|
|
|
|
|
|
|
|
/* zero and/or copy the page as determined by the flags */
|
|
|
|
if (flags & H_COPY_PAGE)
|
|
|
|
ret = kvmppc_do_h_page_init_copy(vcpu, dest, src);
|
|
|
|
else if (flags & H_ZERO_PAGE)
|
|
|
|
ret = kvmppc_do_h_page_init_zero(vcpu, dest);
|
|
|
|
|
|
|
|
/* We can ignore the other flags */
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
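The argument checks in kvmppc_rm_h_page_init() follow a fixed order: reject unknown flag bits, reject a misaligned dest, reject a misaligned src only when a copy was requested, then prefer copy over zero. The sketch below restates that decision as a pure function. The flag values are hypothetical placeholders chosen for the example and are not the real hcall flag encodings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch only. */
#define DEMO_F_ICACHE_INVALIDATE  (1ULL << 0)
#define DEMO_F_ICACHE_SYNCHRONIZE (1ULL << 1)
#define DEMO_F_ZERO_PAGE          (1ULL << 2)
#define DEMO_F_COPY_PAGE          (1ULL << 3)
#define DEMO_F_CMO_MASK           (3ULL << 4)
#define DEMO_PG_MASK              0xfffULL /* 4K page */

enum demo_action { DEMO_REJECT = -1, DEMO_NOTHING, DEMO_ZERO, DEMO_COPY };

static inline enum demo_action demo_page_init_action(uint64_t flags,
						     uint64_t dest,
						     uint64_t src)
{
	const uint64_t known = DEMO_F_ICACHE_INVALIDATE |
			       DEMO_F_ICACHE_SYNCHRONIZE |
			       DEMO_F_ZERO_PAGE | DEMO_F_COPY_PAGE |
			       DEMO_F_CMO_MASK;

	if (flags & ~known)
		return DEMO_REJECT;
	/* dest must always be aligned; src only matters for a copy */
	if (dest & DEMO_PG_MASK)
		return DEMO_REJECT;
	if ((flags & DEMO_F_COPY_PAGE) && (src & DEMO_PG_MASK))
		return DEMO_REJECT;

	if (flags & DEMO_F_COPY_PAGE)
		return DEMO_COPY;
	if (flags & DEMO_F_ZERO_PAGE)
		return DEMO_ZERO;
	return DEMO_NOTHING; /* remaining flags are accepted and ignored */
}
```

Note that when both the copy and zero flags are set, the copy wins, and a misaligned src is tolerated as long as no copy is requested.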
|
|
|
|
|
2014-06-11 16:16:06 +08:00
|
|
|
void kvmppc_invalidate_hpte(struct kvm *kvm, __be64 *hptep,
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:38:05 +08:00
|
|
|
unsigned long pte_index)
|
|
|
|
{
|
|
|
|
unsigned long rb;
|
2016-11-16 13:57:24 +08:00
|
|
|
u64 hp0, hp1;
|
2011-12-12 20:38:05 +08:00
|
|
|
|
2014-06-11 16:16:06 +08:00
|
|
|
hptep[0] &= ~cpu_to_be64(HPTE_V_VALID);
|
2016-11-16 13:57:24 +08:00
|
|
|
hp0 = be64_to_cpu(hptep[0]);
|
|
|
|
hp1 = be64_to_cpu(hptep[1]);
|
|
|
|
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
|
|
|
|
hp0 = hpte_new_to_old_v(hp0, hp1);
|
|
|
|
hp1 = hpte_new_to_old_r(hp1);
|
|
|
|
}
|
|
|
|
rb = compute_tlbie_rb(hp0, hp1, pte_index);
|
2013-07-08 18:08:25 +08:00
|
|
|
do_tlbies(kvm, &rb, 1, 1, true);
|
2011-12-12 20:38:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvmppc_invalidate_hpte);
|
|
|
|
|
2014-06-11 16:16:06 +08:00
|
|
|
void kvmppc_clear_ref_hpte(struct kvm *kvm, __be64 *hptep,
|
2011-12-15 10:02:47 +08:00
|
|
|
unsigned long pte_index)
|
|
|
|
{
|
|
|
|
unsigned long rb;
|
|
|
|
unsigned char rbyte;
|
2016-11-16 13:57:24 +08:00
|
|
|
u64 hp0, hp1;
|
2011-12-15 10:02:47 +08:00
|
|
|
|
2016-11-16 13:57:24 +08:00
|
|
|
hp0 = be64_to_cpu(hptep[0]);
|
|
|
|
hp1 = be64_to_cpu(hptep[1]);
|
|
|
|
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
|
|
|
|
hp0 = hpte_new_to_old_v(hp0, hp1);
|
|
|
|
hp1 = hpte_new_to_old_r(hp1);
|
|
|
|
}
|
|
|
|
rb = compute_tlbie_rb(hp0, hp1, pte_index);
|
2014-06-11 16:16:06 +08:00
|
|
|
rbyte = (be64_to_cpu(hptep[1]) & ~HPTE_R_R) >> 8;
|
2011-12-15 10:02:47 +08:00
|
|
|
/* modify only the second-last byte, which contains the ref bit */
|
|
|
|
*((char *)hptep + 14) = rbyte;
|
2013-07-08 18:08:25 +08:00
|
|
|
do_tlbies(kvm, &rb, 1, 1, false);
|
2011-12-15 10:02:47 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvmppc_clear_ref_hpte);
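kvmppc_clear_ref_hpte() clears the reference bit with a single byte store to offset 14 of the 16-byte entry instead of rewriting the whole second doubleword. The sketch below shows why that offset is the right one when the entry is stored big-endian: bits 8 to 15 of the second doubleword sit in byte 14. The helper names and the `DEMO_HPTE_R_R` value are assumptions for illustration, and the byte order is handled by hand so the example behaves the same on any host.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_HPTE_R_R 0x100ULL /* reference bit; value assumed here */

/* Store a 64-bit value into buf in big-endian byte order. */
static inline void demo_store_be64(uint8_t *buf, uint64_t val)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t)(val >> (56 - 8 * i));
}

/* Load a big-endian 64-bit value from buf. */
static inline uint64_t demo_load_be64(const uint8_t *buf)
{
	uint64_t val = 0;

	for (int i = 0; i < 8; i++)
		val = (val << 8) | buf[i];
	return val;
}

/*
 * Bytes 0..7 hold the first doubleword and bytes 8..15 the second.
 * Only byte 14 is written, so the other bytes are left untouched.
 */
static inline void demo_clear_ref_byte(uint8_t *hpte)
{
	uint64_t r = demo_load_be64(hpte + 8);

	hpte[14] = (uint8_t)((r & ~DEMO_HPTE_R_R) >> 8);
}

/* Build an entry whose second doubleword is r, clear R, read it back. */
static inline uint64_t demo_clear_ref_roundtrip(uint64_t r)
{
	uint8_t hpte[16] = { 0 };

	demo_store_be64(hpte + 8, r);
	demo_clear_ref_byte(hpte);
	return demo_load_be64(hpte + 8);
}
```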
|
|
|
|
|
2011-12-12 20:36:37 +08:00
|
|
|
static int slb_base_page_shift[4] = {
|
|
|
|
24, /* 16M */
|
|
|
|
16, /* 64k */
|
|
|
|
34, /* 16G */
|
|
|
|
20, /* 1M, unsupported */
|
|
|
|
};
|
|
|
|
|
KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries
This keeps a per-vcpu cache of recently page-faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can skip some
time-consuming paths (looking up the HPT, locking the HPTE twice and
searching the memslots for the MMIO gfn) and call
kvmppc_hv_emulate_mmio() directly.
In the current implementation, the cache is limited to four entries. We
think that is enough to cover the high-frequency MMIO HPTEs in most cases.
Take virtio devices as an example. For virtio legacy devices, one HPTE
can handle notifications from up to 1024 devices (64K page / 64-byte
Port IO register), so one cache entry is enough. For virtio modern
devices, we always need one HPTE per device to handle notifications,
because a modern device uses an 8M MMIO register to notify the host
instead of a Port IO register. A typical system configuration should
not exceed four virtio devices per vcpu, so four cache entries are also
enough in this case. Of course, if needed, we could turn the macro into
a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-04 13:55:12 +08:00
|
|
|
static struct mmio_hpte_cache_entry *mmio_cache_search(struct kvm_vcpu *vcpu,
|
|
|
|
unsigned long eaddr, unsigned long slb_v, long mmio_update)
|
|
|
|
{
|
|
|
|
struct mmio_hpte_cache_entry *entry = NULL;
|
|
|
|
unsigned int pshift;
|
|
|
|
unsigned int i;
|
|
|
|
|
|
|
|
for (i = 0; i < MMIO_HPTE_CACHE_SIZE; i++) {
|
|
|
|
entry = &vcpu->arch.mmio_cache.entry[i];
|
|
|
|
if (entry->mmio_update == mmio_update) {
|
|
|
|
pshift = entry->slb_base_pshift;
|
|
|
|
if ((entry->eaddr >> pshift) == (eaddr >> pshift) &&
|
|
|
|
entry->slb_v == slb_v)
|
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct mmio_hpte_cache_entry *
|
|
|
|
next_mmio_cache_entry(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
unsigned int index = vcpu->arch.mmio_cache.index;
|
|
|
|
|
|
|
|
vcpu->arch.mmio_cache.index++;
|
|
|
|
if (vcpu->arch.mmio_cache.index == MMIO_HPTE_CACHE_SIZE)
|
|
|
|
vcpu->arch.mmio_cache.index = 0;
|
|
|
|
|
|
|
|
return &vcpu->arch.mmio_cache.entry[index];
|
|
|
|
}
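mmio_cache_search() and next_mmio_cache_entry() together form a small fixed-size cache with round-robin replacement: a lookup matches on the update generation and on the address shifted by the entry's page shift, and a new entry always overwrites the slot at the rotating index. The standalone sketch below models that behaviour. The struct layout, the names, and the `demo_insert()` helper are hypothetical, and the slb_v comparison done by the real search is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_SIZE 4

struct demo_entry {
	uint64_t eaddr;
	unsigned int pshift; /* log2 of the page size of this entry */
	long generation;     /* stands in for mmio_update */
};

struct demo_cache {
	struct demo_entry entry[DEMO_CACHE_SIZE];
	unsigned int index;  /* next slot to overwrite */
};

/* Hand out slots in round-robin order, wrapping at the cache size. */
static inline struct demo_entry *demo_next_entry(struct demo_cache *c)
{
	unsigned int index = c->index;

	c->index++;
	if (c->index == DEMO_CACHE_SIZE)
		c->index = 0;
	return &c->entry[index];
}

/* Match on generation first, then on the page number of the address. */
static inline struct demo_entry *demo_search(struct demo_cache *c,
					     uint64_t eaddr, long generation)
{
	for (unsigned int i = 0; i < DEMO_CACHE_SIZE; i++) {
		struct demo_entry *e = &c->entry[i];

		if (e->generation == generation &&
		    (e->eaddr >> e->pshift) == (eaddr >> e->pshift))
			return e;
	}
	return NULL;
}

static inline void demo_insert(struct demo_cache *c, uint64_t eaddr,
			       unsigned int pshift, long generation)
{
	struct demo_entry *e = demo_next_entry(c);

	e->eaddr = eaddr;
	e->pshift = pshift;
	e->generation = generation;
}
```

Inserting a fifth entry into the four-slot cache overwrites the first one, and bumping the generation makes every existing entry miss without clearing the array.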
|
|
|
|
|
powerpc: kvm: fix rare but potential deadlock scene
kvmppc_hv_find_lock_hpte() is called from both virtual mode and real
mode, so it can trigger a deadlock.
Consider the following scenario:
Two physical CPUs, cpuM and cpuN, and two VM instances, A and B, each
with a group of vcpus.
If vcpu_A_1 takes bitlock X (HPTE_V_HVLOCK) on cpuM and is then switched
out, and vcpu_A_2 tries to take X on cpuN in real mode, then cpuN will
be stuck in real mode for a long time.
Things get even worse if the following happens:
On cpuM, bitlock X is held; on cpuN, bitlock Y is held.
vcpu_B_2 tries to take Y on cpuM in real mode.
vcpu_A_2 tries to take X on cpuN in real mode.
The result is a deadlock.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Reviewed-by: Paul Mackerras <paulus@samba.org>
CC: stable@vger.kernel.org
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-11-15 16:35:00 +08:00
|
|
|
/* When called from virtual mode, the caller must disable preemption
|
|
|
|
* (preempt_disable()) around this function; otherwise the task can be
|
|
|
|
* scheduled out while holding HPTE_V_HVLOCK and cause a deadlock.
|
|
|
|
*/
|
2011-12-12 20:36:37 +08:00
|
|
|
long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
|
|
|
|
unsigned long valid)
|
|
|
|
{
|
|
|
|
unsigned int i;
|
|
|
|
unsigned int pshift;
|
|
|
|
unsigned long somask;
|
|
|
|
unsigned long vsid, hash;
|
|
|
|
unsigned long avpn;
|
2014-06-11 16:16:06 +08:00
|
|
|
__be64 *hpte;
|
2011-12-12 20:36:37 +08:00
|
|
|
unsigned long mask, val;
|
2016-11-16 13:57:24 +08:00
|
|
|
unsigned long v, r, orig_v;
|
2011-12-12 20:36:37 +08:00
|
|
|
|
|
|
|
/* Get page shift, work out hash and AVPN etc. */
|
|
|
|
mask = SLB_VSID_B | HPTE_V_AVPN | HPTE_V_SECONDARY;
|
|
|
|
val = 0;
|
|
|
|
pshift = 12;
|
|
|
|
if (slb_v & SLB_VSID_L) {
|
|
|
|
mask |= HPTE_V_LARGE;
|
|
|
|
val |= HPTE_V_LARGE;
|
|
|
|
pshift = slb_base_page_shift[(slb_v & SLB_VSID_LP) >> 4];
|
|
|
|
}
|
|
|
|
if (slb_v & SLB_VSID_B_1T) {
|
|
|
|
somask = (1UL << 40) - 1;
|
|
|
|
vsid = (slb_v & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T;
|
|
|
|
vsid ^= vsid << 25;
|
|
|
|
} else {
|
|
|
|
somask = (1UL << 28) - 1;
|
|
|
|
vsid = (slb_v & ~SLB_VSID_B) >> SLB_VSID_SHIFT;
|
|
|
|
}
|
2016-12-20 13:49:01 +08:00
|
|
|
hash = (vsid ^ ((eaddr & somask) >> pshift)) & kvmppc_hpt_mask(&kvm->arch.hpt);
|
2011-12-12 20:36:37 +08:00
|
|
|
avpn = slb_v & ~(somask >> 16); /* also includes B */
|
|
|
|
avpn |= (eaddr & somask) >> 16;
|
|
|
|
|
|
|
|
if (pshift >= 24)
|
|
|
|
avpn &= ~((1UL << (pshift - 16)) - 1);
|
|
|
|
else
|
|
|
|
avpn &= ~0x7fUL;
|
|
|
|
val |= avpn;
|
|
|
|
|
|
|
|
for (;;) {
|
2016-12-20 13:49:00 +08:00
|
|
|
hpte = (__be64 *)(kvm->arch.hpt.virt + (hash << 7));
|
2011-12-12 20:36:37 +08:00
|
|
|
|
|
|
|
for (i = 0; i < 16; i += 2) {
|
|
|
|
/* Read the PTE racily */
|
2014-06-11 16:16:06 +08:00
|
|
|
v = be64_to_cpu(hpte[i]) & ~HPTE_V_HVLOCK;
|
2016-11-16 13:57:24 +08:00
|
|
|
if (cpu_has_feature(CPU_FTR_ARCH_300))
|
|
|
|
v = hpte_new_to_old_v(v, be64_to_cpu(hpte[i+1]));
|
2011-12-12 20:36:37 +08:00
|
|
|
|
|
|
|
/* Check valid/absent, hash, segment size and AVPN */
|
|
|
|
if (!(v & valid) || (v & mask) != val)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Lock the PTE and read it under the lock */
|
|
|
|
while (!try_lock_hpte(&hpte[i], HPTE_V_HVLOCK))
|
|
|
|
cpu_relax();
|
2016-11-16 13:57:24 +08:00
|
|
|
v = orig_v = be64_to_cpu(hpte[i]) & ~HPTE_V_HVLOCK;
|
2014-06-11 16:16:06 +08:00
|
|
|
r = be64_to_cpu(hpte[i+1]);
|
2016-11-16 13:57:24 +08:00
|
|
|
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
|
|
|
|
v = hpte_new_to_old_v(v, r);
|
|
|
|
r = hpte_new_to_old_r(r);
|
|
|
|
}
|
2011-12-12 20:36:37 +08:00
|
|
|
|
|
|
|
/*
|
2014-06-16 02:47:07 +08:00
|
|
|
* Check the HPTE again, including base page size
|
2011-12-12 20:36:37 +08:00
|
|
|
*/
|
|
|
|
if ((v & valid) && (v & mask) == val &&
|
2017-09-11 13:29:45 +08:00
|
|
|
kvmppc_hpte_base_page_shift(v, r) == pshift)
|
2011-12-12 20:36:37 +08:00
|
|
|
/* Return with the HPTE still locked */
|
|
|
|
return (hash << 3) + (i >> 1);
|
|
|
|
|
2016-11-16 13:57:24 +08:00
|
|
|
__unlock_hpte(&hpte[i], orig_v);
|
2011-12-12 20:36:37 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
if (val & HPTE_V_SECONDARY)
|
|
|
|
break;
|
|
|
|
val |= HPTE_V_SECONDARY;
|
2016-12-20 13:49:01 +08:00
|
|
|
hash = hash ^ kvmppc_hpt_mask(&kvm->arch.hpt);
|
2011-12-12 20:36:37 +08:00
|
|
|
}
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kvmppc_hv_find_lock_hpte);
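The index arithmetic in the function above can be checked in isolation. The following is a minimal sketch, not kernel code: the `demo_` names are hypothetical, and the sizes (a 128-byte group holding eight 16-byte HPTEs) are inferred from the `hash << 7`, `i += 2` and `(hash << 3) + (i >> 1)` expressions in the loop. It shows that the returned index, shifted left by 4 as the fault handler does with `index << 4`, lands on the same byte offset the search loop was looking at.

```c
#include <stdint.h>

/*
 * Sizes implied by the shifts in the search loop (assumptions drawn from
 * the code, not from a header): one group is 1 << 7 = 128 bytes and one
 * HPTE is 1 << 4 = 16 bytes, i.e. two 64-bit doublewords.
 */
enum { DEMO_HPTEG_SHIFT = 7, DEMO_HPTE_SHIFT = 4 };

/*
 * Hypothetical helper: global HPTE index from a group hash and the
 * doubleword offset i (0, 2, ..., 14) used by the search loop.
 */
static inline uint64_t demo_pte_index(uint64_t hash, unsigned int i)
{
	return (hash << 3) + (i >> 1);
}

/* Hypothetical helper: byte offset of an HPTE from the start of the table. */
static inline uint64_t demo_hpte_offset(uint64_t pte_index)
{
	return pte_index << DEMO_HPTE_SHIFT;
}

/*
 * Byte offset of the same HPTE computed the way the loop reaches it:
 * group base plus i doublewords of 8 bytes each.
 */
static inline uint64_t demo_hpte_offset_in_loop(uint64_t hash, unsigned int i)
{
	return (hash << DEMO_HPTEG_SHIFT) + (uint64_t)i * sizeof(uint64_t);
}
```

For example, group 5 with `i` equal to 6 gives index 43, and both offset helpers agree on byte offset 688.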
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Called in real mode to check whether an HPTE not found fault
|
2011-12-12 20:38:51 +08:00
|
|
|
* is due to accessing a paged-out page or an emulated MMIO page,
|
|
|
|
* or if a protection fault is due to accessing a page that the
|
|
|
|
* guest wanted read/write access to but which we made read-only.
|
2011-12-12 20:36:37 +08:00
|
|
|
* Returns a possibly modified status (DSISR) value if neither applies
|
|
|
|
* (i.e. pass the interrupt to the guest),
|
|
|
|
* -1 to pass the fault up to host kernel mode code, -2 to do that
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:38:05 +08:00
|
|
|
* and also load the instruction word (for MMIO emulation),
|
2011-12-12 20:36:37 +08:00
|
|
|
* or 0 if we should make the guest retry the access.
|
|
|
|
*/
|
|
|
|
long kvmppc_hpte_hv_fault(struct kvm_vcpu *vcpu, unsigned long addr,
|
2011-12-12 20:38:05 +08:00
|
|
|
unsigned long slb_v, unsigned int status, bool data)
|
2011-12-12 20:36:37 +08:00
|
|
|
{
|
|
|
|
struct kvm *kvm = vcpu->kvm;
|
|
|
|
long int index;
|
2016-11-16 13:57:24 +08:00
|
|
|
unsigned long v, r, gr, orig_v;
|
2014-06-11 16:16:06 +08:00
|
|
|
__be64 *hpte;
|
2011-12-12 20:36:37 +08:00
|
|
|
unsigned long valid;
|
|
|
|
struct revmap_entry *rev;
|
|
|
|
unsigned long pp, key;
|
KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries
This keeps a per-vcpu cache of recently page-faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example looking up the HPT, locking the HPTE
twice and searching the memslots for the MMIO gfn, and call
kvmppc_hv_emulate_mmio() directly.
In the current implementation, we limit the size of the cache to four
entries. We think that is enough to cover the high-frequency MMIO HPTEs
in most cases. For example, consider virtio devices. For virtio
legacy devices, one HPTE can handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough. For virtio modern devices, we always need one HPTE to handle
notifications for each device, because a modern device uses an 8M MMIO
register to notify the host instead of a Port IO register. A typical
system configuration should not exceed four virtio devices per
vcpu, so four cache entries are also enough in this case. Of course, if
needed, we could also turn the macro into a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-11-04 13:55:12 +08:00
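The cache described in this commit message can be pictured with a small stand-alone sketch. Everything named `demo_` is hypothetical. The size of four comes from the commit message, and the lookup key (address, `slb_v`, and an update counter) mirrors the arguments passed to `mmio_cache_search()` in the code. Two things are assumptions and are not confirmed by the source: that a stale update counter simply makes a lookup miss, and that replacement is round-robin.

```c
#include <stddef.h>

#define DEMO_MMIO_CACHE_SIZE 4	/* "four" per the commit message */

/* Hypothetical, simplified cache entry; the real structure has more fields. */
struct demo_mmio_entry {
	unsigned long eaddr_page;	/* effective address >> page shift */
	unsigned long slb_v;
	long mmio_update;		/* update counter when the entry was filled */
	unsigned long pte_index;
	int in_use;
};

struct demo_mmio_cache {
	struct demo_mmio_entry entry[DEMO_MMIO_CACHE_SIZE];
	unsigned int next;		/* assumed round-robin replacement cursor */
};

/* Return the matching entry, or NULL on a miss or a stale update counter. */
static struct demo_mmio_entry *
demo_mmio_cache_search(struct demo_mmio_cache *c, unsigned long eaddr_page,
		       unsigned long slb_v, long mmio_update)
{
	for (size_t i = 0; i < DEMO_MMIO_CACHE_SIZE; i++) {
		struct demo_mmio_entry *e = &c->entry[i];

		if (e->in_use && e->eaddr_page == eaddr_page &&
		    e->slb_v == slb_v && e->mmio_update == mmio_update)
			return e;
	}
	return NULL;
}

/* Fill the next slot, overwriting the oldest entry once the cache is full. */
static struct demo_mmio_entry *
demo_mmio_cache_insert(struct demo_mmio_cache *c, unsigned long eaddr_page,
		       unsigned long slb_v, long mmio_update,
		       unsigned long pte_index)
{
	struct demo_mmio_entry *e = &c->entry[c->next];

	c->next = (c->next + 1) % DEMO_MMIO_CACHE_SIZE;
	e->eaddr_page = eaddr_page;
	e->slb_v = slb_v;
	e->mmio_update = mmio_update;
	e->pte_index = pte_index;
	e->in_use = 1;
	return e;
}
```

With a counter-based key, bumping the global update counter invalidates every cached entry at once without touching any per-vcpu state, which is one plausible reason the fault handler reads the counter before searching.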
|
|
|
struct mmio_hpte_cache_entry *cache_entry = NULL;
|
|
|
|
long mmio_update = 0;
|
2011-12-12 20:36:37 +08:00
|
|
|
|
2011-12-12 20:38:51 +08:00
|
|
|
/* For protection fault, expect to find a valid HPTE */
|
|
|
|
valid = HPTE_V_VALID;
|
2016-11-04 13:55:12 +08:00
|
|
|
if (status & DSISR_NOHPTE) {
|
2011-12-12 20:38:51 +08:00
|
|
|
valid |= HPTE_V_ABSENT;
|
2016-11-04 13:55:12 +08:00
|
|
|
mmio_update = atomic64_read(&kvm->arch.mmio_update);
|
|
|
|
cache_entry = mmio_cache_search(vcpu, addr, slb_v, mmio_update);
|
2011-12-12 20:38:51 +08:00
|
|
|
}
|
2016-11-04 13:55:12 +08:00
|
|
|
if (cache_entry) {
|
|
|
|
index = cache_entry->pte_index;
|
|
|
|
v = cache_entry->hpte_v;
|
|
|
|
r = cache_entry->hpte_r;
|
|
|
|
gr = cache_entry->rpte;
|
|
|
|
} else {
|
|
|
|
index = kvmppc_hv_find_lock_hpte(kvm, addr, slb_v, valid);
|
|
|
|
if (index < 0) {
|
|
|
|
if (status & DSISR_NOHPTE)
|
|
|
|
return status; /* there really was no HPTE */
|
|
|
|
return 0; /* for prot fault, HPTE disappeared */
|
|
|
|
}
|
2016-12-20 13:49:00 +08:00
|
|
|
hpte = (__be64 *)(kvm->arch.hpt.virt + (index << 4));
|
2016-11-16 13:57:24 +08:00
|
|
|
v = orig_v = be64_to_cpu(hpte[0]) & ~HPTE_V_HVLOCK;
|
2016-11-04 13:55:12 +08:00
|
|
|
r = be64_to_cpu(hpte[1]);
|
2016-11-16 13:57:24 +08:00
|
|
|
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
|
|
|
|
v = hpte_new_to_old_v(v, r);
|
|
|
|
r = hpte_new_to_old_r(r);
|
|
|
|
}
|
2016-12-20 13:49:00 +08:00
|
|
|
rev = real_vmalloc_addr(&kvm->arch.hpt.rev[index]);
|
2016-11-04 13:55:12 +08:00
|
|
|
gr = rev->guest_rpte;
|
KVM: PPC: Implement MMIO emulation support for Book3S HV guests
This provides the low-level support for MMIO emulation in Book3S HV
guests. When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page. Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page. An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.
When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
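The valid/absent distinction described above can be sketched in a few lines of standalone C. The bit positions and all `sketch_`/`SKETCH_` names below are placeholders chosen for illustration, not the kernel's definitions of HPTE_V_VALID or HPTE_V_ABSENT.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit positions; the real values live in the kernel headers. */
#define SKETCH_HPTE_V_VALID	(1ULL << 0)
#define SKETCH_HPTE_V_ABSENT	(1ULL << 1)

/* Build the entry stored for an MMIO page: valid clear, absent set. */
static uint64_t sketch_make_absent(uint64_t v)
{
	return (v & ~SKETCH_HPTE_V_VALID) | SKETCH_HPTE_V_ABSENT;
}

/* Guest hcalls treat valid and absent entries alike: both are occupied. */
static int sketch_guest_sees_entry(uint64_t v)
{
	return (v & (SKETCH_HPTE_V_VALID | SKETCH_HPTE_V_ABSENT)) != 0;
}

/* Only entries the hardware could have used need a tlbie on removal. */
static int sketch_needs_tlbie(uint64_t v)
{
	return (v & SKETCH_HPTE_V_VALID) != 0;
}

/* The hardware ignores absent entries, so an access to one faults. */
static int sketch_access_faults(uint64_t v)
{
	return (v & SKETCH_HPTE_V_VALID) == 0;
}
```

With these helpers, an entry produced by sketch_make_absent() is visible to the guest's hcalls, faults on access, and can be removed without a TLB invalidation, which matches the three properties the commit message lists.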
2011-12-12 20:36:37 +08:00
|
|
|
|
2016-11-16 13:57:24 +08:00
|
|
|
unlock_hpte(hpte, orig_v);
|
2016-11-04 13:55:12 +08:00
|
|
|
}
|
2011-12-12 20:36:37 +08:00
|
|
|
|
2011-12-12 20:38:51 +08:00
|
|
|
/* For not found, if the HPTE is valid by now, retry the instruction */
|
|
|
|
if ((status & DSISR_NOHPTE) && (v & HPTE_V_VALID))
|
2011-12-12 20:36:37 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* Check access permissions to the page */
|
|
|
|
pp = gr & (HPTE_R_PP0 | HPTE_R_PP);
|
|
|
|
key = (vcpu->arch.shregs.msr & MSR_PR) ? SLB_VSID_KP : SLB_VSID_KS;
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
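As a rough illustration of the page-out flow described above, the following standalone sketch marks every valid entry that maps a given host page as absent, and later reinstantiates one on a fault. The table layout, the `sk_` names, the bit positions and the `using_mmu_notifiers` parameter are simplifications assumed for this example; the kernel's real path goes through kvm_unmap_hva() and its reverse-map structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit positions, not the kernel's definitions. */
#define SK_V_VALID	(1ULL << 0)
#define SK_V_ABSENT	(1ULL << 1)

/* Hypothetical, simplified guest HPTE: flag word plus host page frame. */
struct sk_hpte {
	uint64_t v;
	uint64_t pfn;
};

/*
 * Host PTE for 'pfn' was invalidated: turn each valid guest entry that
 * refers to it into an absent one. Returns the number of entries changed.
 * Does nothing when notifiers are not in use (pages stay pinned).
 */
static size_t sk_unmap_pfn(struct sk_hpte *hpt, size_t n, uint64_t pfn,
			   int using_mmu_notifiers)
{
	size_t changed = 0;

	if (!using_mmu_notifiers)
		return 0;
	for (size_t i = 0; i < n; i++) {
		if ((hpt[i].v & SK_V_VALID) && hpt[i].pfn == pfn) {
			hpt[i].v = (hpt[i].v & ~SK_V_VALID) | SK_V_ABSENT;
			changed++;
		}
	}
	return changed;
}

/* Page fault on an absent entry: point it at the new frame, make it valid. */
static void sk_reinstantiate(struct sk_hpte *h, uint64_t new_pfn)
{
	h->pfn = new_pfn;
	h->v = (h->v & ~SK_V_ABSENT) | SK_V_VALID;
}
```

The early return mirrors the statement that, when the flag is not set, the notifier callbacks do nothing and everything operates as before.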
2011-12-12 20:38:05 +08:00
|
|
|
status &= ~DSISR_NOHPTE; /* DSISR_NOHPTE == SRR1_ISI_NOPT */
|
|
|
|
if (!data) {
|
|
|
|
if (gr & (HPTE_R_N | HPTE_R_G))
|
|
|
|
return status | SRR1_ISI_N_OR_G;
|
|
|
|
if (!hpte_read_permission(pp, slb_v & key))
|
|
|
|
return status | SRR1_ISI_PROT;
|
|
|
|
} else if (status & DSISR_ISSTORE) {
|
2011-12-12 20:36:37 +08:00
|
|
|
/* check write permission */
|
|
|
|
if (!hpte_write_permission(pp, slb_v & key))
|
2011-12-12 20:38:05 +08:00
|
|
|
return status | DSISR_PROTFAULT;
|
2011-12-12 20:36:37 +08:00
|
|
|
} else {
|
|
|
|
if (!hpte_read_permission(pp, slb_v & key))
|
2011-12-12 20:38:05 +08:00
|
|
|
return status | DSISR_PROTFAULT;
|
2011-12-12 20:36:37 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Check storage key, if applicable */
|
2011-12-12 20:38:05 +08:00
|
|
|
if (data && (vcpu->arch.shregs.msr & MSR_DR)) {
|
2011-12-12 20:36:37 +08:00
|
|
|
unsigned int perm = hpte_get_skey_perm(gr, vcpu->arch.amr);
|
|
|
|
if (status & DSISR_ISSTORE)
|
|
|
|
perm >>= 1;
|
|
|
|
if (perm & 1)
|
2011-12-12 20:38:05 +08:00
|
|
|
return status | DSISR_KEYFAULT;
|
2011-12-12 20:36:37 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Save HPTE info for virtual-mode handler */
|
|
|
|
vcpu->arch.pgfault_addr = addr;
|
|
|
|
vcpu->arch.pgfault_index = index;
|
|
|
|
vcpu->arch.pgfault_hpte[0] = v;
|
|
|
|
vcpu->arch.pgfault_hpte[1] = r;
|
2016-11-04 13:55:12 +08:00
|
|
|
vcpu->arch.pgfault_cache = cache_entry;
|
2011-12-12 20:36:37 +08:00
|
|
|
|
2011-12-12 20:38:05 +08:00
|
|
|
/* Check the storage key to see if it is possibly emulated MMIO */
|
2016-11-04 13:55:12 +08:00
|
|
|
if ((r & (HPTE_R_KEY_HI | HPTE_R_KEY_LO)) ==
|
|
|
|
(HPTE_R_KEY_HI | HPTE_R_KEY_LO)) {
|
|
|
|
if (!cache_entry) {
|
|
|
|
unsigned int pshift = 12;
|
|
|
|
unsigned int pshift_index;
|
|
|
|
|
|
|
|
if (slb_v & SLB_VSID_L) {
|
|
|
|
pshift_index = ((slb_v & SLB_VSID_LP) >> 4);
|
|
|
|
pshift = slb_base_page_shift[pshift_index];
|
|
|
|
}
|
|
|
|
cache_entry = next_mmio_cache_entry(vcpu);
|
|
|
|
cache_entry->eaddr = addr;
|
|
|
|
cache_entry->slb_base_pshift = pshift;
|
|
|
|
cache_entry->pte_index = index;
|
|
|
|
cache_entry->hpte_v = v;
|
|
|
|
cache_entry->hpte_r = r;
|
|
|
|
cache_entry->rpte = gr;
|
|
|
|
cache_entry->slb_v = slb_v;
|
|
|
|
cache_entry->mmio_update = mmio_update;
|
|
|
|
}
|
|
|
|
if (data && (vcpu->arch.shregs.msr & MSR_IR))
|
|
|
|
return -2; /* MMIO emulation - load instr word */
|
|
|
|
}
|
2011-12-12 20:36:37 +08:00
|
|
|
|
|
|
|
return -1; /* send fault up to host kernel mode */
|
|
|
|
}
|