2010-04-16 06:11:37 +08:00
/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License, version 2, as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 *
 * Copyright SUSE Linux Products GmbH 2010
 *
 * Authors: Alexander Graf <agraf@suse.de>
 */

#ifndef __ASM_KVM_BOOK3S_64_H__
#define __ASM_KVM_BOOK3S_64_H__

KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix
Currently, the HPT code in HV KVM maintains a dirty bit per guest page
in the rmap array, whether or not dirty page tracking has been enabled
for the memory slot. In contrast, the radix code maintains a dirty
bit per guest page in memslot->dirty_bitmap, and only does so when
dirty page tracking has been enabled.
This changes the HPT code to maintain the dirty bits in the memslot
dirty_bitmap like radix does. This results in slightly less code
overall, and will mean that we do not lose the dirty bits when
transitioning between HPT and radix mode in future.
There is one minor change to behaviour as a result. With HPT, when
dirty tracking was enabled for a memslot, we would previously clear
all the dirty bits at that point (both in the HPT entries and in the
rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately
following would show no pages as dirty (assuming no vcpus have run
in the meantime). With this change, the dirty bits on HPT entries
are not cleared at the point where dirty tracking is enabled, so
KVM_GET_DIRTY_LOG would show as dirty any guest pages that are
resident in the HPT and dirty. This is consistent with what happens
on radix.
This also fixes a bug in the mark_pages_dirty() function for radix
(in the sense that the function no longer exists). In the case where
a large page of 64 normal pages or more is marked dirty, the
addressing of the dirty bitmap was incorrect and could write past
the end of the bitmap. Fortunately this case was never hit in
practice because a 2MB large page is only 32 x 64kB pages, and we
don't support backing the guest with 1GB huge pages at this point.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-26 13:39:19 +08:00
|
|
|
#include <linux/string.h>
|
|
|
|
#include <asm/bitops.h>
|
2016-09-02 15:20:43 +08:00
|
|
|
#include <asm/book3s/64/mmu-hash.h>
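The dirty-map commit message above mentions a bitmap addressing bug when a large page covers 64 or more normal pages. The following is a minimal sketch of marking a run of pages dirty across word boundaries. It is not the kernel's implementation: the helper names are hypothetical, the bitmap uses plain `uint64_t` words in host bit order, and the caller is assumed to have sized the map to cover the whole range.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITS_PER_WORD 64

/*
 * Hypothetical helper: mark pages [first, first + npages) dirty.
 * Each bit is addressed by its own word index and bit offset, so a
 * run that crosses a 64-bit word boundary stays inside the bitmap
 * as long as the bitmap covers first + npages bits.
 */
static void sketch_set_dirty_range(uint64_t *map, size_t first, size_t npages)
{
	for (size_t i = first; i < first + npages; i++)
		map[i / SKETCH_BITS_PER_WORD] |=
			UINT64_C(1) << (i % SKETCH_BITS_PER_WORD);
}

/* Hypothetical helper: return 1 if the given page is marked dirty. */
static int sketch_test_dirty(const uint64_t *map, size_t page)
{
	return (map[page / SKETCH_BITS_PER_WORD] >>
		(page % SKETCH_BITS_PER_WORD)) & 1;
}
```

A bit-at-a-time loop is slower than setting whole words, but it makes the indexing easy to check by eye.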
|
|
|
|
|
KVM: PPC: Book3S HV: Split HPT allocation from activation
Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT)
and sets it up as the active page table for a VM. For the upcoming HPT
resize implementation we're going to want to allocate HPTs separately from
activating them.
So, split the allocation itself out into kvmppc_allocate_hpt() and perform
the activation with a new kvmppc_set_hpt() function. Likewise we split
kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt()
which unsets it as an active HPT, then frees it.
We also move the logic to fall back to smaller HPT sizes if the first try
fails into the single caller which used that behaviour,
kvmppc_hv_setup_htab_rma(). This introduces a slight semantic change, in
that previously if the initial attempt at CMA allocation failed, we would
fall back to attempting smaller sizes with the page allocator. Now, we
try first CMA, then the page allocator at each size. As far as I can tell
this change should be harmless.
To match, we make kvmppc_free_hpt() just free the actual HPT itself. The
call to kvmppc_free_lpid() that was there, we move to the single caller.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-12-20 13:49:02 +08:00
|
|
|
/* The Power architecture requires the HPT to be at least 256kiB and at most 64TiB */
#define PPC_MIN_HPT_ORDER 18
#define PPC_MAX_HPT_ORDER 46
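The order values are the log base 2 of the HPT size in bytes, so the limits work out to 1 << 18 = 256 KiB and 1 << 46 = 64 TiB. A minimal sketch of that arithmetic follows; `hpt_order_valid()` and `hpt_size_bytes()` are hypothetical helper names used only for illustration, not functions from this header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the definitions above. */
#define PPC_MIN_HPT_ORDER 18
#define PPC_MAX_HPT_ORDER 46

/* Hypothetical helper: true if 'order' lies within the architected range. */
static bool hpt_order_valid(unsigned int order)
{
	return order >= PPC_MIN_HPT_ORDER && order <= PPC_MAX_HPT_ORDER;
}

/* Hypothetical helper: HPT size in bytes for a valid order, 0 otherwise. */
static uint64_t hpt_size_bytes(unsigned int order)
{
	return hpt_order_valid(order) ? (UINT64_C(1) << order) : 0;
}
```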
|
|
|
|
|
2013-10-08 00:47:51 +08:00
|
|
|
#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
|
2011-12-09 21:44:13 +08:00
|
|
|
static inline struct kvmppc_book3s_shadow_vcpu *svcpu_get(struct kvm_vcpu *vcpu)
|
2010-04-16 06:11:37 +08:00
|
|
|
{
|
2011-12-09 21:44:13 +08:00
|
|
|
preempt_disable();
|
2010-04-16 06:11:37 +08:00
|
|
|
return &get_paca()->shadow_vcpu;
|
|
|
|
}
|
2011-12-09 21:44:13 +08:00
|
|
|
|
|
|
|
static inline void svcpu_put(struct kvmppc_book3s_shadow_vcpu *svcpu)
{
	preempt_enable();
}
|
KVM: PPC: Add support for Book3S processors in hypervisor mode
This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.
This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.
Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.
This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.
With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.
We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.
In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.
The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.
This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.
This also adds a few new exports needed by the book3s_hv code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-06-29 08:21:34 +08:00
|
|
|
#endif
|
2010-04-16 06:11:37 +08:00
|
|
|
|
2013-10-08 00:47:52 +08:00
|
|
|
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
|
2017-01-30 18:21:44 +08:00
|
|
|
|
|
|
|
static inline bool kvm_is_radix(struct kvm *kvm)
{
	return kvm->arch.radix;
}
|
|
|
|
|
KVM: PPC: Book3S HV: Make the guest hash table size configurable
This adds a new ioctl to enable userspace to control the size of the guest
hashed page table (HPT) and to clear it out when resetting the guest.
The KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter
a pointer to a u32 containing the desired order of the HPT (log base 2
of the size in bytes), which is updated on successful return to the
actual order of the HPT which was allocated.
There must be no vcpus running at the time of this ioctl. To enforce
this, we now keep a count of the number of vcpus running in
kvm->arch.vcpus_running.
If the ioctl is called when a HPT has already been allocated, we don't
reallocate the HPT but just clear it out. We first clear the
kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold
the kvm->lock mutex, it will prevent any vcpus from starting to run until
we're done, and (b) it means that the first vcpu to run after we're done
will re-establish the VRMA if necessary.
If userspace doesn't call this ioctl before running the first vcpu, the
kernel will allocate a default-sized HPT at that point. We do it then
rather than when creating the VM, as the code did previously, so that
userspace has a chance to do the ioctl if it wants.
When allocating the HPT, we can allocate either from the kernel page
allocator, or from the preallocated pool. If userspace is asking for
a different size from the preallocated HPTs, we first try to allocate
using the kernel page allocator. Then we try to allocate from the
preallocated pool, and then if that fails, we try allocating decreasing
sizes from the kernel page allocator, down to the minimum size allowed
(256kB). Note that the kernel page allocator limits allocations to
1 << CONFIG_FORCE_MAX_ZONEORDER pages, which by default corresponds to
16MB (on 64-bit powerpc, at least).
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix module compilation]
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-05-04 10:32:53 +08:00
|
|
|
#define KVM_DEFAULT_HPT_ORDER 24 /* 16MB HPT by default */
|
2011-12-12 20:27:39 +08:00
|
|
|
#endif
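The configurable-HPT commit message above describes falling back to decreasing sizes, down to the minimum, when a larger allocation fails. A minimal sketch of just that last step is shown here. It does not model the preallocated pool or CMA, the allocator callback and all `sketch_`/`fake_` names are hypothetical, and the minimum order of 18 is taken from the 256kB figure in the text.

```c
/* Assumption: minimum HPT order, matching the 256kB minimum in the text. */
#define SKETCH_MIN_HPT_ORDER 18

/*
 * Hypothetical fallback loop: try the requested order first, then each
 * smaller order down to the minimum. 'try_alloc' returns nonzero on
 * success. Returns the order that was allocated, or -1 if none worked.
 */
static int sketch_alloc_hpt(unsigned int order,
			    int (*try_alloc)(unsigned int order))
{
	while (order >= SKETCH_MIN_HPT_ORDER) {
		if (try_alloc(order))
			return (int)order;
		order--;
	}
	return -1;
}

/* Fake allocators for demonstration only. */
static int fake_alloc_max20(unsigned int order)
{
	return order <= 20;
}

static int fake_alloc_never(unsigned int order)
{
	(void)order;
	return 0;
}
```

Returning the order actually obtained mirrors the ioctl description above, where the order is updated on successful return to the size that was really allocated.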
|
|
|
|
|
2011-12-12 20:30:16 +08:00
|
|
|
/*
 * We use a lock bit in HPTE dword 0 to synchronize updates and
 * accesses to each HPTE, and another bit to indicate non-present
 * HPTEs.
 */
#define HPTE_V_HVLOCK 0x40UL
|
KVM: PPC: Implement MMIO emulation support for Book3S HV guests
This provides the low-level support for MMIO emulation in Book3S HV
guests. When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page. Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page. An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.
When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:36:37 +08:00
|
|
|
#define HPTE_V_ABSENT 0x20UL
|
2011-12-12 20:30:16 +08:00
|
|
|
|
2012-11-20 06:52:49 +08:00
|
|
|
/*
 * We use this bit in the guest_rpte field of the revmap entry
 * to indicate a modified HPTE.
 */
#define HPTE_GR_MODIFIED (1ul << 62)

/* These bits are reserved in the guest view of the HPTE */
#define HPTE_GR_RESERVED HPTE_GR_MODIFIED
|
|
|
|
|
2014-06-11 16:16:06 +08:00
|
|
|
static inline long try_lock_hpte(__be64 *hpte, unsigned long bits)
|
2011-12-12 20:30:16 +08:00
|
|
|
{
|
|
|
|
unsigned long tmp, old;
|
2014-06-11 16:16:06 +08:00
|
|
|
__be64 be_lockbit, be_bits;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We load/store in native endian, but the HTAB is in big endian. If
|
|
|
|
* we byte swap all data we apply on the PTE we're implicitly correct
|
|
|
|
* again.
|
|
|
|
*/
|
|
|
|
be_lockbit = cpu_to_be64(HPTE_V_HVLOCK);
|
|
|
|
be_bits = cpu_to_be64(bits);
|
2011-12-12 20:30:16 +08:00
|
|
|
|
|
|
|
asm volatile(" ldarx %0,0,%2\n"
|
|
|
|
" and. %1,%0,%3\n"
|
|
|
|
" bne 2f\n"
|
2014-06-11 16:16:06 +08:00
|
|
|
" or %0,%0,%4\n"
|
2011-12-12 20:30:16 +08:00
|
|
|
" stdcx. %0,0,%2\n"
|
|
|
|
" beq+ 2f\n"
|
2012-10-15 09:20:50 +08:00
|
|
|
" mr %1,%3\n"
|
2011-12-12 20:30:16 +08:00
|
|
|
"2: isync"
|
|
|
|
: "=&r" (tmp), "=&r" (old)
|
2014-06-11 16:16:06 +08:00
|
|
|
: "r" (hpte), "r" (be_bits), "r" (be_lockbit)
|
2011-12-12 20:30:16 +08:00
|
|
|
: "cc", "memory");
|
|
|
|
return old == 0;
|
|
|
|
}
|
|
|
|
|
2015-03-20 17:39:43 +08:00
|
|
|
static inline void unlock_hpte(__be64 *hpte, unsigned long hpte_v)
{
	hpte_v &= ~HPTE_V_HVLOCK;
	asm volatile(PPC_RELEASE_BARRIER "" : : : "memory");
	hpte[0] = cpu_to_be64(hpte_v);
}

/* Without barrier */
static inline void __unlock_hpte(__be64 *hpte, unsigned long hpte_v)
{
	hpte_v &= ~HPTE_V_HVLOCK;
	hpte[0] = cpu_to_be64(hpte_v);
}
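The lock and unlock helpers above follow a try-lock pattern: fail if any of the requested bits is already set, otherwise set the lock bit atomically, and release by clearing the bit after a release barrier. A rough portable analogue using C11 atomics is sketched below. It is an approximation under stated assumptions rather than a translation of the PowerPC code: the word is kept in host byte order (no `cpu_to_be64()`), `SKETCH_HVLOCK` stands in for `HPTE_V_HVLOCK`, and the `sketch_` function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: same bit position as HPTE_V_HVLOCK above. */
#define SKETCH_HVLOCK UINT64_C(0x40)

/*
 * Try once to take the lock. Fails if any bit in 'bits' is already set
 * or if the word changed between the load and the update.
 */
static bool sketch_try_lock(_Atomic uint64_t *word, uint64_t bits)
{
	uint64_t old = atomic_load_explicit(word, memory_order_relaxed);

	if (old & bits)
		return false;
	/* Acquire ordering on success keeps later accesses after the lock. */
	return atomic_compare_exchange_strong_explicit(
		word, &old, old | SKETCH_HVLOCK,
		memory_order_acquire, memory_order_relaxed);
}

/* Store the new value with the lock bit cleared, with release ordering. */
static void sketch_unlock(_Atomic uint64_t *word, uint64_t v)
{
	atomic_store_explicit(word, v & ~SKETCH_HVLOCK, memory_order_release);
}
```

As with the original, a failed attempt simply reports failure, so callers that need the lock are expected to retry in a loop.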
|
|
|
|
|
2017-09-11 13:29:45 +08:00
|
|
|
/*
|
|
|
|
* These functions encode knowledge of the POWER7/8/9 hardware
|
|
|
|
* interpretations of the HPTE LP (large page size) field.
|
|
|
|
*/
|
|
|
|
static inline int kvmppc_hpte_page_shifts(unsigned long h, unsigned long l)
|
|
|
|
{
|
|
|
|
unsigned int lphi;
|
|
|
|
|
|
|
|
if (!(h & HPTE_V_LARGE))
|
|
|
|
return 12; /* 4kB */
|
|
|
|
lphi = (l >> 16) & 0xf;
|
|
|
|
switch ((l >> 12) & 0xf) {
|
|
|
|
case 0:
|
2017-11-10 13:40:24 +08:00
|
|
|
return !lphi ? 24 : 0; /* 16MB */
|
2017-09-11 13:29:45 +08:00
|
|
|
break;
|
|
|
|
case 1:
|
|
|
|
return 16; /* 64kB */
|
|
|
|
break;
|
|
|
|
case 3:
|
2017-11-10 13:40:24 +08:00
|
|
|
return !lphi ? 34 : 0; /* 16GB */
|
2017-09-11 13:29:45 +08:00
|
|
|
break;
|
|
|
|
case 7:
|
|
|
|
return (16 << 8) + 12; /* 64kB in 4kB */
|
|
|
|
break;
|
|
|
|
case 8:
|
|
|
|
if (!lphi)
|
|
|
|
return (24 << 8) + 16; /* 16MB in 64kB */
|
|
|
|
if (lphi == 3)
|
|
|
|
return (24 << 8) + 12; /* 16MB in 4kB */
|
|
|
|
break;
|
|
|
|
}
|
2017-11-10 13:40:24 +08:00
|
|
|
return 0;
|
2017-09-11 13:29:45 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline int kvmppc_hpte_base_page_shift(unsigned long h, unsigned long l)
{
	return kvmppc_hpte_page_shifts(h, l) & 0xff;
}

static inline int kvmppc_hpte_actual_page_shift(unsigned long h, unsigned long l)
{
	int tmp = kvmppc_hpte_page_shifts(h, l);

	if (tmp >= 0x100)
		tmp >>= 8;
	return tmp;
}
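The two accessors above rely on a packing convention in the value returned by kvmppc_hpte_page_shifts(): when base and actual page sizes differ, the actual shift sits in bits 8 and up and the base shift in the low byte; otherwise the value is a single shift that serves as both. A standalone sketch of that convention follows, using hypothetical `sketch_` names so it can be built outside the kernel.

```c
/* Pack a mixed page size as (actual << 8) + base, as the decoder does. */
static int sketch_pack_shifts(int actual_shift, int base_shift)
{
	return (actual_shift << 8) + base_shift;
}

/* The base shift is always in the low byte. */
static int sketch_base_shift(int packed)
{
	return packed & 0xff;
}

/* The actual shift is in the upper bits only when the value is packed. */
static int sketch_actual_shift(int packed)
{
	return packed >= 0x100 ? packed >> 8 : packed;
}
```

For example, a 16MB page in a 64kB segment packs to `(24 << 8) + 16`, which is `0x1810`.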
|
|
|
|
|
|
|
|
static inline unsigned long kvmppc_actual_pgsz(unsigned long v, unsigned long r)
|
|
|
|
{
|
2017-11-10 13:40:24 +08:00
|
|
|
int shift = kvmppc_hpte_actual_page_shift(v, r);
|
|
|
|
|
|
|
|
if (shift)
|
|
|
|
return 1ul << shift;
|
|
|
|
return 0;
|
2017-09-11 13:29:45 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline int kvmppc_pgsize_lp_encoding(int base_shift, int actual_shift)
{
	switch (base_shift) {
	case 12:
		switch (actual_shift) {
		case 12:
			return 0;
		case 16:
			return 7;
		case 24:
			return 0x38;
		}
		break;
	case 16:
		switch (actual_shift) {
		case 16:
			return 1;
		case 24:
			return 8;
		}
		break;
	case 24:
		return 0;
	}
	return -1;
}
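The encoder above and the decoder kvmppc_hpte_page_shifts() appear to be inverses for the supported large-page combinations. The sketch below checks that round trip in a standalone program. It copies both functions under `sketch_` names and makes two assumptions that are not stated in this header: that `HPTE_V_LARGE` is the bit `0x4`, and that the LP encoding is placed at bit 12 of the second doubleword, which is where the decoder reads it.

```c
/* Assumption: stand-in for HPTE_V_LARGE, which is defined elsewhere. */
#define SKETCH_V_LARGE 0x4UL

/* Copy of the decode logic above, renamed for a standalone build. */
static int sketch_page_shifts(unsigned long h, unsigned long l)
{
	unsigned int lphi;

	if (!(h & SKETCH_V_LARGE))
		return 12;	/* 4kB */
	lphi = (l >> 16) & 0xf;
	switch ((l >> 12) & 0xf) {
	case 0:
		return !lphi ? 24 : 0;		/* 16MB */
	case 1:
		return 16;			/* 64kB */
	case 3:
		return !lphi ? 34 : 0;		/* 16GB */
	case 7:
		return (16 << 8) + 12;		/* 64kB in 4kB */
	case 8:
		if (!lphi)
			return (24 << 8) + 16;	/* 16MB in 64kB */
		if (lphi == 3)
			return (24 << 8) + 12;	/* 16MB in 4kB */
		break;
	}
	return 0;
}

/* Copy of the encode logic above, renamed for a standalone build. */
static int sketch_lp_encoding(int base_shift, int actual_shift)
{
	switch (base_shift) {
	case 12:
		switch (actual_shift) {
		case 12: return 0;
		case 16: return 7;
		case 24: return 0x38;
		}
		break;
	case 16:
		switch (actual_shift) {
		case 16: return 1;
		case 24: return 8;
		}
		break;
	case 24:
		return 0;
	}
	return -1;
}

/* Hypothetical check: encode a large-page size pair, decode, compare. */
static int sketch_roundtrip_ok(int base, int actual)
{
	int lp = sketch_lp_encoding(base, actual);
	int want = (actual == base) ? base : (actual << 8) + base;

	if (lp < 0)
		return 0;
	return sketch_page_shifts(SKETCH_V_LARGE, (unsigned long)lp << 12) == want;
}
```

The plain 4kB case (base 12, actual 12) is deliberately left out of the round trip: it encodes to 0 but is decoded with the large-page bit clear, so it never goes through the LP field.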
|
|
|
|
|
2011-11-08 15:08:52 +08:00
|
|
|
static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
|
|
|
|
unsigned long pte_index)
|
|
|
|
{
|
2017-09-11 13:29:45 +08:00
|
|
|
int a_pgshift, b_pgshift;
|
KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
On recent IBM Power CPUs, while the hashed page table is looked up using
the page size from the segmentation hardware (i.e. the SLB), it is
possible to have the HPT entry indicate a larger page size. Thus for
example it is possible to put a 16MB page in a 64kB segment, but since
the hash lookup is done using a 64kB page size, it may be necessary to
put multiple entries in the HPT for a single 16MB page. This
capability is called mixed page-size segment (MPSS). With MPSS,
there are two relevant page sizes: the base page size, which is the
size used in searching the HPT, and the actual page size, which is the
size indicated in the HPT entry. [ Note that the actual page size is
always >= base page size ].
We use "ibm,segment-page-sizes" device tree node to advertise
the MPSS support to PAPR guest. The penc encoding indicates whether
we support a specific combination of base page size and actual
page size in the same segment. We also use the penc value in the
LP encoding of HPTE entry.
This patch exposes MPSS support to KVM guest by advertising the
feature via "ibm,segment-page-sizes". It also adds the necessary changes
to decode the base page size and the actual page size correctly from the
HPTE entry.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-05-07 02:01:36 +08:00
|
|
|
unsigned long rb = 0, va_low, sllp;
|
|
|
|
|
2017-09-11 13:29:45 +08:00
|
|
|
b_pgshift = a_pgshift = kvmppc_hpte_page_shifts(v, r);
|
|
|
|
if (a_pgshift >= 0x100) {
|
|
|
|
b_pgshift &= 0xff;
|
|
|
|
a_pgshift >>= 8;
|
2014-05-07 02:01:36 +08:00
|
|
|
}
|
2016-09-02 15:20:43 +08:00
|
|
|
|
2014-05-07 02:01:36 +08:00
|
|
|
/*
 * Ignore the top 14 bits of va.
 * v has its top two bits covering the segment size, hence shift
 * by 16 bits. Also clear the lower HPTE_V_AVPN_SHIFT (7) bits.
 * The AVA field in v also has the lower 23 bits ignored.
 * For base page size 4K we need 14 .. 65 bits (so we need to
 * collect an extra 11 bits).
 * For others we need 14..14+i.
 */
/* This covers bits 14..54 of va */
|
2011-11-08 15:08:52 +08:00
|
|
|
rb = (v & ~0x7fUL) << 16; /* AVA field */
|
2014-06-29 19:17:30 +08:00
|
|
|
|
2014-05-07 02:01:36 +08:00
|
|
|
/*
 * The AVA field in v has its lower 23 bits cleared. We need to
 * derive them from the PTEG index.
 */
|
2011-11-08 15:08:52 +08:00
|
|
|
va_low = pte_index >> 3;
|
|
|
|
if (v & HPTE_V_SECONDARY)
|
|
|
|
va_low = ~va_low;
|
2014-05-07 02:01:36 +08:00
|
|
|
/*
 * Get the vpn bits from va_low by reversing the hash.
 * In v we have va with 23 bits dropped and then left shifted by
 * HPTE_V_AVPN_SHIFT (7) bits. To find the vsid we need to
 * right shift it by (SID_SHIFT - (23 - 7)).
 */
|
2011-11-08 15:08:52 +08:00
|
|
|
if (!(v & HPTE_V_1TB_SEG))
|
2014-05-07 02:01:36 +08:00
|
|
|
va_low ^= v >> (SID_SHIFT - 16);
|
2011-11-08 15:08:52 +08:00
|
|
|
else
|
2014-05-07 02:01:36 +08:00
|
|
|
va_low ^= v >> (SID_SHIFT_1T - 16);
|
2011-11-08 15:08:52 +08:00
|
|
|
va_low &= 0x7ff;
|
2014-05-07 02:01:36 +08:00
|
|
|
|
2017-11-10 13:40:24 +08:00
|
|
|
if (b_pgshift <= 12) {
|
2017-09-11 13:29:45 +08:00
|
|
|
if (a_pgshift > 12) {
|
|
|
|
sllp = (a_pgshift == 16) ? 5 : 4;
|
|
|
|
rb |= sllp << 5; /* AP field */
|
|
|
|
}
|
2014-05-07 02:01:36 +08:00
|
|
|
rb |= (va_low & 0x7ff) << 12; /* remaining 11 bits of AVA */
|
2017-09-11 13:29:45 +08:00
|
|
|
} else {
|
2014-05-07 02:01:36 +08:00
|
|
|
int aval_shift;
|
|
|
|
/*
|
2014-06-29 19:17:30 +08:00
|
|
|
* remaining bits of AVA/LP fields
|
2014-05-07 02:01:36 +08:00
|
|
|
* Also contains the rr bits of LP
|
|
|
|
*/
|
2017-09-11 13:29:45 +08:00
|
|
|
rb |= (va_low << b_pgshift) & 0x7ff000;
|
2014-05-07 02:01:36 +08:00
|
|
|
/*
|
|
|
|
* Now clear the LP bits that are not needed, based on the actual page size
|
|
|
|
*/
|
2017-09-11 13:29:45 +08:00
|
|
|
rb &= ~((1ul << a_pgshift) - 1);
|
2014-05-07 02:01:36 +08:00
|
|
|
/*
|
|
|
|
* AVAL field 58..77 - base_page_shift bits of va
|
|
|
|
* we have space for bits 58..64; missing bits should
|
|
|
|
* be zero filled. +1 is to take care of L bit shift
|
|
|
|
*/
|
2017-09-11 13:29:45 +08:00
|
|
|
aval_shift = 64 - (77 - b_pgshift) + 1;
|
2014-05-07 02:01:36 +08:00
|
|
|
rb |= ((va_low << aval_shift) & 0xfe);
|
|
|
|
|
|
|
|
rb |= 1; /* L field */
|
2017-09-11 13:29:45 +08:00
|
|
|
rb |= r & 0xff000 & ((1ul << a_pgshift) - 1); /* LP field */
|
2011-11-08 15:08:52 +08:00
|
|
|
}
|
2016-09-16 15:25:50 +08:00
|
|
|
rb |= (v >> HPTE_V_SSIZE_SHIFT) << 8; /* B field */
|
2011-11-08 15:08:52 +08:00
|
|
|
return rb;
|
|
|
|
}
|
|
|
|
|
2011-12-12 20:33:07 +08:00
|
|
|
static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
|
|
|
|
{
|
|
|
|
return ((ptel & HPTE_R_RPN) & ~(psize - 1)) >> PAGE_SHIFT;
|
|
|
|
}
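The masking in hpte_rpn() can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel code itself: the `SK_`-prefixed names are hypothetical, the RPN mask value is an assumption about `HPTE_R_RPN`, and a 4kB host page size is assumed for `PAGE_SHIFT`.

```c
#include <stdint.h>

/* Assumed values, for illustration only; check the kernel headers. */
#define SK_HPTE_R_RPN 0x0ffffffffffff000ULL /* assumed RPN field mask */
#define SK_PAGE_SHIFT 12                    /* assumed 4kB host pages */

/*
 * Same arithmetic as hpte_rpn(): keep the RPN field, clear the bits
 * below the page size (psize must be a power of two), then shift down
 * to a host page frame number.
 */
static inline uint64_t sk_hpte_rpn(uint64_t ptel, uint64_t psize)
{
	return ((ptel & SK_HPTE_R_RPN) & ~(psize - 1)) >> SK_PAGE_SHIFT;
}
```

With these assumptions, a second doubleword of 0x12345196 yields frame 0x12345 for a 4kB page and 0x12340 for a 64kB page, because the larger page size clears four more low-order bits before the shift.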
|
|
|
|
|
2011-12-12 20:38:51 +08:00
|
|
|
static inline int hpte_is_writable(unsigned long ptel)
|
|
|
|
{
|
|
|
|
unsigned long pp = ptel & (HPTE_R_PP0 | HPTE_R_PP);
|
|
|
|
|
|
|
|
return pp != PP_RXRX && pp != PP_RXXX;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long hpte_make_readonly(unsigned long ptel)
|
|
|
|
{
|
|
|
|
if ((ptel & HPTE_R_PP0) || (ptel & HPTE_R_PP) == PP_RWXX)
|
|
|
|
ptel = (ptel & ~HPTE_R_PP) | PP_RXXX;
|
|
|
|
else
|
|
|
|
ptel |= PP_RXRX;
|
|
|
|
return ptel;
|
|
|
|
}
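The relationship between hpte_is_writable() and hpte_make_readonly() can be checked with a small user-space sketch. All `SK_` names are hypothetical, and the bit values are assumptions about the PP encodings (`PP0` as the top bit, `PP` as the low two bits); they should be verified against the kernel headers before being relied on.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encodings, for illustration only. */
#define SK_HPTE_R_PP0 0x8000000000000000ULL
#define SK_HPTE_R_PP  0x0000000000000003ULL
#define SK_PP_RWXX    0ULL                    /* supervisor RW, user none */
#define SK_PP_RWRX    1ULL                    /* supervisor RW, user R */
#define SK_PP_RWRW    2ULL                    /* supervisor RW, user RW */
#define SK_PP_RXRX    3ULL                    /* supervisor R, user R */
#define SK_PP_RXXX    (SK_HPTE_R_PP0 | 2ULL)  /* supervisor R, user none */

/* An HPTE is writable unless its PP bits select a read-only encoding. */
static inline bool sk_hpte_is_writable(uint64_t ptel)
{
	uint64_t pp = ptel & (SK_HPTE_R_PP0 | SK_HPTE_R_PP);

	return pp != SK_PP_RXRX && pp != SK_PP_RXXX;
}

/* Downgrade to read-only while keeping user access as it was. */
static inline uint64_t sk_hpte_make_readonly(uint64_t ptel)
{
	if ((ptel & SK_HPTE_R_PP0) || (ptel & SK_HPTE_R_PP) == SK_PP_RWXX)
		ptel = (ptel & ~SK_HPTE_R_PP) | SK_PP_RXXX;
	else
		ptel |= SK_PP_RXRX;
	return ptel;
}
```

Under these assumptions every writable encoding maps to a read-only one: RWXX becomes RXXX (user still has no access), while RWRX and RWRW both become RXRX.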
|
|
|
|
|
2016-04-29 21:25:38 +08:00
|
|
|
static inline bool hpte_cache_flags_ok(unsigned long hptel, bool is_ci)
|
2011-12-12 20:32:27 +08:00
|
|
|
{
|
2016-04-29 21:25:38 +08:00
|
|
|
unsigned int wimg = hptel & HPTE_R_WIMG;
|
2011-12-12 20:32:27 +08:00
|
|
|
|
|
|
|
/* Handle SAO */
|
|
|
|
if (wimg == (HPTE_R_W | HPTE_R_I | HPTE_R_M) &&
|
|
|
|
cpu_has_feature(CPU_FTR_ARCH_206))
|
|
|
|
wimg = HPTE_R_M;
|
|
|
|
|
2016-04-29 21:25:38 +08:00
|
|
|
if (!is_ci)
|
2011-12-12 20:32:27 +08:00
|
|
|
return wimg == HPTE_R_M;
|
2016-04-29 21:25:38 +08:00
|
|
|
/*
|
|
|
|
* If the host mapping is cache inhibited, make sure the HPTE is also
|
|
|
|
* cache inhibited.
|
|
|
|
*/
|
|
|
|
if (wimg & HPTE_R_W) /* FIXME: is this OK for all guests? */
|
|
|
|
return false;
|
|
|
|
return !!(wimg & HPTE_R_I);
|
2011-12-12 20:32:27 +08:00
|
|
|
}
|
|
|
|
|
KVM: PPC: Implement MMU notifiers for Book3S HV guests
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:38:05 +08:00
|
|
|
/*
|
2013-06-20 17:00:19 +08:00
|
|
|
* If it's present and writable, atomically set dirty and referenced bits and
|
2015-03-30 13:11:04 +08:00
|
|
|
* return the PTE, otherwise return 0.
|
2011-12-12 20:38:05 +08:00
|
|
|
*/
|
2015-03-30 13:11:04 +08:00
|
|
|
static inline pte_t kvmppc_read_update_linux_pte(pte_t *ptep, int writing)
|
2011-12-12 20:38:05 +08:00
|
|
|
{
|
2013-06-20 17:00:19 +08:00
|
|
|
pte_t old_pte, new_pte = __pte(0);
|
|
|
|
|
|
|
|
while (1) {
|
2015-03-30 13:09:12 +08:00
|
|
|
/*
|
|
|
|
* Make sure we don't reload from ptep
|
|
|
|
*/
|
|
|
|
old_pte = READ_ONCE(*ptep);
|
2013-06-20 17:00:19 +08:00
|
|
|
/*
|
2016-04-29 21:25:45 +08:00
|
|
|
* Wait until H_PAGE_BUSY is clear, then update the PTE atomically
|
2013-06-20 17:00:19 +08:00
|
|
|
*/
|
2016-04-29 21:25:45 +08:00
|
|
|
if (unlikely(pte_val(old_pte) & H_PAGE_BUSY)) {
|
2013-06-20 17:00:19 +08:00
|
|
|
cpu_relax();
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
/* If the PTE is not present, return 0 */
|
2015-03-25 17:11:57 +08:00
|
|
|
if (unlikely(!(pte_val(old_pte) & _PAGE_PRESENT)))
|
2013-06-20 17:00:19 +08:00
|
|
|
return __pte(0);
|
2011-12-12 20:38:05 +08:00
|
|
|
|
2013-06-20 17:00:19 +08:00
|
|
|
new_pte = pte_mkyoung(old_pte);
|
|
|
|
if (writing && pte_write(old_pte))
|
|
|
|
new_pte = pte_mkdirty(new_pte);
|
2011-12-12 20:38:05 +08:00
|
|
|
|
2016-04-29 21:25:27 +08:00
|
|
|
if (pte_xchg(ptep, old_pte, new_pte))
|
2013-06-20 17:00:19 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
return new_pte;
|
2011-12-12 20:38:05 +08:00
|
|
|
}
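The retry loop in kvmppc_read_update_linux_pte() follows a common read, modify, compare-and-exchange pattern. Below is a user-space sketch of that pattern using C11 atomics in place of `READ_ONCE()` and `pte_xchg()`. The `SK_` flag values are invented for illustration and do not correspond to the real PTE bit layout.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag bits; the real PTE layout differs. */
#define SK_PRESENT  0x01ULL
#define SK_BUSY     0x02ULL
#define SK_WRITE    0x04ULL
#define SK_ACCESSED 0x08ULL
#define SK_DIRTY    0x10ULL

/*
 * Set the accessed bit (and the dirty bit for a write to a writable
 * entry) atomically. Returns the updated entry, or 0 if not present.
 */
static inline uint64_t sk_read_update_pte(_Atomic uint64_t *ptep, int writing)
{
	uint64_t old_pte, new_pte;

	for (;;) {
		/* Take one snapshot and work only from it. */
		old_pte = atomic_load_explicit(ptep, memory_order_relaxed);

		/* Someone else is updating the entry: try again. */
		if (old_pte & SK_BUSY)
			continue;

		if (!(old_pte & SK_PRESENT))
			return 0;

		new_pte = old_pte | SK_ACCESSED;
		if (writing && (old_pte & SK_WRITE))
			new_pte |= SK_DIRTY;

		/* Publish only if the entry is unchanged since the snapshot. */
		if (atomic_compare_exchange_strong(ptep, &old_pte, new_pte))
			return new_pte;
	}
}
```

The important property is that the decision (present, writable) and the update are both based on the same snapshot, so a concurrent change to the entry forces another pass instead of being silently overwritten.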
|
|
|
|
|
KVM: PPC: Implement MMIO emulation support for Book3S HV guests
This provides the low-level support for MMIO emulation in Book3S HV
guests. When the guest tries to map a page which is not covered by
any memslot, that page is taken to be an MMIO emulation page. Instead
of inserting a valid HPTE, we insert an HPTE that has the valid bit
clear but another hypervisor software-use bit set, which we call
HPTE_V_ABSENT, to indicate that this is an absent page. An
absent page is treated much like a valid page as far as guest hcalls
(H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
an absent HPTE doesn't need to be invalidated with tlbie since it
was never valid as far as the hardware is concerned.
When the guest accesses a page for which there is an absent HPTE, it
will take a hypervisor data storage interrupt (HDSI) since we now set
the VPM1 bit in the LPCR. Our HDSI handler for HPTE-not-present faults
looks up the hash table and if it finds an absent HPTE mapping the
requested virtual address, will switch to kernel mode and handle the
fault in kvmppc_book3s_hv_page_fault(), which at present just calls
kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
This is based on an earlier patch by Benjamin Herrenschmidt, but since
heavily reworked.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-12 20:36:37 +08:00
|
|
|
static inline bool hpte_read_permission(unsigned long pp, unsigned long key)
|
|
|
|
{
|
|
|
|
if (key)
|
|
|
|
return PP_RWRX <= pp && pp <= PP_RXRX;
|
2015-03-31 07:46:04 +08:00
|
|
|
return true;
|
2011-12-12 20:36:37 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool hpte_write_permission(unsigned long pp, unsigned long key)
|
|
|
|
{
|
|
|
|
if (key)
|
|
|
|
return pp == PP_RWRW;
|
|
|
|
return pp <= PP_RWRW;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int hpte_get_skey_perm(unsigned long hpte_r, unsigned long amr)
|
|
|
|
{
|
|
|
|
unsigned long skey;
|
|
|
|
|
|
|
|
skey = ((hpte_r & HPTE_R_KEY_HI) >> 57) |
|
|
|
|
((hpte_r & HPTE_R_KEY_LO) >> 9);
|
|
|
|
return (amr >> (62 - 2 * skey)) & 3;
|
|
|
|
}
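The key extraction in hpte_get_skey_perm() can be sketched in user space as follows. The `SK_` names are hypothetical and the two mask values are assumptions about `HPTE_R_KEY_HI` and `HPTE_R_KEY_LO`; with those masks the key is a 5-bit value, and the AMR is treated as 32 two-bit fields with key 0 in the most significant pair.

```c
#include <stdint.h>

/* Assumed field masks, for illustration only. */
#define SK_HPTE_R_KEY_HI 0x3000000000000000ULL /* upper 2 bits of the key */
#define SK_HPTE_R_KEY_LO 0x0000000000000e00ULL /* lower 3 bits of the key */

/* Return the 2-bit AMR field selected by the HPTE's storage key. */
static inline int sk_hpte_get_skey_perm(uint64_t hpte_r, uint64_t amr)
{
	uint64_t skey = ((hpte_r & SK_HPTE_R_KEY_HI) >> 57) |
			((hpte_r & SK_HPTE_R_KEY_LO) >> 9);

	/* Key 0 lives in bits 63:62, key 31 in bits 1:0. */
	return (int)((amr >> (62 - 2 * skey)) & 3);
}
```

For example, a key of 3 selects the field at bit offset 56, and a key of 31 (both key fields all ones) selects the lowest two bits of the AMR.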
|
|
|
|
|
2011-12-12 20:33:07 +08:00
|
|
|
static inline void lock_rmap(unsigned long *rmap)
|
|
|
|
{
|
|
|
|
do {
|
|
|
|
while (test_bit(KVMPPC_RMAP_LOCK_BIT, rmap))
|
|
|
|
cpu_relax();
|
|
|
|
} while (test_and_set_bit_lock(KVMPPC_RMAP_LOCK_BIT, rmap));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void unlock_rmap(unsigned long *rmap)
|
|
|
|
{
|
|
|
|
__clear_bit_unlock(KVMPPC_RMAP_LOCK_BIT, rmap);
|
|
|
|
}
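lock_rmap() and unlock_rmap() implement a bit spinlock stored in the rmap word itself, using a test-and-test-and-set loop. A rough user-space analogue with C11 atomics is sketched below. The `SK_` names and the choice of bit 63 are assumptions, and the release here uses an atomic read-modify-write, whereas the kernel version uses the non-atomic `__clear_bit_unlock()`.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SK_RMAP_LOCK_BIT 63 /* assumed position of the lock bit */

static inline void sk_lock_rmap(_Atomic uint64_t *rmap)
{
	const uint64_t lock = 1ULL << SK_RMAP_LOCK_BIT;

	for (;;) {
		/* Spin on plain loads while the lock is visibly held. */
		while (atomic_load_explicit(rmap, memory_order_relaxed) & lock)
			;
		/* Try to take it; the winner saw the bit clear. */
		if (!(atomic_fetch_or_explicit(rmap, lock,
					       memory_order_acquire) & lock))
			return;
	}
}

static inline void sk_unlock_rmap(_Atomic uint64_t *rmap)
{
	/* Clear only the lock bit, with release ordering. */
	atomic_fetch_and_explicit(rmap, ~(1ULL << SK_RMAP_LOCK_BIT),
				  memory_order_release);
}
```

Spinning on a read before attempting the atomic operation keeps waiters from repeatedly writing to the cache line, which is the usual motivation for this loop shape. The remaining bits of the word are left untouched by both operations.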
|
|
|
|
|
2011-12-12 20:31:41 +08:00
|
|
|
static inline bool slot_is_aligned(struct kvm_memory_slot *memslot,
|
|
|
|
unsigned long pagesize)
|
|
|
|
{
|
|
|
|
unsigned long mask = (pagesize >> PAGE_SHIFT) - 1;
|
|
|
|
|
|
|
|
if (pagesize <= PAGE_SIZE)
|
2015-03-31 07:46:04 +08:00
|
|
|
return true;
|
2011-12-12 20:31:41 +08:00
|
|
|
return !(memslot->base_gfn & mask) && !(memslot->npages & mask);
|
|
|
|
}
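The alignment test in slot_is_aligned() works in units of host pages. The sketch below reproduces it in user space with a cut-down, hypothetical `struct sk_memslot` and an assumed 4kB host page size; the real `struct kvm_memory_slot` has many more fields.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12                     /* assumed 4kB host pages */
#define SK_PAGE_SIZE  (1ULL << SK_PAGE_SHIFT)

/* Hypothetical stand-in for the two memslot fields that matter here. */
struct sk_memslot {
	uint64_t base_gfn; /* first guest frame number of the slot */
	uint64_t npages;   /* slot length in host pages */
};

/* True if both the start and the length are multiples of pagesize. */
static inline bool sk_slot_is_aligned(const struct sk_memslot *slot,
				      uint64_t pagesize)
{
	uint64_t mask = (pagesize >> SK_PAGE_SHIFT) - 1;

	if (pagesize <= SK_PAGE_SIZE)
		return true;
	return !(slot->base_gfn & mask) && !(slot->npages & mask);
}
```

With 4kB host pages, a 16MB page spans 0x1000 frames, so a slot starting at frame 0x1000 with 0x2000 pages qualifies, while one starting at frame 0x1001 does not.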
|
|
|
|
|
KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor. Reads on
this fd return the contents of the HPT (hashed page table), writes
create and/or remove entries in the HPT. There is a new capability,
KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl. The ioctl
takes an argument structure with the index of the first HPT entry to
read out and a set of flags. The flags indicate whether the user is
intending to read or write the HPT, and whether to return all entries
or only the "bolted" entries (those with the bolted bit, 0x10, set in
the first doubleword).
This is intended for use in implementing qemu's savevm/loadvm and for
live migration. Therefore, on reads, the first pass returns information
about all HPTEs (or all bolted HPTEs). When the first pass reaches the
end of the HPT, it returns from the read. Subsequent reads only return
information about HPTEs that have changed since they were last read.
A read that finds no changed HPTEs in the HPT following where the last
read finished will return 0 bytes.
The format of the data provides a simple run-length compression of the
invalid entries. Each block of data starts with a header that indicates
the index (position in the HPT, which is just an array), the number of
valid entries starting at that index (may be zero), and the number of
invalid entries following those valid entries. The valid entries, 16
bytes each, follow the header. The invalid entries are not explicitly
represented.
Signed-off-by: Paul Mackerras <paulus@samba.org>
[agraf: fix documentation]
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-11-20 06:57:20 +08:00
|
|
|
/*
|
|
|
|
* This works for 4k, 64k and 16M pages on POWER7,
|
|
|
|
* and 4k and 16M pages on PPC970.
|
|
|
|
*/
|
|
|
|
static inline unsigned long slb_pgsize_encoding(unsigned long psize)
|
|
|
|
{
|
|
|
|
unsigned long senc = 0;
|
|
|
|
|
|
|
|
if (psize > 0x1000) {
|
|
|
|
senc = SLB_VSID_L;
|
|
|
|
if (psize == 0x10000)
|
|
|
|
senc |= SLB_VSID_LP_01;
|
|
|
|
}
|
|
|
|
return senc;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int is_vrma_hpte(unsigned long hpte_v)
|
|
|
|
{
|
|
|
|
return (hpte_v & ~0xffffffUL) ==
|
|
|
|
(HPTE_V_1TB_SEG | (VRMA_VSID << (40 - 16)));
|
|
|
|
}
|
|
|
|
|
2013-10-08 00:47:52 +08:00
|
|
|
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
|
2013-04-19 03:50:24 +08:00
|
|
|
/*
|
|
|
|
* Note modification of an HPTE; set the HPTE modified bit
|
|
|
|
* if anyone is interested.
|
|
|
|
*/
|
|
|
|
static inline void note_hpte_modification(struct kvm *kvm,
|
|
|
|
struct revmap_entry *rev)
|
|
|
|
{
|
|
|
|
if (atomic_read(&kvm->arch.hpte_mod_interest))
|
|
|
|
rev->guest_rpte |= HPTE_GR_MODIFIED;
|
|
|
|
}
|
KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode
With HV KVM, some high-frequency hypercalls such as H_ENTER are handled
in real mode, and need to access the memslots array for the guest.
Accessing the memslots array is safe, because we hold the SRCU read
lock for the whole time that a guest vcpu is running. However, the
checks that kvm_memslots() does when lockdep is enabled are potentially
unsafe in real mode, when only the linear mapping is available.
Furthermore, kvm_memslots() can be called from a secondary CPU thread,
which is an offline CPU from the point of view of the host kernel,
and is not running the task which holds the SRCU read lock.
To avoid false positives in the checks in kvm_memslots(), and to avoid
possible side effects from doing the checks in real mode, this replaces
kvm_memslots() with kvm_memslots_raw() in all the places that execute
in real mode. kvm_memslots_raw() is a new function that is like
kvm_memslots() but uses rcu_dereference_raw_notrace() instead of
kvm_dereference_check().
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Scott Wood <scottwood@freescale.com>
2014-03-25 07:47:06 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Like kvm_memslots(), but for use in real mode when we can't do
|
|
|
|
* any RCU stuff (since the secondary threads are offline from the
|
|
|
|
* kernel's point of view), and we can't print anything.
|
|
|
|
* Thus we use rcu_dereference_raw_notrace() rather than rcu_dereference_check().
|
|
|
|
*/
|
|
|
|
static inline struct kvm_memslots *kvm_memslots_raw(struct kvm *kvm)
|
|
|
|
{
|
2015-05-17 23:30:37 +08:00
|
|
|
return rcu_dereference_raw_notrace(kvm->memslots[0]);
|
2014-03-25 07:47:06 +08:00
|
|
|
}
|
|
|
|
|
2015-03-28 11:21:01 +08:00
|
|
|
extern void kvmppc_mmu_debugfs_init(struct kvm *kvm);
|
2018-10-08 13:30:57 +08:00
|
|
|
extern void kvmhv_radix_debugfs_init(struct kvm *kvm);
|
2015-03-28 11:21:01 +08:00
|
|
|
|
2015-03-28 11:21:11 +08:00
|
|
|
extern void kvmhv_rm_send_ipi(int cpu);
|
|
|
|
|
2016-12-20 13:49:01 +08:00
|
|
|
static inline unsigned long kvmppc_hpt_npte(struct kvm_hpt_info *hpt)
|
|
|
|
{
|
|
|
|
/* HPTEs are 2**4 bytes long */
|
|
|
|
return 1UL << (hpt->order - 4);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long kvmppc_hpt_mask(struct kvm_hpt_info *hpt)
|
|
|
|
{
|
|
|
|
/* 128 (2**7) bytes in each HPTEG */
|
|
|
|
return (1UL << (hpt->order - 7)) - 1;
|
|
|
|
}
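The two HPT size helpers are plain power-of-two arithmetic on the table order (log2 of its size in bytes). A user-space sketch, using a hypothetical one-field `struct sk_hpt_info` in place of `struct kvm_hpt_info`:

```c
#include <stdint.h>

/* Hypothetical stand-in; only the order field is needed here. */
struct sk_hpt_info {
	uint32_t order; /* log2 of the HPT size in bytes */
};

/* Number of HPTEs: each entry is 16 (2**4) bytes. */
static inline uint64_t sk_hpt_npte(const struct sk_hpt_info *hpt)
{
	return 1ULL << (hpt->order - 4);
}

/* Index mask for HPTE groups: each group is 128 (2**7) bytes. */
static inline uint64_t sk_hpt_mask(const struct sk_hpt_info *hpt)
{
	return (1ULL << (hpt->order - 7)) - 1;
}
```

For a 16MB table (order 24) this gives 1048576 entries and a group mask of 131071, which is consistent with 8 entries per 128-byte group.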
|
|
|
|
|
KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix
Currently, the HPT code in HV KVM maintains a dirty bit per guest page
in the rmap array, whether or not dirty page tracking has been enabled
for the memory slot. In contrast, the radix code maintains a dirty
bit per guest page in memslot->dirty_bitmap, and only does so when
dirty page tracking has been enabled.
This changes the HPT code to maintain the dirty bits in the memslot
dirty_bitmap like radix does. This results in slightly less code
overall, and will mean that we do not lose the dirty bits when
transitioning between HPT and radix mode in future.
There is one minor change to behaviour as a result. With HPT, when
dirty tracking was enabled for a memslot, we would previously clear
all the dirty bits at that point (both in the HPT entries and in the
rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately
following would show no pages as dirty (assuming no vcpus have run
in the meantime). With this change, the dirty bits on HPT entries
are not cleared at the point where dirty tracking is enabled, so
KVM_GET_DIRTY_LOG would show as dirty any guest pages that are
resident in the HPT and dirty. This is consistent with what happens
on radix.
This also fixes a bug in the mark_pages_dirty() function for radix
(in the sense that the function no longer exists). In the case where
a large page of 64 normal pages or more is marked dirty, the
addressing of the dirty bitmap was incorrect and could write past
the end of the bitmap. Fortunately this case was never hit in
practice because a 2MB large page is only 32 x 64kB pages, and we
don't support backing the guest with 1GB huge pages at this point.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-10-26 13:39:19 +08:00
|
|
|
/* Set bits in a dirty bitmap, which is in LE format */
|
|
|
|
static inline void set_dirty_bits(unsigned long *map, unsigned long i,
|
|
|
|
unsigned long npages)
|
|
|
|
{
|
|
|
|
|
|
|
|
if (npages >= 8)
|
|
|
|
memset((char *)map + i / 8, 0xff, npages / 8);
|
|
|
|
else
|
|
|
|
for (; npages; ++i, --npages)
|
|
|
|
__set_bit_le(i, map);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void set_dirty_bits_atomic(unsigned long *map, unsigned long i,
|
|
|
|
unsigned long npages)
|
|
|
|
{
|
|
|
|
if (npages >= 8)
|
|
|
|
memset((char *)map + i / 8, 0xff, npages / 8);
|
|
|
|
else
|
|
|
|
for (; npages; ++i, --npages)
|
|
|
|
set_bit_le(i, map);
|
|
|
|
}
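The two dirty-bitmap helpers share one idea: fill whole bytes when at least 8 pages are being marked, and otherwise set bits one at a time. The user-space sketch below shows the non-atomic variant on a byte array, which makes the little-endian bit numbering explicit. `sk_set_dirty_bits` is a hypothetical name, and the sketch assumes, as the bulk path appears to, that `i` and `npages` are multiples of 8 whenever `npages >= 8`.

```c
#include <stdint.h>
#include <string.h>

/*
 * Mark npages pages dirty starting at page index i. Bit n of the
 * bitmap is bit (n % 8) of byte (n / 8), i.e. little-endian order.
 */
static inline void sk_set_dirty_bits(uint8_t *map, unsigned long i,
				     unsigned long npages)
{
	if (npages >= 8)
		/* Bulk path: assumes i and npages are multiples of 8. */
		memset(map + i / 8, 0xff, npages / 8);
	else
		for (; npages; ++i, --npages)
			map[i / 8] |= (uint8_t)(1u << (i % 8));
}
```

Marking 16 pages from index 8 fills bytes 1 and 2 with 0xff; marking 3 pages from index 5 sets the top three bits of byte 0. The atomic variant differs only in the per-bit path, where each bit is set with an atomic operation.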
|
|
|
|
|
KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a softpatch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
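The guest-entry rule described above (fake suspend for a suspended guest, rollback for a transactional guest) can be sketched as a small decision function. This is an illustration only, not the KVM entry path: the enum, the function name and the SKETCH_ macros are hypothetical, and the bit positions are assumed to match the MSR[TS] suspended and transactional bits (33 and 34) in the kernel's reg.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed positions of the MSR[TS] field bits; hypothetical names. */
#define SKETCH_MSR_TS_S		(1ULL << 33)	/* suspended */
#define SKETCH_MSR_TS_T		(1ULL << 34)	/* transactional */
#define SKETCH_MSR_TS_MASK	(SKETCH_MSR_TS_S | SKETCH_MSR_TS_T)

enum sketch_entry_action {
	SKETCH_ENTER_NORMALLY,		/* no checkpoint would be needed */
	SKETCH_ENTER_FAKE_SUSPEND,	/* guest MSR says suspended */
	SKETCH_ROLL_BACK,		/* guest MSR says transactional */
};

/*
 * Pick what to do in place of trechkpt on guest entry, following the
 * rule in the commit message: never enter the guest in a state that
 * would require a real checkpoint.
 */
static enum sketch_entry_action sketch_p9_entry_action(uint64_t guest_msr)
{
	uint64_t ts = guest_msr & SKETCH_MSR_TS_MASK;

	/* Both bits set is a reserved encoding; not handled in this sketch. */
	assert(ts != SKETCH_MSR_TS_MASK);

	if (ts == SKETCH_MSR_TS_S)
		return SKETCH_ENTER_FAKE_SUSPEND;
	if (ts == SKETCH_MSR_TS_T)
		return SKETCH_ROLL_BACK;
	return SKETCH_ENTER_NORMALLY;
}
```

The sketch covers only the choice between the three outcomes; setting the PACA flag and the PSSCR bit, and the rollback itself, are left out.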
2018-03-21 18:32:01 +08:00
|
|
|
/* Force MSR_HV off and MSR_ME on in an MSR value */
static inline u64 sanitize_msr(u64 msr)
|
|
|
|
{
|
|
|
|
msr &= ~MSR_HV;
|
|
|
|
msr |= MSR_ME;
|
|
|
|
return msr;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
|
|
|
|
/* Load the vcpu's live register state from its checkpointed (*_tm) copies */
static inline void copy_from_checkpoint(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
vcpu->arch.cr = vcpu->arch.cr_tm;
|
2018-05-07 14:20:08 +08:00
|
|
|
vcpu->arch.regs.xer = vcpu->arch.xer_tm;
|
|
|
|
vcpu->arch.regs.link = vcpu->arch.lr_tm;
|
|
|
|
vcpu->arch.regs.ctr = vcpu->arch.ctr_tm;
|
2018-03-21 18:32:01 +08:00
|
|
|
vcpu->arch.amr = vcpu->arch.amr_tm;
|
|
|
|
vcpu->arch.ppr = vcpu->arch.ppr_tm;
|
|
|
|
vcpu->arch.dscr = vcpu->arch.dscr_tm;
|
|
|
|
vcpu->arch.tar = vcpu->arch.tar_tm;
|
2018-05-07 14:20:07 +08:00
|
|
|
memcpy(vcpu->arch.regs.gpr, vcpu->arch.gpr_tm,
|
|
|
|
sizeof(vcpu->arch.regs.gpr));
|
2018-03-21 18:32:01 +08:00
|
|
|
vcpu->arch.fp = vcpu->arch.fp_tm;
|
|
|
|
vcpu->arch.vr = vcpu->arch.vr_tm;
|
|
|
|
vcpu->arch.vrsave = vcpu->arch.vrsave_tm;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Save the vcpu's live register state into its checkpointed (*_tm) copies */
static inline void copy_to_checkpoint(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
vcpu->arch.cr_tm = vcpu->arch.cr;
|
2018-05-07 14:20:08 +08:00
|
|
|
vcpu->arch.xer_tm = vcpu->arch.regs.xer;
|
|
|
|
vcpu->arch.lr_tm = vcpu->arch.regs.link;
|
|
|
|
vcpu->arch.ctr_tm = vcpu->arch.regs.ctr;
|
2018-03-21 18:32:01 +08:00
|
|
|
vcpu->arch.amr_tm = vcpu->arch.amr;
|
|
|
|
vcpu->arch.ppr_tm = vcpu->arch.ppr;
|
|
|
|
vcpu->arch.dscr_tm = vcpu->arch.dscr;
|
|
|
|
vcpu->arch.tar_tm = vcpu->arch.tar;
|
2018-05-07 14:20:07 +08:00
|
|
|
memcpy(vcpu->arch.gpr_tm, vcpu->arch.regs.gpr,
|
|
|
|
sizeof(vcpu->arch.regs.gpr));
|
2018-03-21 18:32:01 +08:00
|
|
|
vcpu->arch.fp_tm = vcpu->arch.fp;
|
|
|
|
vcpu->arch.vr_tm = vcpu->arch.vr;
|
|
|
|
vcpu->arch.vrsave_tm = vcpu->arch.vrsave;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
|
|
|
|
|
2013-10-08 00:47:52 +08:00
|
|
|
#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
|
2013-04-19 03:50:24 +08:00
|
|
|
|
2010-04-16 06:11:37 +08:00
|
|
|
#endif /* __ASM_KVM_BOOK3S_64_H__ */
|