Commit Graph

568 Commits

Author SHA1 Message Date
Takuya Yoshikawa e12091ce7b KVM: Remove unused slot_bitmap from kvm_mmu_page
Not needed any more.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:13:58 +02:00
Takuya Yoshikawa b99db1d352 KVM: MMU: Make kvm_mmu_slot_remove_write_access() rmap based
This makes it possible to release mmu_lock and reschedule conditionally
in a later patch.  Although this may increase the time needed to protect
the whole slot when we start dirty logging, the kernel should not allow
the userspace to trigger something that will hold a spinlock for such a
long time as tens of milliseconds: actually there is no limit since it
is roughly proportional to the number of guest pages.

Another point to note is that this patch removes the only user of
slot_bitmap which will cause some problems when we increase the number
of slots further.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:13:47 +02:00
Takuya Yoshikawa 245c3912ea KVM: MMU: Remove unused parameter level from __rmap_write_protect()
No longer need to care about the mapping level in this function.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:13:31 +02:00
Xiao Guangrong 7751babd3c KVM: MMU: fix infinite fault access retry
We have two issues in current code:
- if target gfn is used as its page table, guest will refault then kvm will use
  small page size to map it. We need two #PF to fix its shadow page table

- sometimes, say a exception is triggered during vm-exit caused by #PF
  (see handle_exception() in vmx.c), we remove all the shadow pages shadowed
  by the target gfn before go into page fault path, it will cause infinite
  loop:
  delete shadow pages shadowed by the gfn -> try to use large page size to map
  the gfn -> retry the access ->...

To fix these, we can adjust page size early if the target gfn is used as page
table

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-10 15:28:30 -02:00
Xiao Guangrong c22885050e KVM: MMU: fix Dirty bit missed if CR0.WP = 0
If the write-fault access is from supervisor and CR0.WP is not set on the
vcpu, kvm will fix it by adjusting pte access - it sets the W bit on pte
and clears U bit. This is the chance that kvm can change pte access from
readonly to writable

Unfortunately, the pte access is the access of 'direct' shadow page table,
means direct sp.role.access = pte_access, then we will create a writable
spte entry on the readonly shadow page table. It will cause Dirty bit is
not tracked when two guest ptes point to the same large page. Note, it
does not have other impact except Dirty bit since cr0.wp is encoded into
sp.role

It can be fixed by adjusting pte access before establishing shadow page
table. Also, after that, no mmu specified code exists in the common function
and drop two parameters in set_spte

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-10 15:28:08 -02:00
Xiao Guangrong c219346325 KVM: MMU: optimize for set_spte
There are two cases we need to adjust page size in set_spte:
1): the one is other vcpu creates new sp in the window between mapping_level()
    and acquiring mmu-lock.
2): the another case is the new sp is created by itself (page-fault path) when
    guest uses the target gfn as its page table.

In current code, set_spte drop the spte and emulate the access for these case,
it works not good:
- for the case 1, it may destroy the mapping established by other vcpu, and
  do expensive instruction emulation.
- for the case 2, it may emulate the access even if the guest is accessing
  the page which not used as page table. There is a example, 0~2M is used as
  huge page in guest, in this huge page, only page 3 used as page table, then
  guest read/writes on other pages can cause instruction emulation.

Both of these cases can be fixed by allowing guest to retry the access, it
will refault, then we can establish the mapping by using small page

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-06 09:11:25 +02:00
Xiao Guangrong 81c52c56e2 KVM: do not treat noslot pfn as a error pfn
This patch filters noslot pfn out from error pfns based on Marcelo comment:
noslot pfn is not a error pfn

After this patch,
- is_noslot_pfn indicates that the gfn is not in slot
- is_error_pfn indicates that the gfn is in slot but the error is occurred
  when translate the gfn to pfn
- is_error_noslot_pfn indicates that the pfn either it is error pfns or it
  is noslot pfn
And is_invalid_pfn can be removed, it makes the code more clean

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-29 20:31:04 -02:00
Marcelo Tosatti 19bf7f8ac3 Merge remote-tracking branch 'master' into queue
Merge reason: development work has dependency on kvm patches merged
upstream.

Conflicts:
	arch/powerpc/include/asm/Kbuild
	arch/powerpc/include/asm/kvm_para.h

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-29 19:15:32 -02:00
Christoffer Dall 8ca40a70a7 KVM: Take kvm instead of vcpu to mmu_notifier_retry
The mmu_notifier_retry is not specific to any vcpu (and never will be)
so only take struct kvm as a parameter.

The motivation is the ARM mmu code that needs to call this from
somewhere where we long let go of the vcpu pointer.

Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-23 13:35:43 +02:00
Xiao Guangrong f3ac1a4b66 KVM: MMU: fix release noslot pfn
We can not directly call kvm_release_pfn_clean to release the pfn
since we can meet noslot pfn which is used to cache mmio info into
spte

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-22 18:03:25 +02:00
Xiao Guangrong a052b42b0e KVM: MMU: move prefetch_invalid_gpte out of pagaing_tmp.h
The function does not depend on guest mmu mode, move it out from
paging_tmpl.h

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-17 16:39:18 +02:00
Xiao Guangrong bd660776da KVM: MMU: remove mmu_is_invalid
Remove mmu_is_invalid and use is_invalid_pfn instead

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-17 16:39:15 +02:00
Avi Kivity 6fd01b711b KVM: MMU: Optimize is_last_gpte()
Instead of branchy code depending on level, gpte.ps, and mmu configuration,
prepare everything in a bitmap during mode changes and look it up during
runtime.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:09 +03:00
Avi Kivity 97d64b7881 KVM: MMU: Optimize pte permission checks
walk_addr_generic() permission checks are a maze of branchy code, which is
performed four times per lookup.  It depends on the type of access, efer.nxe,
cr0.wp, cr4.smep, and in the near future, cr4.smap.

Optimize this away by precalculating all variants and storing them in a
bitmap.  The bitmap is recalculated when rarely-changing variables change
(cr0, cr4) and is indexed by the often-changing variables (page fault error
code, pte access permissions).

The permission check is moved to the end of the loop, otherwise an SMEP
fault could be reported as a false positive, when PDE.U=1 but PTE.U=0.
Noted by Xiao Guangrong.

The result is short, branch-free code.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:08 +03:00
Avi Kivity 3d34adec70 KVM: MMU: Move gpte_access() out of paging_tmpl.h
We no longer rely on paging_tmpl.h defines; so we can move the function
to mmu.c.

Rely on zero extension to 64 bits to get the correct nx behaviour.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:08 +03:00
Avi Kivity 8ea667f259 KVM: MMU: Push clean gpte write protection out of gpte_access()
gpte_access() computes the access permissions of a guest pte and also
write-protects clean gptes.  This is wrong when we are servicing a
write fault (since we'll be setting the dirty bit momentarily) but
correct when instantiating a speculative spte, or when servicing a
read fault (since we'll want to trap a following write in order to
set the dirty bit).

It doesn't seem to hurt in practice, but in order to make the code
readable, push the write protection out of gpte_access() and into
a new protect_clean_gpte() which is called explicitly when needed.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:07 +03:00
Xiao Guangrong 7de5bdc96c KVM: MMU: remove unnecessary check
Checking the return of kvm_mmu_get_page is unnecessary since it is
guaranteed by memory cache

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-10 11:26:16 +03:00
Marcelo Tosatti c78aa4c4b9 Merge remote-tracking branch 'upstream/master' into queue
Merging critical fixes from upstream required for development.

* upstream/master: (809 commits)
  libata: Add a space to " 2GB ATA Flash Disk" DMA blacklist entry
  Revert "powerpc: Update g5_defconfig"
  powerpc/perf: Use pmc_overflow() to detect rolled back events
  powerpc: Fix VMX in interrupt check in POWER7 copy loops
  powerpc: POWER7 copy_to_user/copy_from_user patch applied twice
  powerpc: Fix personality handling in ppc64_personality()
  powerpc/dma-iommu: Fix IOMMU window check
  powerpc: Remove unnecessary ifdefs
  powerpc/kgdb: Restore current_thread_info properly
  powerpc/kgdb: Bail out of KGDB when we've been triggered
  powerpc/kgdb: Do not set kgdb_single_step on ppc
  powerpc/mpic_msgr: Add missing includes
  powerpc: Fix null pointer deref in perf hardware breakpoints
  powerpc: Fixup whitespace in xmon
  powerpc: Fix xmon dl command for new printk implementation
  xfs: check for possible overflow in xfs_ioc_trim
  xfs: unlock the AGI buffer when looping in xfs_dialloc
  xfs: fix uninitialised variable in xfs_rtbuf_get()
  powerpc/fsl: fix "Failed to mount /dev: No such device" errors
  powerpc/fsl: update defconfigs
  ...

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-26 13:58:41 -03:00
Takuya Yoshikawa 35f2d16bb9 KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended
Although the possible race described in

  commit 85b7059169
  KVM: MMU: fix shrinking page from the empty mmu

was correct, the real cause of that issue was a more trivial bug of
mmu_shrink() introduced by

  commit 1952639665
  KVM: MMU: do not iterate over all VMs in mmu_shrink()

Here is the bug:

	if (kvm->arch.n_used_mmu_pages > 0) {
		if (!nr_to_scan--)
			break;
		continue;
	}

We skip VMs whose n_used_mmu_pages is not zero and try to shrink others:
in other words we try to shrink empty ones by mistake.

This patch reverses the logic so that mmu_shrink() can free pages from
the first VM whose n_used_mmu_pages is not zero.  Note that we also add
comments explaining the role of nr_to_scan which is not practically
important now, hoping this will be improved in the future.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-22 15:27:13 +03:00
Xiao Guangrong 4d8b81abc4 KVM: introduce readonly memslot
In current code, if we map a readonly memory space from host to guest
and the page is not currently mapped in the host, we will get a fault
pfn and async is not allowed, then the vm will crash

We introduce readonly memory region to map ROM/ROMD to the guest, read access
is happy for readonly memslot, write access on readonly memslot will cause
KVM_EXIT_MMIO exit

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-22 15:09:03 +03:00
Xiao Guangrong 037d92dc5d KVM: introduce gfn_to_pfn_memslot_atomic
It can instead of hva_to_pfn_atomic

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-22 15:08:52 +03:00
Xiao Guangrong cb9aaa30b1 KVM: do not release the error pfn
After commit a2766325cf, the error pfn is replaced by the
error code, it need not be released anymore

[ The patch has been compiling tested for powerpc ]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:04:57 +03:00
Xiao Guangrong e6c1502b3f KVM: introduce KVM_PFN_ERR_HWPOISON
Then, get_hwpoison_pfn and is_hwpoison_pfn can be removed

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:04:52 +03:00
Xiao Guangrong 6c8ee57be9 KVM: introduce KVM_PFN_ERR_FAULT
After that, the exported and un-inline function, get_fault_pfn,
can be removed

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:04:50 +03:00
Takuya Yoshikawa d89cc617b9 KVM: Push rmap into kvm_arch_memory_slot
Two reasons:
 - x86 can integrate rmap and rmap_pde and remove heuristics in
   __gfn_to_rmap().
 - Some architectures do not need rmap.

Since rmap is one of the most memory consuming stuff in KVM, ppc'd
better restrict the allocation to Book3S HV.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 12:47:30 +03:00
Takuya Yoshikawa 65fbe37c42 KVM: MMU: Use gfn_to_rmap() instead of directly reading rmap array
This helps to make rmap architecture specific in a later patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 12:47:04 +03:00
Xiao Guangrong 3b2bd2f800 KVM: MMU: use kvm_release_pfn_clean to release pfn
The current code depends on the fact that fault_page is the normal page,
however, we will use the error code instead of these dummy pages in the
later patch, so we use kvm_release_pfn_clean to release pfn which will
release the error code properly

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-26 11:55:30 +03:00
Avi Kivity e9bda6f6f9 Merge branch 'queue' into next
Merge patches queued during the run-up to the merge window.

* queue: (25 commits)
  KVM: Choose better candidate for directed yield
  KVM: Note down when cpu relax intercepted or pause loop exited
  KVM: Add config to support ple or cpu relax optimzation
  KVM: switch to symbolic name for irq_states size
  KVM: x86: Fix typos in pmu.c
  KVM: x86: Fix typos in lapic.c
  KVM: x86: Fix typos in cpuid.c
  KVM: x86: Fix typos in emulate.c
  KVM: x86: Fix typos in x86.c
  KVM: SVM: Fix typos
  KVM: VMX: Fix typos
  KVM: remove the unused parameter of gfn_to_pfn_memslot
  KVM: remove is_error_hpa
  KVM: make bad_pfn static to kvm_main.c
  KVM: using get_fault_pfn to get the fault pfn
  KVM: MMU: track the refcount when unmap the page
  KVM: x86: remove unnecessary mark_page_dirty
  KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()
  KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
  KVM: MMU: Add memslot parameter to hva handlers
  ...

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-26 11:54:21 +03:00
Linus Torvalds 5fecc9d8f5 KVM updates for the 3.6 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQDRDNAAoJEI7yEDeUysxlkl8P/3C2AHx2webOU8sVzhfU6ONZ
 ZoGevwBjyZIeJEmiWVpFTTEew1l0PXtpyOocXGNUXIddVnhXTQOKr/Scj4uFbmx8
 ROqgK8NSX9+xOGrBPCoN7SlJkmp+m6uYtwYkl2SGnsEVLWMKkc7J7oqmszCcTQvN
 UXMf7G47/Ul2NUSBdv4Yvizhl4kpvWxluiweDw3E/hIQKN0uyP7CY58qcAztw8nG
 csZBAnnuPFwIAWxHXW3eBBv4UP138HbNDqJ/dujjocM6GnOxmXJmcZ6b57gh+Y64
 3+w9IR4qrRWnsErb/I8inKLJ1Jdcf7yV2FmxYqR4pIXay2Yzo1BsvFd6EB+JavUv
 pJpixrFiDDFoQyXlh4tGpsjpqdXNMLqyG4YpqzSZ46C8naVv9gKE7SXqlXnjyDlb
 Llx3hb9Fop8O5ykYEGHi+gIISAK5eETiQl4yw9RUBDpxydH4qJtqGIbLiDy8y9wi
 Xyi8PBlNl+biJFsK805lxURqTp/SJTC3+Zb7A7CzYEQm5xZw3W/CKZx1ZYBfpaa/
 pWaP6tB7JwgLIVXi4HQayLWqMVwH0soZIn9yazpOEFv6qO8d5QH5RAxAW2VXE3n5
 JDlrajar/lGIdiBVWfwTJLb86gv3QDZtIWoR9mZuLKeKWE/6PRLe7HQpG1pJovsm
 2AsN5bS0BWq+aqPpZHa5
 =pECD
 -----END PGP SIGNATURE-----

Merge tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Avi Kivity:
 "Highlights include
  - full big real mode emulation on pre-Westmere Intel hosts (can be
    disabled with emulate_invalid_guest_state=0)
  - relatively small ppc and s390 updates
  - PCID/INVPCID support in guests
  - EOI avoidance; 3.6 guests should perform better on 3.6 hosts on
    interrupt intensive workloads)
  - Lockless write faults during live migration
  - EPT accessed/dirty bits support for new Intel processors"

Fix up conflicts in:
 - Documentation/virtual/kvm/api.txt:

   Stupid subchapter numbering, added next to each other.

 - arch/powerpc/kvm/booke_interrupts.S:

   PPC asm changes clashing with the KVM fixes

 - arch/s390/include/asm/sigp.h, arch/s390/kvm/sigp.c:

   Duplicated commits through the kvm tree and the s390 tree, with
   subsequent edits in the KVM tree.

* tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
  KVM: fix race with level interrupts
  x86, hyper: fix build with !CONFIG_KVM_GUEST
  Revert "apic: fix kvm build on UP without IOAPIC"
  KVM guest: switch to apic_set_eoi_write, apic_write
  apic: add apic_set_eoi_write for PV use
  KVM: VMX: Implement PCID/INVPCID for guests with EPT
  KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check
  KVM: PPC: Critical interrupt emulation support
  KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests
  KVM: PPC64: booke: Set interrupt computation mode for 64-bit host
  KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt
  KVM: PPC: bookehv64: Add support for std/ld emulation.
  booke: Added crit/mc exception handler for e500v2
  booke/bookehv: Add host crit-watchdog exception support
  KVM: MMU: document mmu-lock and fast page fault
  KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
  KVM: MMU: trace fast page fault
  KVM: MMU: fast path of handling guest page fault
  KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
  KVM: MMU: fold tlb flush judgement into mmu_spte_update
  ...
2012-07-24 12:01:20 -07:00
Xiao Guangrong d566104853 KVM: remove the unused parameter of gfn_to_pfn_memslot
The parameter, 'kvm', is not used in gfn_to_pfn_memslot, we can happily remove
it

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-19 21:25:24 -03:00
Xiao Guangrong 903816fa4d KVM: using get_fault_pfn to get the fault pfn
Using get_fault_pfn to cleanup the code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-19 21:15:25 -03:00
Xiao Guangrong 86fde74cf5 KVM: MMU: track the refcount when unmap the page
It will trigger a WARN_ON if the page has been freed but it is still
used in mmu, it can help us to detect mm bug early

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-19 21:09:10 -03:00
Takuya Yoshikawa bcd3ef5828 KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()
When we invalidate a THP page, we call the handler with the same
rmap_pde argument 512 times in the following loop:

  for each guest page in the range
    for each level
      unmap using rmap

This patch avoids these extra handler calls by changing the loop order
like this:

  for each level
    for each rmap in the range
      unmap using rmap

With the preceding patches in the patch series, this made THP page
invalidation more than 5 times faster on our x86 host: the host became
more responsive during swapping the guest's memory as a result.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:04 -03:00
Takuya Yoshikawa f395302e09 KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
This restricts the tracing to page aging and makes it possible to
optimize kvm_handle_hva_range() further in the following patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:04 -03:00
Takuya Yoshikawa 048212d0bc KVM: MMU: Add memslot parameter to hva handlers
This is needed to push trace_kvm_age_page() into kvm_age_rmapp() in the
following patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:04 -03:00
Takuya Yoshikawa 77d11309b3 KVM: Separate rmap_pde from kvm_lpage_info->write_count
This makes it possible to loop over rmap_pde arrays in the same way as
we do over rmap so that we can optimize kvm_handle_hva_range() easily in
the following patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:04 -03:00
Takuya Yoshikawa b3ae209697 KVM: Introduce kvm_unmap_hva_range() for kvm_mmu_notifier_invalidate_range_start()
When we tested KVM under memory pressure, with THP enabled on the host,
we noticed that MMU notifier took a long time to invalidate huge pages.

Since the invalidation was done with mmu_lock held, it not only wasted
the CPU but also made the host harder to respond.

This patch mitigates this by using kvm_handle_hva_range().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:04 -03:00
Takuya Yoshikawa 84504ef386 KVM: MMU: Make kvm_handle_hva() handle range of addresses
When guest's memory is backed by THP pages, MMU notifier needs to call
kvm_unmap_hva(), which in turn leads to kvm_handle_hva(), in a loop to
invalidate a range of pages which constitute one huge page:

  for each page
    for each memslot
      if page is in memslot
        unmap using rmap

This means although every page in that range is expected to be found in
the same memslot, we are forced to check unrelated memslots many times.
If the guest has more memslots, the situation will become worse.

Furthermore, if the range does not include any pages in the guest's
memory, the loop over the pages will just consume extra time.

This patch, together with the following patches, solves this problem by
introducing kvm_handle_hva_range() which makes the loop look like this:

  for each memslot
    for each page in memslot
      unmap using rmap

In this new processing, the actual work is converted to a loop over rmap
which is much more cache friendly than before.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:04 -03:00
Takuya Yoshikawa d19a748b1c KVM: Introduce hva_to_gfn_memslot() for kvm_handle_hva()
This restricts hva handling in mmu code and makes it easier to extend
kvm_handle_hva() so that it can treat a range of addresses later in this
patch series.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Alexander Graf <agraf@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:04 -03:00
Takuya Yoshikawa 9594a49861 KVM: MMU: Use __gfn_to_rmap() to clean up kvm_handle_hva()
We can treat every level uniformly.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-18 16:55:03 -03:00
Xiao Guangrong a72faf2504 KVM: MMU: trace fast page fault
To see what happen on this path and help us to optimize it

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:21 +03:00
Xiao Guangrong c7ba5b48cc KVM: MMU: fast path of handling guest page fault
If the the present bit of page fault error code is set, it indicates
the shadow page is populated on all levels, it means what we do is
only modify the access bit which can be done out of mmu-lock

Currently, in order to simplify the code, we only fix the page fault
caused by write-protect on the fast path

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:20 +03:00
Xiao Guangrong 49fde3406f KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
This bit indicates whether the spte can be writable on MMU, that means
the corresponding gpte is writable and the corresponding gfn is not
protected by shadow page protection

In the later path, SPTE_MMU_WRITEABLE will indicates whether the spte
can be locklessly updated

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:19 +03:00
Xiao Guangrong 6e7d035407 KVM: MMU: fold tlb flush judgement into mmu_spte_update
mmu_spte_update() is the common function, we can easily audit the path

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:18 +03:00
Xiao Guangrong 8e22f955fb KVM: MMU: cleanup spte_write_protect
Use __drop_large_spte to cleanup this function and comment spte_write_protect

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:16 +03:00
Xiao Guangrong d13bc5b5a1 KVM: MMU: abstract spte write-protect
Introduce a common function to abstract spte write-protect to
cleanup the code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:14 +03:00
Xiao Guangrong 2f84569f97 KVM: MMU: return bool in __rmap_write_protect
The reture value of __rmap_write_protect is either 1 or 0, use
true/false instead of these

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:13 +03:00
Avi Kivity e676505ac9 KVM: MMU: Force cr3 reload with two dimensional paging on mov cr3 emulation
Currently the MMU's ->new_cr3() callback does nothing when guest paging
is disabled or when two-dimentional paging (e.g. EPT on Intel) is active.
This means that an emulated write to cr3 can be lost; kvm_set_cr3() will
write vcpu-arch.cr3, but the GUEST_CR3 field in the VMCS will retain its
old value and this is what the guest sees.

This bug did not have any effect until now because:
- with unrestricted guest, or with svm, we never emulate a mov cr3 instruction
- without unrestricted guest, and with paging enabled, we also never emulate a
  mov cr3 instruction
- without unrestricted guest, but with paging disabled, the guest's cr3 is
  ignored until the guest enables paging; at this point the value from arch.cr3
  is loaded correctly my the mov cr0 instruction which turns on paging

However, the patchset that enables big real mode causes us to emulate mov cr3
instructions in protected mode sometimes (when guest state is not virtualizable
by vmx); this mov cr3 is effectively ignored and will crash the guest.

The fix is to make nonpaging_new_cr3() call mmu_free_roots() to force a cr3
reload.  This is awkward because now all the new_cr3 callbacks to the same
thing, and because mmu_free_roots() is somewhat of an overkill; but fixing
that is more complicated and will be done after this minimal fix.

Observed in the Window XP 32-bit installer while bringing up secondary vcpus.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:18:59 +03:00
Xiao Guangrong 85b7059169 KVM: MMU: fix shrinking page from the empty mmu
Fix:

 [ 3190.059226] BUG: unable to handle kernel NULL pointer dereference at           (null)
 [ 3190.062224] IP: [<ffffffffa02aac66>] mmu_page_zap_pte+0x10/0xa7 [kvm]
 [ 3190.063760] PGD 104f50067 PUD 112bea067 PMD 0
 [ 3190.065309] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
 [ 3190.066860] CPU 1
[ ...... ]
 [ 3190.109629] Call Trace:
 [ 3190.111342]  [<ffffffffa02aada6>] kvm_mmu_prepare_zap_page+0xa9/0x1fc [kvm]
 [ 3190.113091]  [<ffffffffa02ab2f5>] mmu_shrink+0x11f/0x1f3 [kvm]
 [ 3190.114844]  [<ffffffffa02ab25d>] ? mmu_shrink+0x87/0x1f3 [kvm]
 [ 3190.116598]  [<ffffffff81150c9d>] ? prune_super+0x142/0x154
 [ 3190.118333]  [<ffffffff8110a4f4>] ? shrink_slab+0x39/0x31e
 [ 3190.120043]  [<ffffffff8110a687>] shrink_slab+0x1cc/0x31e
 [ 3190.121718]  [<ffffffff8110ca1d>] do_try_to_free_pages

This is caused by shrinking page from the empty mmu, although we have
checked n_used_mmu_pages, it is useless since the check is out of mmu-lock

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-03 17:31:50 -03:00
Xudong Hao 00763e4113 KVM: x86: change PT_FIRST_AVAIL_BITS_SHIFT to avoid conflict with EPT Dirty bit
EPT Dirty bit use bit 9 as Intel SDM definition, to avoid conflict, change
PT_FIRST_AVAIL_BITS_SHIFT to 10.

Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-06-13 20:28:21 -03:00
Takuya Yoshikawa 80feb89a0a KVM: MMU: Remove unused parameter from mmu_memory_cache_alloc()
Size is not needed to return one from pre-allocated objects.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-06-11 22:46:47 -03:00
Michael S. Tsirkin 79f702a6d1 KVM: disable uninitialized var warning
I see this in 3.5-rc1:

arch/x86/kvm/mmu.c: In function ‘kvm_test_age_rmapp’:
arch/x86/kvm/mmu.c:1271: warning: ‘iter.desc’ may be used uninitialized in this function

The line in question was introduced by commit
1e3f42f03c

 static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
                              unsigned long data)
 {
-       u64 *spte;
+       u64 *sptep;
+       struct rmap_iterator iter;   <- line 1271
        int young = 0;

        /*

The reason I think is that the compiler assumes that
the rmap value could be 0, so

static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator
*iter)
{
        if (!rmap)
                return NULL;

        if (!(rmap & 1)) {
                iter->desc = NULL;
                return (u64 *)rmap;
        }

        iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
        iter->pos = 0;
        return iter->desc->sptes[iter->pos];
}

will not initialize iter.desc, but the compiler isn't
smart enough to see that

        for (sptep = rmap_get_first(*rmapp, &iter); sptep;
             sptep = rmap_get_next(&iter)) {

will immediately exit in this case.
I checked by adding
        if (!*rmapp)
                goto out;
on top which is clearly equivalent but disables the warning.

This patch uses uninitialized_var to disable the warning without
increasing code size.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-06 15:26:12 +03:00
Gleb Natapov 1952639665 KVM: MMU: do not iterate over all VMs in mmu_shrink()
mmu_shrink() needlessly iterates over all VMs even though it will not
attempt to free mmu pages from more than one on them. Fix that and also
check used mmu pages count outside of VM lock to skip inactive VMs faster.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 17:46:43 +03:00
Xudong Hao 3f6d8c8a47 KVM: VMX: Use EPT Access bit in response to memory notifiers
Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:31:05 +03:00
Xiao Guangrong c358666783 KVM: MMU: fix huge page adapted on non-PAE host
The huge page size is 4M on non-PAE host, but 2M page size is used in
transparent_hugepage_adjust(), so the page we get after adjust the
mapping level is not the head page, the BUG_ON() will be triggered

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-28 17:41:15 +03:00
Avi Kivity c142786c62 KVM: MMU: Don't use RCU for lockless shadow walking
Using RCU for lockless shadow walking can increase the amount of memory
in use by the system, since RCU grace periods are unpredictable.  We also
have an unconditional write to a shared variable (reader_counter), which
isn't good for scaling.

Replace that with a scheme similar to x86's get_user_pages_fast(): disable
interrupts during lockless shadow walk to force the freer
(kvm_mmu_commit_zap_page()) to wait for the TLB flush IPI to find the
processor with interrupts enabled.

We also add a new vcpu->mode, READING_SHADOW_PAGE_TABLES, to prevent
kvm_flush_remote_tlbs() from avoiding the IPI.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 16:08:28 -03:00
Davidlohr Bueso f71fa31f9f KVM: MMU: use page table level macro
Its much cleaner to use PT_PAGE_TABLE_LEVEL than its numeric value.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-18 23:35:01 -03:00
Takuya Yoshikawa 1e3f42f03c KVM: MMU: Improve iteration through sptes from rmap
Iteration using rmap_next(), the actual body is pte_list_next(), is
inefficient: every time we call it we start from checking whether rmap
holds a single spte or points to a descriptor which links more sptes.

In the case of shadow paging, this quadratic total iteration cost is a
problem.  Even for two dimensional paging, with EPT/NPT on, in which we
almost always have a single mapping, the extra checks at the end of the
iteration should be eliminated.

This patch fixes this by introducing rmap_iterator which keeps the
iteration context for the next search.  Furthermore the implementation
of rmap_next() is splitted into two functions, rmap_get_first() and
rmap_get_next(), to avoid repeatedly checking whether the rmap being
iterated on has only one spte.

Although there seemed to be only a slight change for EPT/NPT, the actual
improvement was significant: we observed that GET_DIRTY_LOG for 1GB
dirty memory became 15% faster than before.  This is probably because
the new code is easy to make branch predictions.

Note: we just remove pte_list_next() because we can think of parent_ptes
as a reverse mapping.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 16:08:27 +03:00
Takuya Yoshikawa 220f773a00 KVM: MMU: Make pte_list_desc fit cache lines well
We have PTE_LIST_EXT + 1 pointers in this structure and these 40/20
bytes do not fit cache lines well.  Furthermore, some allocators may
use 64/32-byte objects for the pte_list_desc cache.

This patch solves this problem by changing PTE_LIST_EXT from 4 to 3.

For shadow paging, the new size is still large enough to hold both the
kernel and process mappings for usual anonymous pages.  For file
mappings, there may be a slight change in the cache usage.

Note: with EPT/NPT we almost always have a single spte in each reverse
mapping and we will not see any change by this.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 16:08:25 +03:00
Takuya Yoshikawa 5dc99b2380 KVM: Avoid checking huge page mappings in get_dirty_log()
Dropped such mappings when we enabled dirty logging and we will never
create new ones until we stop the logging.

For this we introduce a new function which can be used to write protect
a range of PT level pages: although we do not need to care about a range
of pages at this point, the following patch will need this feature to
optimize the write protection of many pages.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:49:58 +03:00
Takuya Yoshikawa a0ed46073c KVM: MMU: Split the main body of rmap_write_protect() off from others
We will use this in the following patch to implement another function
which needs to write protect pages using the rmap information.

Note that there is a small change in debug printing for large pages:
we do not differentiate them from others to avoid duplicating code.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:49:56 +03:00
Davidlohr Bueso 4d6931c380 KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask
The reset_rsvds_bits_mask() function can use the guest walker's root level
number instead of using a separate 'level' variable.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:13:54 +02:00
Takuya Yoshikawa db3fe4eb45 KVM: Introduce kvm_memory_slot::arch and move lpage_info into it
Some members of kvm_memory_slot are not used by every architecture.

This patch is the first step to make this difference clear by
introducing kvm_memory_slot::arch;  lpage_info is moved into it.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:22 +02:00
Takuya Yoshikawa fb03cb6f44 KVM: Introduce gfn_to_index() which returns the index for a given level
This patch cleans up the code and removes the "(void)level;" warning
suppressor.

Note that we can also use this for PT_PAGE_TABLE_LEVEL to treat every
level uniformly later.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-08 14:10:19 +02:00
Takuya Yoshikawa e4b35cc960 KVM: MMU: Remove unused kvm parameter from rmap_next()
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:43 +02:00
Takuya Yoshikawa 9373e2c057 KVM: MMU: Remove unused kvm parameter from __gfn_to_rmap()
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:42 +02:00
Davidlohr Bueso 4a58ae614a KVM: MMU: unnecessary NX state assignment
We can remove the first ->nx state assignment since it is assigned afterwards anyways.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:21 +02:00
Xiao Guangrong a138fe7535 KVM: MMU: remove the redundant get_written_sptes
get_written_sptes is called twice in kvm_mmu_pte_write, one of them can be
removed

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:18 +02:00
Takuya Yoshikawa 6addd1aa2c KVM: MMU: Add missing large page accounting to drop_large_spte()
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:18 +02:00
Takuya Yoshikawa 37178b8bf0 KVM: MMU: Remove for_each_unsync_children() macro
There is only one user of it and for_each_set_bit() does the same.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-03-05 14:52:17 +02:00
Rusty Russell 476bc0015b module_param: make bool parameters really bool (arch)
module_param(bool) used to counter-intuitively take an int.  In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.

It's time to remove the int/unsigned int option.  For this version
it'll simply give a warning, but it'll break next kernel version.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-13 09:32:18 +10:30
Jan Kiszka 3d56cbdf35 KVM: MMU: Drop unused return value of kvm_mmu_remove_some_alloc_mmu_pages
freed_pages is never evaluated, so remove it as well as the return code
kvm_mmu_remove_some_alloc_mmu_pages so far delivered to its only user.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:15 +02:00
Xiao Guangrong e37fa7853c KVM: MMU: audit: inline audit function
inline audit function and little cleanup

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:12 +02:00
Xiao Guangrong d750ea2886 KVM: MMU: remove oos_shadow parameter
The unsync code should be stable now, maybe it is the time to remove this
parameter to cleanup the code a little bit

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:10 +02:00
Xiao Guangrong e459e3228d KVM: MMU: move the relevant mmu code to mmu.c
Move the mmu code in kvm_arch_vcpu_init() to kvm_mmu_create()

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:09 +02:00
Xiao Guangrong 0375f7fad9 KVM: MMU: audit: replace mmu audit tracepoint with jump-label
The tracepoint is only used to audit mmu code, it should not be exposed to
user, let us replace it with jump-label.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:05 +02:00
Xiao Guangrong be6ba0f096 KVM: introduce kvm_for_each_memslot macro
Introduce kvm_for_each_memslot to walk all valid memslot

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:37 +02:00
Xiao Guangrong 93a5cef07d KVM: introduce KVM_MEM_SLOTS_NUM macro
Introduce KVM_MEM_SLOTS_NUM macro to instead of
KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:34 +02:00
Takuya Yoshikawa 95d4c16ce7 KVM: Optimize dirty logging by rmap_write_protect()
Currently, write protecting a slot needs to walk all the shadow pages
and checks ones which have a pte mapping a page in it.

The walk is overly heavy when dirty pages in that slot are not so many
and checking the shadow pages would result in unwanted cache pollution.

To mitigate this problem, we use rmap_write_protect() and check only
the sptes which can be reached from gfns marked in the dirty bitmap
when the number of dirty pages are less than that of shadow pages.

This criterion is reasonable in its meaning and worked well in our test:
write protection became some times faster than before when the ratio of
dirty pages are low and was not worse even when the ratio was near the
criterion.

Note that the locking for this write protection becomes fine grained.
The reason why this is safe is descripted in the comments.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:20 +02:00
Takuya Yoshikawa 9b9b149236 KVM: MMU: Split gfn_to_rmap() into two functions
rmap_write_protect() calls gfn_to_rmap() for each level with gfn fixed.
This results in calling gfn_to_memslot() repeatedly with that gfn.

This patch introduces __gfn_to_rmap() which takes the slot as an
argument to avoid this.

This is also needed for the following dirty logging optimization.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:17 +02:00
Takuya Yoshikawa d6eebf8b80 KVM: MMU: Clean up BUG_ON() conditions in rmap_write_protect()
Remove redundant checks and use is_large_pte() macro.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:13 +02:00
Chris Wright fb92045843 KVM: MMU: remove KVM host pv mmu support
The host side pv mmu support has been marked for feature removal in
January 2011.  It's not in use, is slower than shadow or hardware
assisted paging, and a maintenance burden.  It's November 2011, time to
remove it.

Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:10 +02:00
Xiao Guangrong a30f47cb15 KVM: MMU: improve write flooding detected
Detecting write-flooding does not work well, when we handle page written, if
the last speculative spte is not accessed, we treat the page is
write-flooding, however, we can speculative spte on many path, such as pte
prefetch, page synced, that means the last speculative spte may be not point
to the written page and the written page can be accessed via other sptes, so
depends on the Accessed bit of the last speculative spte is not enough

Instead of detected page accessed, we can detect whether the spte is accessed
after it is written, if the spte is not accessed but it is written frequently,
we treat is not a page table or it not used for a long time

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:02 +02:00
Xiao Guangrong 5d9ca30e96 KVM: MMU: fix detecting misaligned accessed
Sometimes, we only modify the last one byte of a pte to update status bit,
for example, clear_bit is used to clear r/w bit in linux kernel and 'andb'
instruction is used in this function, in this case, kvm_mmu_pte_write will
treat it as misaligned access, and the shadow page table is zapped

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:01 +02:00
Xiao Guangrong 889e5cbced KVM: MMU: split kvm_mmu_pte_write function
kvm_mmu_pte_write is too long, we split it for better readable

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:59 +02:00
Xiao Guangrong f8734352c6 KVM: MMU: remove unnecessary kvm_mmu_free_some_pages
In kvm_mmu_pte_write, we do not need to alloc shadow page, so calling
kvm_mmu_free_some_pages is really unnecessary

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:58 +02:00
Xiao Guangrong f57f2ef58f KVM: MMU: fast prefetch spte on invlpg path
Fast prefetch spte for the unsync shadow page on invlpg path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:56 +02:00
Xiao Guangrong 505aef8f30 KVM: MMU: cleanup FNAME(invlpg)
Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the
same code between FNAME(invlpg) and FNAME(sync_page)

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:54 +02:00
Xiao Guangrong d01f8d5e02 KVM: MMU: do not mark accessed bit on pte write path
In current code, the accessed bit is always set when page fault occurred,
do not need to set it on pte write path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:53 +02:00
Xiao Guangrong 1cb3f3ae5a KVM: x86: retry non-page-table writing instructions
If the emulation is caused by #PF and it is non-page_table writing instruction,
it means the VM-EXIT is caused by shadow page protected, we can zap the shadow
page and retry this instruction directly

The idea is from Avi

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:50 +02:00
Xiao Guangrong f759e2b4c7 KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write
kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
function when spte is prefetched, unfortunately, we can not know how many
spte need to be prefetched on this path, that means we can use out of the
free  pte_list_desc object in the cache, and BUG_ON() is triggered, also some
path does not fill the cache, such as INS instruction emulated that does not
trigger page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:47 +02:00
Avi Kivity e4e517b4be KVM: MMU: Do not unconditionally read PDPTE from guest memory
Architecturally, PDPTEs are cached in the PDPTRs when CR3 is reloaded.
On SVM, it is not possible to implement this, but on VMX this is possible
and was indeed implemented until nested SVM changed this to unconditionally
read PDPTEs dynamically.  This has noticable impact when running PAE guests.

Fix by changing the MMU to read PDPTRs from the cache, falling back to
reading from memory for the nested MMU.

Signed-off-by: Avi Kivity <avi@redhat.com>
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:18:01 +03:00
Zhao Jin 41bc3186b3 KVM: MMU: fix incorrect return of spte
__update_clear_spte_slow should return original spte while the
current code returns low half of original spte combined with high
half of new spte.

Signed-off-by: Zhao Jin <cronozhj@gmail.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-09-25 19:13:25 +03:00
Linus Torvalds d3ec4844d4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  fs: Merge split strings
  treewide: fix potentially dangerous trailing ';' in #defined values/expressions
  uwb: Fix misspelling of neighbourhood in comment
  net, netfilter: Remove redundant goto in ebt_ulog_packet
  trivial: don't touch files that are removed in the staging tree
  lib/vsprintf: replace link to Draft by final RFC number
  doc: Kconfig: `to be' -> `be'
  doc: Kconfig: Typo: square -> squared
  doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
  drivers/net: static should be at beginning of declaration
  drivers/media: static should be at beginning of declaration
  drivers/i2c: static should be at beginning of declaration
  XTENSA: static should be at beginning of declaration
  SH: static should be at beginning of declaration
  MIPS: static should be at beginning of declaration
  ARM: static should be at beginning of declaration
  rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
  Update my e-mail address
  PCIe ASPM: forcedly -> forcibly
  gma500: push through device driver tree
  ...

Fix up trivial conflicts:
 - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
 - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
 - drivers/net/r8169.c (just context changes)
2011-07-25 13:56:39 -07:00
Xiao Guangrong 4f0226482d KVM: MMU: trace mmio page fault
Add tracepoints to trace mmio page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:41 +03:00
Xiao Guangrong ce88decffd KVM: MMU: mmio page fault support
The idea is from Avi:

| We could cache the result of a miss in an spte by using a reserved bit, and
| checking the page fault error code (or seeing if we get an ept violation or
| ept misconfiguration), so if we get repeated mmio on a page, we don't need to
| search the slot list/tree.
| (https://lkml.org/lkml/2011/2/22/221)

When the page fault is caused by mmio, we cache the info in the shadow page
table, and also set the reserved bits in the shadow page table, so if the mmio
is caused again, we can quickly identify it and emulate it directly

Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it
can be reduced by this feature, and also avoid walking guest page table for
soft mmu.

[jan: fix operator precedence issue]

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:40 +03:00
Xiao Guangrong dd3bfd59db KVM: MMU: reorganize struct kvm_shadow_walk_iterator
Reorganize it for good using the cache

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:39 +03:00
Xiao Guangrong c2a2ac2b56 KVM: MMU: lockless walking shadow page table
Use rcu to protect shadow pages table to be freed, so we can safely walk it,
it should run fastly and is needed by mmio page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:38 +03:00
Xiao Guangrong 603e0651cf KVM: MMU: do not need atomicly to set/clear spte
Now, the spte is just from nonprsent to present or present to nonprsent, so
we can use some trick to set/clear spte non-atomicly as linux kernel does

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:37 +03:00
Xiao Guangrong 1df9f2dc39 KVM: MMU: introduce the rules to modify shadow page table
Introduce some interfaces to modify spte as linux kernel does:
- mmu_spte_clear_track_bits, it set the spte from present to nonpresent, and
  track the stat bits(accessed/dirty) of spte
- mmu_spte_clear_no_track, the same as mmu_spte_clear_track_bits except
  tracking the stat bits
- mmu_spte_set, set spte from nonpresent to present
- mmu_spte_update, only update the stat bits

Now, it does not allowed to set spte from present to present, later, we can
drop the atomicly opration for X86_32 host, and it is the preparing work to
get spte on X86_32 host out of the mmu lock

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:36 +03:00
Xiao Guangrong d7c55201e6 KVM: MMU: abstract some functions to handle fault pfn
Introduce handle_abnormal_pfn to handle fault pfn on page fault path,
introduce mmu_invalid_pfn to handle fault pfn on prefetch path

It is the preparing work for mmio page fault support

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:35 +03:00
Xiao Guangrong fce92dce79 KVM: MMU: filter out the mmio pfn from the fault pfn
If the page fault is caused by mmio, the gfn can not be found in memslots, and
'bad_pfn' is returned on gfn_to_hva path, so we can use 'bad_pfn' to identify
the mmio page fault.
And, to clarify the meaning of mmio pfn, we return fault page instead of bad
page when the gfn is not allowd to prefetch

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:34 +03:00
Xiao Guangrong c37079586f KVM: MMU: remove bypass_guest_pf
The idea is from Avi:
| Maybe it's time to kill off bypass_guest_pf=1.  It's not as effective as
| it used to be, since unsync pages always use shadow_trap_nonpresent_pte,
| and since we convert between the two nonpresent_ptes during sync and unsync.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:33 +03:00
Xiao Guangrong bd4c86eaa6 KVM: MMU: split kvm_mmu_free_page
Split kvm_mmu_free_page to kvm_mmu_isolate_page and
kvm_mmu_free_page

One is used to remove the page from cache under mmu lock and the other is
used to free page table out of mmu lock

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:32 +03:00
Xiao Guangrong aa6bd187af KVM: MMU: count used shadow pages on prepareing path
Move counting used shadow pages from commiting path to preparing path to
reduce tlb flush on some paths

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:31 +03:00
Xiao Guangrong b90a0e6c81 KVM: MMU: rename 'pt_write' to 'emulate'
If 'pt_write' is true, we need to emulate the fault. And in later patch, we
need to emulate the fault even though it is not a pt_write event, so rename
it to better fit the meaning

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:30 +03:00
Xiao Guangrong 640d9b0dbe KVM: MMU: optimize to handle dirty bit
If dirty bit is not set, we can make the pte access read-only to avoid handing
dirty bit everywhere

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:27 +03:00
Xiao Guangrong bebb106a5a KVM: MMU: cache mmio info on page fault path
If the page fault is caused by mmio, we can cache the mmio info, later, we do
not need to walk guest page table and quickly know it is a mmio fault while we
emulate the mmio instruction

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:26 +03:00
Xiao Guangrong ffb61bb3bc KVM: MMU: do not update slot bitmap if spte is nonpresent
Set slot bitmap only if the spte is present

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:24 +03:00
Xiao Guangrong 052331bea3 KVM: MMU: fix walking shadow page table
Properly check the last mapping, and do not walk to the next level if last spte
is met

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-24 11:50:23 +03:00
Marcelo Tosatti f8f7e5ee10 Revert "KVM: MMU: make kvm_mmu_reset_context() flush the guest TLB"
This reverts commit bee931d31e588b8eb86b7edee32fac2d16930cd7.

TLB flush should be done lazily during guest entry, in
kvm_mmu_load().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 13:16:41 +03:00
Avi Kivity 45bd07b9d5 KVM: MMU: make kvm_mmu_reset_context() flush the guest TLB
kvm_set_cr0() and kvm_set_cr4(), and possible other functions,
assume that kvm_mmu_reset_context() flushes the guest TLB.  However,
it does not.

Fix by flushing the tlb (and syncing the new root as well).

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:27 +03:00
Avi Kivity 411c588dfb KVM: MMU: Adjust shadow paging to work when SMEP=1 and CR0.WP=0
When CR0.WP=0, we sometimes map user pages as kernel pages (to allow
the kernel to write to them).  Unfortunately this also allows the kernel
to fetch from these pages, even if CR4.SMEP is set.

Adjust for this by also setting NX on the spte in these circumstances.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-07-12 13:16:26 +03:00
Xiao Guangrong bcdd9a93c5 KVM: MMU: cleanup for dropping parent pte
Introduce drop_parent_pte to remove the rmap of parent pte and
clear parent pte

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:07 +03:00
Xiao Guangrong 38e3b2b28c KVM: MMU: cleanup for kvm_mmu_page_unlink_children
Cleanup the same operation between kvm_mmu_page_unlink_children and
mmu_pte_write_zap_pte

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:07 +03:00
Xiao Guangrong 67052b3508 KVM: MMU: remove the arithmetic of parent pte rmap
Parent pte rmap and page rmap are very similar, so use the same arithmetic
for them

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:07 +03:00
Xiao Guangrong 53c07b1878 KVM: MMU: abstract the operation of rmap
Abstract the operation of rmap to spte_list, then we can use it for the
reverse mapping of parent pte in the later patch

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:06 +03:00
Xiao Guangrong 332b207d65 KVM: MMU: optimize pte write path if don't have protected sp
Simply return from kvm_mmu_pte_write path if no shadow page is
write-protected, then we can avoid to walk all shadow pages and hold
mmu-lock

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-07-12 11:45:02 +03:00
Jiri Kosina b7e9c223be Merge branch 'master' into for-next
Sync with Linus' tree to be able to apply pending patches that
are based on newer code already present upstream.
2011-07-11 14:15:55 +02:00
Vitaliy Ivanov e44ba033c5 treewide: remove duplicate includes
Many stupid corrections of duplicated includes based on the output of
scripts/checkincludes.pl.

Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-06-20 16:08:19 +02:00
Steve a0a8eaba16 KVM: MMU: fix opposite condition in mapping_level_dirty_bitmap
The condition is opposite, it always maps huge page for the dirty tracked page

Reported-by: Steve <stefan.bosak@gmail.com>
Signed-off-by: Steve <stefan.bosak@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-06-19 19:23:13 +03:00
Ying Han 1495f230fa vmscan: change shrinker API by passing shrink_control struct
Change each shrinker's API by consolidating the existing parameters into
shrink_control struct.  This will simplify any further features added w/o
touching each file of shrinker.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix warning]
[kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
[akpm@linux-foundation.org: fix xfs warning]
[akpm@linux-foundation.org: update gfs2]
Signed-off-by: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-25 08:39:26 -07:00
Xiao Guangrong 7c5625227f KVM: MMU: remove mmu_seq verification on pte update path
The mmu_seq verification can be removed since we get the pfn in the
protection of mmu_lock.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-05-11 07:57:03 -04:00
Xiao Guangrong 0f53b5b1c0 KVM: MMU: cleanup pte write path
This patch does:
- call vcpu->arch.mmu.update_pte directly
- use gfn_to_pfn_atomic in update_pte path

The suggestion is from Avi.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:35 -03:00
Xiao Guangrong 5d163b1c9d KVM: MMU: introduce a common function to get no-dirty-logged slot
Cleanup the code of pte_prefetch_gfn_to_memslot and mapping_level_dirty_bitmap

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:34 -03:00
Xiao Guangrong 676646ee4b KVM: MMU: remove unused macros
These macros are not used, so removed

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:32 -03:00
Xiao Guangrong 842f22ed9b KVM: MMU: cleanup page alloc and free
Using __get_free_page instead of alloc_page and page_address,
using free_page instead of __free_page and virt_to_page

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:32 -03:00
Xiao Guangrong 49b26e26e4 KVM: MMU: do not record gfn in kvm_mmu_pte_write
No need to record the gfn to verifier the pte has the same mode as
current vcpu, it's because we only speculatively update the pte only
if the pte and vcpu have the same mode

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:32 -03:00
Xiao Guangrong 1b7fd45c32 KVM: MMU: set spte accessed bit properly
Set spte accessed bit only if guest_initiated == 1 that means the really
accessed

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:32 -03:00
Xiao Guangrong da8dc75f0c KVM: MMU: fix kvm_mmu_slot_remove_write_access dropping intermediate W bits
Only remove write access in the last sptes.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:32 -03:00
Jan Kiszka e935b8372c KVM: Convert kvm_lock to raw_spinlock
Code under this lock requires non-preemptibility. Ensure this also over
-rt by converting it to raw spinlock.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-03-17 13:08:30 -03:00
Avi Kivity 8234b22e1c KVM: MMU: Don't flush shadow when enabling dirty tracking
Instead, drop large mappings, which were the reason we dropped shadow.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-03-17 13:08:24 -03:00
Andrea Arcangeli 8ee53820ed thp: mmu_notifier_test_young
For GRU and EPT, we need gup-fast to set referenced bit too (this is why
it's correct to return 0 when shadow_access_mask is zero, it requires
gup-fast to set the referenced bit).  qemu-kvm access already sets the
young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow
paging EPT minor fault we relay on gup-fast to signal the page is in
use...

We also need to check the young bits on the secondary pagetables for NPT
and not nested shadow mmu as the data may never get accessed again by the
primary pte.

Without this closer accuracy, we'd have to remove the heuristic that
avoids collapsing hugepages in hugepage virtual regions that have not even
a single subpage in use.

->test_young is full backwards compatible with GRU and other usages that
don't have young bits in pagetables set by the hardware and that should
nuke the secondary mmu mappings when ->clear_flush_young runs just like
EPT does.

Removing the heuristic that checks the young bit in
khugepaged/collapse_huge_page completely isn't so bad either probably but
I thought it was worth it and this makes it reliable.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:46 -08:00
Andrea Arcangeli 936a5fe6e6 thp: kvm mmu transparent hugepage support
This should work for both hugetlbfs and transparent hugepages.

[akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:41 -08:00
Xiao Guangrong f8e453b00c KVM: MMU: handle 'map_writable' in set_spte() function
Move the operation of 'writable' to set_spte() to clean up code

[avi: remove unneeded booleanification]

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:31:19 +02:00
Xiao Guangrong b034cf0105 KVM: MMU: audit: allow audit more guests at the same time
It only allows to audit one guest in the system since:
- 'audit_point' is a glob variable
- mmu_audit_disable() is called in kvm_mmu_destroy(), so audit is disabled
  after a guest exited

this patch fix those issues then allow to audit more guests at the same time

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:31:17 +02:00
Avi Kivity 9f8fe5043f KVM: Replace reads of vcpu->arch.cr3 by an accessor
This allows us to keep cr3 in the VMCS, later on.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:31:15 +02:00
Marcelo Tosatti e49146dce8 KVM: MMU: only write protect mappings at pagetable level
If a pagetable contains a writeable large spte, all of its sptes will be
write protected, including non-leaf ones, leading to endless pagefaults.

Do not write protect pages above PT_PAGE_TABLE_LEVEL, as the spte fault
paths assume non-leaf sptes are writable.

Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:31:13 +02:00
Avi Kivity c445f8ef43 KVM: MMU: Initialize base_role for tdp mmus
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:31:11 +02:00
Andre Przywara dc25e89e07 KVM: SVM: copy instruction bytes from VMCB
In case of a nested page fault or an intercepted #PF newer SVM
implementations provide a copy of the faulting instruction bytes
in the VMCB.
Use these bytes to feed the instruction emulator and avoid the costly
guest instruction fetch in this case.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:31:07 +02:00
Andre Przywara 51d8b66199 KVM: cleanup emulate_instruction
emulate_instruction had many callers, but only one used all
parameters. One parameter was unused, another one is now
hidden by a wrapper function (required for a future addition
anyway), so most callers use now a shorter parameter list.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:31:00 +02:00
Takuya Yoshikawa d4dbf47009 KVM: MMU: Make the way of accessing lpage_info more generic
Large page information has two elements but one of them, write_count, alone
is accessed by a helper function.

This patch replaces this helper function with more generic one which returns
newly named kvm_lpage_info structure and use it to access the other element
rmap_pde.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:47 +02:00
Xiao Guangrong fb67e14fc9 KVM: MMU: retry #PF for softmmu
Retry #PF for softmmu only when the current vcpu has the same cr3 as the time
when #PF occurs

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:41 +02:00
Xiao Guangrong 2ec4739ddc KVM: MMU: fix accessed bit set on prefault path
Retry #PF is the speculative path, so don't set the accessed bit

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:40 +02:00
Xiao Guangrong 78b2c54aa4 KVM: MMU: rename 'no_apf' to 'prefault'
It's the speculative path if 'no_apf = 1' and we will specially handle this
speculative path in the later patch, so 'prefault' is better to fit the sense.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:38 +02:00
Takuya Yoshikawa 700e1b1219 KVM: MMU: Avoid dropping accessed bit while removing write access
One more "KVM: MMU: Don't drop accessed bit while updating an spte."

Sptes are accessed by both kvm and hardware.
This patch uses update_spte() to fix the way of removing write access.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-01-12 11:30:21 +02:00
Avi Kivity 6389ee9463 KVM: Pull extra page fault information into struct x86_exception
Currently page fault cr2 and nesting infomation are carried outside
the fault data structure.  Instead they are placed in the vcpu struct,
which results in confusion as global variables are manipulated instead
of passing parameters.

Fix this issue by adding address and nested fields to struct x86_exception,
so this struct can carry all information associated with a fault.

Signed-off-by: Avi Kivity <avi@redhat.com>
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Tested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:30:02 +02:00
Avi Kivity ab9ae31387 KVM: Push struct x86_exception info the various gva_to_gpa variants
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:29:59 +02:00
Xiao Guangrong 407c61c6bd KVM: MMU: abstract invalid guest pte mapping
Introduce a common function to map invalid gpte

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:29:49 +02:00
Xiao Guangrong a4a8e6f76e KVM: MMU: remove 'clear_unsync' parameter
Remove it since we can judge it by using sp->unsync

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-01-12 11:29:48 +02:00