Allow for a device which is being emulated at L0 (the host) for an L1
guest to be passed through to a nested (L2) guest.
The existing kvmppc_hv_emulate_mmio function can be used here. The main
challenge is that for a load the result must be stored into the L2 gpr,
not an L1 gpr as would normally be the case after going out to qemu to
complete the operation. This presents a challenge as at this point the
L2 gpr state has been written back into L1 memory.
To work around this we store the address in L1 memory of the L2 gpr
where the result of the load is to be stored and use the new io_gpr
value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for
which completion must be done when returning back into the kernel. Then
in kvmppc_complete_mmio_load() the resultant value is written into L1
memory at the location of the indicated L2 gpr.
Note that we don't currently let an L1 guest emulate a device for an L2
guest which is then passed through to an L3 guest.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The functions kvmppc_st and kvmppc_ld are used to access guest memory
from the host using a guest effective address. They do so by translating
through the process table to obtain a guest real address and then using
kvm_read_guest or kvm_write_guest to make the access with the guest real
address.
This method of access however only works for L1 guests and will give the
incorrect results for a nested guest.
We can however use the store_to_eaddr and load_from_eaddr kvmppc_ops to
perform the access for a nested guesti (and a L1 guest). So attempt this
method first and fall back to the old method if this fails and we aren't
running a nested guest.
At this stage there is no fall back method to perform the access for a
nested guest and this is left as a future improvement. For now we will
return to the nested guest and rely on the fact that a translation
should be faulted in before retrying the access.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The kvmppc_ops struct is used to store function pointers to kvm
implementation specific functions.
Introduce two new functions load_from_eaddr and store_to_eaddr to be
used to load from and store to a guest effective address respectively.
Also implement these for the kvm-hv module. If we are using the radix
mmu then we can call the functions to access quadrant 1 and 2.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The POWER9 radix mmu has the concept of quadrants. The quadrant number
is the two high bits of the effective address and determines the fully
qualified address to be used for the translation. The fully qualified
address consists of the effective lpid, the effective pid and the
effective address. This gives then 4 possible quadrants 0, 1, 2, and 3.
When accessing these quadrants the fully qualified address is obtained
as follows:
Quadrant | Hypervisor | Guest
--------------------------------------------------------------------------
| EA[0:1] = 0b00 | EA[0:1] = 0b00
0 | effLPID = 0 | effLPID = LPIDR
| effPID = PIDR | effPID = PIDR
--------------------------------------------------------------------------
| EA[0:1] = 0b01 |
1 | effLPID = LPIDR | Invalid Access
| effPID = PIDR |
--------------------------------------------------------------------------
| EA[0:1] = 0b10 |
2 | effLPID = LPIDR | Invalid Access
| effPID = 0 |
--------------------------------------------------------------------------
| EA[0:1] = 0b11 | EA[0:1] = 0b11
3 | effLPID = 0 | effLPID = LPIDR
| effPID = 0 | effPID = 0
--------------------------------------------------------------------------
In the Guest;
Quadrant 3 is normally used to address the operating system since this
uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
be switched.
Quadrant 0 is normally used to address user space since the effLPID and
effPID are taken from the corresponding registers.
In the Host;
Quadrant 0 and 3 are used as above, however the effLPID is always 0 to
address the host.
Quadrants 1 and 2 can be used by the host to address guest memory using
a guest effective address. Since the effLPID comes from the LPID register,
the host loads the LPID of the guest it would like to access (and the
PID of the process) and can perform accesses to a guest effective
address.
This means quadrant 1 can be used to address the guest user space and
quadrant 2 can be used to address the guest operating system from the
hypervisor, using a guest effective address.
Access to the quadrants can cause a Hypervisor Data Storage Interrupt
(HDSI) due to being unable to perform partition scoped translation.
Previously this could only be generated from a guest and so the code
path expects us to take the KVM trampoline in the interrupt handler.
This is no longer the case so we modify the handler to call
bad_page_fault() to check if we were expecting this fault so we can
handle it gracefully and just return with an error code. In the hash mmu
case we still raise an unknown exception since quadrants aren't defined
for the hash mmu.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
There exists a function kvm_is_radix() which is used to determine if a
kvm instance is using the radix mmu. However this only applies to the
first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can
be used to determine if the current execution context of the vcpu is
radix, accounting for if the vcpu is running a nested guest.
Currently all nested guests must be radix but this may change in the
future.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
availability of in kernel tce acceleration for vfio. However it is
currently the case that this is only available on a powernv machine,
not for a pseries machine.
Thus make this capability dependent on having the cpu feature
CPU_FTR_HVMODE.
[paulus@ozlabs.org - fixed compilation for Book E.]
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to flush the partition-scoped page tables for a radix
guest when dirty tracking is turned on or off for a memslot. Only the
guest real addresses covered by the memslot are flushed. The reason
for this is to get rid of any 2M PTEs in the partition-scoped page
tables that correspond to host transparent huge pages, so that page
dirtiness is tracked at a system page (4k or 64k) granularity rather
than a 2M granularity. The page tables are also flushed when turning
dirty tracking off so that the memslot's address space can be
repopulated with THPs if possible.
To do this, we add a new function kvmppc_radix_flush_memslot(). Since
this does what's needed for kvmppc_core_flush_memslot_hv() on a radix
guest, we now make kvmppc_core_flush_memslot_hv() call the new
kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix()
for each page in the memslot. This has the effect of fixing a bug in
that kvmppc_core_flush_memslot_hv() was previously calling
kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which
is required to be held.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds 'const' to the declarations for the struct kvm_memory_slot
pointer parameters of some functions, which will make it possible to
call those functions from kvmppc_core_commit_memory_region_hv()
in the next patch.
This also fixes some comments about locking.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
For radix guests, this makes KVM map guest memory as individual pages
when dirty page logging is enabled for the memslot corresponding to the
guest real address. Having a separate partition-scoped PTE for each
system page mapped to the guest means that we have a separate dirty
bit for each page, thus making the reported dirty bitmap more accurate.
Without this, if part of guest memory is backed by transparent huge
pages, the dirty status is reported at a 2MB granularity rather than
a 64kB (or 4kB) granularity for that part, causing userspace to have
to transmit more data when migrating the guest.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently, kvm_arch_commit_memory_region() gets called with a
parameter indicating what type of change is being made to the memslot,
but it doesn't pass it down to the platform-specific memslot commit
functions. This adds the `change' parameter to the lower-level
functions so that they can use it in future.
[paulus@ozlabs.org - fix book E also.]
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When booting a kvm-pr guest on a POWER9 machine the following message is
observed:
"qemu-system-ppc64: KVM does not support 1TiB segments which guest expects"
This is because the guest is expecting to be able to use 1T segments
however we don't indicate support for it. This is because we don't set
the BOOK3S_HFLAG_MULTI_PGSIZE flag in the hflags in kvmppc_set_pvr_pr()
on POWER9.
POWER9 does indeed have support for 1T segments, so add a case for
POWER9 to the switch statement to ensure it is set.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Testing has revealed an occasional crash which appears to be caused
by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv.
The symptom is a NULL pointer dereference in __find_linux_pte() called
from kvm_unmap_radix() with kvm->arch.pgtable == NULL.
Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear
kvm->arch.pgtable (via kvmppc_free_radix()) before setting
kvm->arch.radix to NULL, and there is nothing to prevent
kvm_unmap_hva_range_hv() or the other MMU callback functions from
being called concurrently with kvmppc_switch_mmu_to_hpt() or
kvmppc_switch_mmu_to_radix().
This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock
around the assignments to kvm->arch.radix, and makes sure that the
partition-scoped radix tree or HPT is only freed after changing
kvm->arch.radix.
This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure
that the clearing of each rmap array (one per memslot) doesn't happen
concurrently with use of the array in the kvm_unmap_hva_range_hv()
or the other MMU callbacks.
Fixes: 18c3640cef ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Previously, we only called indirect_branch_prediction_barrier on the
logical CPU that freed a vmcb. This function should be called on all
logical CPUs that last loaded the vmcb in question.
Fixes: 15d4507152 ("KVM/x86: Add IBPB support")
Reported-by: Neel Natu <neelnatu@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a guest page table is updated via an emulated write,
kvm_mmu_pte_write() is called to update the shadow PTE using the just
written guest PTE value. But if two emulated guest PTE writes happened
concurrently, it is possible that the guest PTE and the shadow PTE end
up being out of sync. Emulated writes do not mark the shadow page as
unsync-ed, so this inconsistency will not be resolved even by a guest TLB
flush (unless the page was marked as unsync-ed at some other point).
This is fixed by re-reading the current value of the guest PTE after the
MMU lock has been acquired instead of just using the value that was
written prior to calling kvm_mmu_pte_write().
Signed-off-by: Junaid Shahid <junaids@google.com>
Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmcs12 represents the per-CPU cache of L1 active vmcs12.
This cache can be loaded by one of the following:
1) Guest making a vmcs12 active by exeucting VMPTRLD
2) Guest specifying eVMCS in VP assist page and executing
VMLAUNCH/VMRESUME.
Either way, vmcs12 should have revision_id of VMCS12_REVISION.
Which is not equal to eVMCS revision_id which specifies used
VersionNumber of eVMCS struct (e.g. KVM_EVMCS_VERSION).
Specifically, this causes an issue in restoring a nested VM state
because vmx_set_nested_state() verifies that vmcs12->revision_id
is equal to VMCS12_REVISION which was not true in case vmcs12
was populated from an eVMCS by vmx_get_nested_state() which calls
copy_enlightened_to_vmcs12().
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to TLFS section 16.11.2 Enlightened VMCS, the first u32
field of eVMCS should specify eVMCS VersionNumber.
This version should be in the range of supported eVMCS versions exposed
to guest via CPUID.0x4000000A.EAX[0:15].
The range which KVM expose to guest in this CPUID field should be the
same as the value returned in vmcs_version by nested_enable_evmcs().
According to the above, eVMCS VMPTRLD should verify that version specified
in given eVMCS is in the supported range. However, current code
mistakenly verfies this field against VMCS12_REVISION.
One can also see that when KVM use eVMCS, it makes sure that
alloc_vmcs_cpu() sets allocated eVMCS revision_id to KVM_EVMCS_VERSION.
Obvious fix should just change eVMCS VMPTRLD to verify first u32 field
of eVMCS is equal to KVM_EVMCS_VERSION.
However, it turns out that Microsoft Hyper-V fails to comply to their
own invented interface: When Hyper-V use eVMCS, it just sets first u32
field of eVMCS to revision_id specified in MSR_IA32_VMX_BASIC (In our
case: VMCS12_REVISION). Instead of used eVMCS version number which is
one of the supported versions specified in CPUID.0x4000000A.EAX[0:15].
To overcome Hyper-V bug, we accept either a supported eVMCS version
or VMCS12_REVISION as valid values for first u32 field of eVMCS.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since commit e79f245dde ("X86/KVM: Properly update 'tsc_offset' to
represent the running guest"), vcpu->arch.tsc_offset meaning was
changed to always reflect the tsc_offset value set on active VMCS.
Regardless if vCPU is currently running L1 or L2.
However, above mentioned commit failed to also change
kvm_vcpu_write_tsc_offset() to set vcpu->arch.tsc_offset correctly.
This is because vmx_write_tsc_offset() could set the tsc_offset value
in active VMCS to given offset parameter *plus vmcs12->tsc_offset*.
However, kvm_vcpu_write_tsc_offset() just sets vcpu->arch.tsc_offset
to given offset parameter. Without taking into account the possible
addition of vmcs12->tsc_offset. (Same is true for SVM case).
Fix this issue by changing kvm_x86_ops->write_tsc_offset() to return
actually set tsc_offset in active VMCS and modify
kvm_vcpu_write_tsc_offset() to set returned value in
vcpu->arch.tsc_offset.
In addition, rename write_tsc_offset() callback to write_l1_tsc_offset()
to make it clear that it is meant to set L1 TSC offset.
Fixes: e79f245dde ("X86/KVM: Properly update 'tsc_offset' to represent the running guest")
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Leonid Shatz <leonid.shatz@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The inline keyword which is not at the beginning of the function
declaration may trigger the following build warnings, so let's fix it:
arch/x86/kvm/vmx.c:1309:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
arch/x86/kvm/vmx.c:5947:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
arch/x86/kvm/vmx.c:5985:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
arch/x86/kvm/vmx.c:6023:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We get the following warnings about empty statements when building
with 'W=1':
arch/x86/kvm/lapic.c:632:53: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
arch/x86/kvm/lapic.c:1907:42: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
arch/x86/kvm/lapic.c:1936:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
arch/x86/kvm/lapic.c:1975:44: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
Rework the debug helper macro to get rid of these warnings.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When guest transitions from/to long-mode by modifying MSR_EFER.LMA,
the list of shared MSRs to be saved/restored on guest<->host
transitions is updated (See vmx_set_efer() call to setup_msrs()).
On every entry to guest, vcpu_enter_guest() calls
vmx_prepare_switch_to_guest(). This function should also take care
of setting the shared MSRs to be saved/restored. However, the
function does nothing in case we are already running with loaded
guest state (vmx->loaded_cpu_state != NULL).
This means that even when guest modifies MSR_EFER.LMA which results
in updating the list of shared MSRs, it isn't being taken into account
by vmx_prepare_switch_to_guest() because it happens while we are
running with loaded guest state.
To fix above mentioned issue, add a flag to mark that the list of
shared MSRs has been updated and modify vmx_prepare_switch_to_guest()
to set shared MSRs when running with host state *OR* list of shared
MSRs has been updated.
Note that this issue was mistakenly introduced by commit
678e315e78 ("KVM: vmx: add dedicated utility to access guest's
kernel_gs_base") because previously vmx_set_efer() always called
vmx_load_host_state() which resulted in vmx_prepare_switch_to_guest() to
set shared MSRs.
Fixes: 678e315e78 ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base")
Reported-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_pv_clock_pairing() allocates local var
"struct kvm_clock_pairing clock_pairing" on stack and initializes
all it's fields besides padding (clock_pairing.pad[]).
Because clock_pairing var is written completely (including padding)
to guest memory, failure to init struct padding results in kernel
info-leak.
Fix the issue by making sure to also init the padding with zeroes.
Fixes: 55dd00a73a ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall")
Reported-by: syzbot+a8ef68d71211ba264f56@syzkaller.appspotmail.com
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Consider the case that userspace enables KVM_CAP_HYPERV_ENLIGHTENED_VMCS twice:
1) kvm_vcpu_ioctl_enable_cap() is called to enable
KVM_CAP_HYPERV_ENLIGHTENED_VMCS which calls nested_enable_evmcs().
2) nested_enable_evmcs() sets enlightened_vmcs_enabled to true and fills
vmcs_version which is then copied to userspace.
3) kvm_vcpu_ioctl_enable_cap() is called again to enable
KVM_CAP_HYPERV_ENLIGHTENED_VMCS which calls nested_enable_evmcs().
4) This time nested_enable_evmcs() just returns 0 as
enlightened_vmcs_enabled is already true. *Without filling
vmcs_version*.
5) kvm_vcpu_ioctl_enable_cap() continues as usual and copies
*uninitialized* vmcs_version to userspace which leads to kernel info-leak.
Fix this issue by simply changing nested_enable_evmcs() to always fill
vmcs_version output argument. Even when enlightened_vmcs_enabled is
already set to true.
Note that SVM's nested_enable_evmcs() should not be modified because it
always returns a non-zero value (-ENODEV) which results in
kvm_vcpu_ioctl_enable_cap() skipping the copy of vmcs_version to
userspace (as it should).
Fixes: 57b119da35 ("KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability")
Reported-by: syzbot+cfbc368e283d381f8cef@syzkaller.appspotmail.com
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is a race condition when accessing kvm->arch.apic_access_page_done.
Due to it, x86_set_memory_region will fail when creating the second vcpu
for a svm guest.
Add a mutex_lock to serialize the accesses to apic_access_page_done.
This lock is also used by vmx for the same purpose.
Signed-off-by: Wei Wang <wawei@amazon.de>
Signed-off-by: Amadeusz Juskowiak <ajusk@amazon.de>
Signed-off-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported by syzkaller:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
PGD 800000040410c067 P4D 800000040410c067 PUD 40410d067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 2567 Comm: poc Tainted: G OE 4.19.0-rc5 #16
RIP: 0010:kvm_pv_send_ipi+0x94/0x350 [kvm]
Call Trace:
kvm_emulate_hypercall+0x3cc/0x700 [kvm]
handle_vmcall+0xe/0x10 [kvm_intel]
vmx_handle_exit+0xc1/0x11b0 [kvm_intel]
vcpu_enter_guest+0x9fb/0x1910 [kvm]
kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
do_vfs_ioctl+0xa5/0x690
ksys_ioctl+0x6d/0x80
__x64_sys_ioctl+0x1a/0x20
do_syscall_64+0x83/0x6e0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The reason is that the apic map has not yet been initialized, the testcase
triggers pv_send_ipi interface by vmcall which results in kvm->arch.apic_map
is dereferenced. This patch fixes it by checking whether or not apic map is
NULL and bailing out immediately if that is the case.
Fixes: 4180bf1b65 (KVM: X86: Implement "send IPI" hypercall)
Reported-by: Wei Wu <ww9210@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wei Wu <ww9210@gmail.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Apparently, the ple_gap parameter was accidentally removed
by commit c8e88717cf. Add it
back.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Cc: stable@vger.kernel.org
Fixes: c8e88717cf
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This has a single 1-line patch which fixes a bug in the recently-merged
nested HV KVM support.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJb7pbhAAoJEJ2a6ncsY3Gf8UIIAKgiocLz4jTrWYaR/OVbg6EY
tSJQBbsi6bEAog/FZMWDG0zL0YB4s+wXu34RiTt/P7g0VzHFTmR6ZHIJPiSd78aH
oxe8H7TOVq8/EmD0TwREVgUe1qIHgLBkVkk05b0P0nlpeO5bzWQBco2No2mfKWOq
yZcK03QWBsVaq0xhZFM/c0SkxBYOIDcm1kG+XNpOcsmWGXin96TlK+2WohOIH5nY
+16vI61n7/jBjdoxQS0Lw8OAfsA8CjY9GaKf3MuFYe93anZUv2s8FrAv35qUwzBg
5/Y/f+EB5AKMf3XR2A8nJ6HmoeXUFu4NUxT1YAQPAUcrxkENcsaRHDe2Uwt1QIk=
=iPcL
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-fixes-4.20-1' of https://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
PPC KVM fixes for 4.20
This has a single 1-line patch which fixes a bug in the recently-merged
nested HV KVM support.
- Address Range Scrub overflow continuation handling has been broken
since it was initially merged. It was only recently that error injection
and platform-BIOS support enabled this corner case to be exercised.
- The recent attempt to provide more isolation for the kernel Address
Range Scrub state machine from userapace initiated sessions triggers a
lockdep report. Revert and try again at the next merge window.
- Fix a kasan reported buffer overflow in libnvdimm unit test
infrastrucutre (nfit_test)
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb8MdUAAoJEB7SkWpmfYgCifYP/A+OQ19HybqcY2nfvqXUdQum
Q5x3qcKmGmEbKbnCUMOHZJpEjW4c/Cpm6OKhuFDQJ4tijn1XG3/ATSi7PXrZxs/o
CRK8MIg5Wz/mMvYRvypIkCHHHr9+Y1NjqmQynM4LLzNG24GMXaeHHuZUTnrZCDmu
0+jBTylNgVYdykoIxgHDYDB+cd6w4NtAP5OD9D46pdsmzX9ac+OQyZMyNB3glUhd
/ZFAoywVNfvfJVWEci9RoHiKttWxgVoCuNbSlCs2Y6ymepA44ApR9AgLHtaC9pFO
DrPkfCzPSmf4PVSxLJd79+/sw9YOcBD7LZ5IxzozxRMuRn5pIofdZIsBg9PlwT5B
NL9jQK87XPiG0vNxhJu3wzP+FlyCXxGxkWfApp7w4rlWBV7RgugOZHyH051rdKzQ
44JAPzLLCfA5Mj4o2tIbSx42f2JNX93XDEX8fkUB+qs3GzyOcMtlcmz9UjmnrT0R
o9KHKhDn81Vivxh33Ts2G0iHktO83XSUBDWApSd6erjEUXMsCLY0D8y+nDGTOMUh
kVcY8q93sgZGLVbcxt0eGc8Q7osZYawQGRGucflTETFcxNwMyLL4F9lWgPirGeYF
i1JDWeTrhcImYufNj8o78LsbT5xh6YjbZZ8Q1obIgPXpDtxHNIXO6COId49Zp2cK
obftWyVp+7kYe79NWzmD
=sfNx
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A small batch of fixes for v4.20-rc3.
The overflow continuation fix addresses something that has been broken
for several releases. Arguably it could wait even longer, but it's a
one line fix and this finishes the last of the known address range
scrub bug reports. The revert addresses a lockdep regression. The unit
tests are not critical to fix, but no reason to hold this fix back.
Summary:
- Address Range Scrub overflow continuation handling has been broken
since it was initially merged. It was only recently that error
injection and platform-BIOS support enabled this corner case to be
exercised.
- The recent attempt to provide more isolation for the kernel Address
Range Scrub state machine from userapace initiated sessions
triggers a lockdep report. Revert and try again at the next merge
window.
- Fix a kasan reported buffer overflow in libnvdimm unit test
infrastrucutre (nfit_test)"
* tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
Revert "acpi, nfit: Further restrict userspace ARS start requests"
acpi, nfit: Fix ARS overflow continuation
tools/testing/nvdimm: Fix the array size for dimm devices.
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
mm, page_alloc: check for max order in hot path
scripts/spdxcheck.py: make python3 compliant
tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
mm/vmstat.c: fix NUMA statistics updates
mm/gup.c: fix follow_page_mask() kerneldoc comment
ocfs2: free up write context when direct IO failed
scripts/faddr2line: fix location of start_kernel in comment
mm: don't reclaim inodes with many attached pages
mm, memory_hotplug: check zone_movable in has_unmovable_pages
mm/swapfile.c: use kvzalloc for swap_info_struct allocation
MAINTAINERS: update OMAP MMC entry
hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
kernel/sched/psi.c: simplify cgroup_move_task()
z3fold: fix possible reclaim races
Pull scheduler fix from Ingo Molnar:
"Fix an exec() related scalability/performance regression, which was
caused by incorrectly calculating load and migrating tasks on exec()
when they shouldn't be"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix cpu_util_wake() for 'execl' type workloads
Pull perf fixes from Ingo Molnar:
"Fix uncore PMU enumeration for CofeeLake CPUs"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Support CoffeeLake 8th CBOX
perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake CPUs
Pull EFI fixes from Ingo Molnar:
"Misc fixes: two warning splat fixes, a leak fix and persistent memory
allocation fixes for ARM"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Permit calling efi_mem_reserve_persistent() from atomic context
efi/arm: Defer persistent reservations until after paging_init()
efi/arm/libstub: Pack FDT after populating it
efi/arm: Revert deferred unmap of early memmap mapping
efi: Fix debugobjects warning on 'efi_rts_work'
Pull ARM spectre updates from Russell King:
"These are the currently known final bits that resolve the Spectre
issues. big.Little systems used to be sufficiently identical in that
there were no differences between individual CPUs in the system that
mattered to the kernel. With the advent of the Spectre problem, the
CPUs now have differences in how the workaround is applied.
As a result of previous Spectre patches, these systems ended up
reporting quite a lot of:
"CPUx: Spectre v2: incorrect context switching function, system vulnerable"
messages due to the action of the big.Little switcher causing the CPUs
to be re-initialised regularly. This series resolves that issue by
making the CPU vtable unique to each CPU.
However, since this is used very early, before per-cpu is setup,
per-cpu can't be used. We also have a problem that two of the methods
are not called from preempt-safe paths, but thankfully these remain
identical between all CPUs in the system. To make sure, we validate
that these are identical during boot"
* 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: spectre-v2: per-CPU vtables to work around big.Little systems
ARM: add PROC_VTABLE and PROC_TABLE macros
ARM: clean up per-processor check_bugs method call
ARM: split out processor lookup
ARM: make lookup_processor_type() non-__init
Konstantin has noticed that kvmalloc might trigger the following
warning:
WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
[...]
Call Trace:
fragmentation_index+0x76/0x90
compaction_suitable+0x4f/0xf0
shrink_node+0x295/0x310
node_reclaim+0x205/0x250
get_page_from_freelist+0x649/0xad0
__alloc_pages_nodemask+0x12a/0x2a0
kmalloc_large_node+0x47/0x90
__kmalloc_node+0x22b/0x2e0
kvmalloc_node+0x3e/0x70
xt_alloc_table_info+0x3a/0x80 [x_tables]
do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
nf_setsockopt+0x44/0x60
SyS_setsockopt+0x6f/0xc0
do_syscall_64+0x67/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already. This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile. A recent UBSAN report just underlines that
by the following report
UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
shift exponent 51 is too large for 32-bit type 'int'
CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xd2/0x148 lib/dump_stack.c:113
ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
__ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
__zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
zone_watermark_fast mm/page_alloc.c:3216 [inline]
get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
__alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
alloc_pages include/linux/gfp.h:509 [inline]
__get_free_pages+0x12/0x60 mm/page_alloc.c:4414
dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
__blkdev_driver_ioctl block/ioctl.c:303 [inline]
blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
block_ioctl+0x105/0x150 fs/block_dev.c:1883
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
__do_sys_ioctl fs/ioctl.c:709 [inline]
__se_sys_ioctl fs/ioctl.c:707 [inline]
__x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Note that this is not a kvmalloc path. It is just that the fast path
really depends on having sanitzed order as well. Therefore move the
order check to the fast path.
Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Kyungtae Kim <kt0755@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Byoungyoung Lee <lifeasageek@gmail.com>
Cc: "Dae R. Jeong" <threeearcat@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without this change the following happens when using Python3 (3.6.6):
$ echo "GPL-2.0" | python3 scripts/spdxcheck.py -
FAIL: 'str' object has no attribute 'decode'
Traceback (most recent call last):
File "scripts/spdxcheck.py", line 253, in <module>
parser.parse_lines(sys.stdin, args.maxlines, '-')
File "scripts/spdxcheck.py", line 171, in parse_lines
line = line.decode(locale.getpreferredencoding(False), errors='ignore')
AttributeError: 'str' object has no attribute 'decode'
So as the line is already a string, there is no need to decode it and
the line can be dropped.
/usr/bin/python on Arch is Python 3. So this would indeed be worth
going into 4.19.
Link: http://lkml.kernel.org/r/20181023070802.22558-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Perches <joe@perches.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.
man 2 lseek says
: EINVAL whence is not valid. Or: the resulting file offset would be
: negative, or beyond the end of a seekable device.
:
: ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond
: the end of the file.
Make tmpfs return ENXIO under these circumstances as well. After this,
tmpfs also passes xfstests's generic/448.
[akpm@linux-foundation.org: rewrite changelog]
Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc-8 complains about the prototype for this function:
lib/ubsan.c:432:1: error: ignoring attribute 'noreturn' in declaration of a built-in function '__ubsan_handle_builtin_unreachable' because it conflicts with attribute 'const' [-Werror=attributes]
This is actually a GCC's bug. In GCC internals
__ubsan_handle_builtin_unreachable() declared with both 'noreturn' and
'const' attributes instead of only 'noreturn':
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84210
Workaround this by removing the noreturn attribute.
[aryabinin: add information about GCC bug in changelog]
Link: http://lkml.kernel.org/r/20181107144516.4587-1-aryabinin@virtuozzo.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Olof Johansson <olof@lixom.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Scan through the whole array to see if an update is needed. While we're
at it, use sizeof() to be safe against any possible type changes in the
future.
The bug here is that we wouldn't sync per-cpu counters into global ones
if there was an update of numa_stats for higher cpus. Highly
theoretical one though because it is much more probable that zone_stats
are updated so we would refresh anyway. So I wouldn't bother to mark
this for stable, yet something nice to fix.
[mhocko@suse.com: changelog enhancement]
Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com
Fixes: 1d90ca897c ("mm: update NUMA counter threshold size")
Signed-off-by: Janne Huttunen <janne.huttunen@nokia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit df06b37ffe ("mm/gup: cache dev_pagemap while pinning pages")
modified the signature of follow_page_mask() but left the parameter
description behind.
Update the description to make the code and comments agree again.
While at it, update formatting of the return value description to match
Documentation/doc-guide/kernel-doc.rst guidelines.
Link: http://lkml.kernel.org/r/1541603316-27832-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:
ERROR: bug expression: !list_empty(&oi->ip_unwritten_list)
ERROR: Clear inode of 215043, inode has unwritten extents
...
Call Trace:
? __set_current_blocked+0x42/0x68
ocfs2_evict_inode+0x91/0x6a0 [ocfs2]
? bit_waitqueue+0x40/0x33
evict+0xdb/0x1af
iput+0x1a2/0x1f7
do_unlinkat+0x194/0x28f
SyS_unlinkat+0x1b/0x2f
do_syscall_64+0x79/0x1ae
entry_SYSCALL_64_after_hwframe+0x151/0x0
This patch also logs, with frequency limit, direct IO failures.
Link: http://lkml.kernel.org/r/20181102170632.25921-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Spock reported that commit 172b06c32b ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.
The reason behind is that the mentioned above change created some
minimal background pressure on the inode cache. The problem is that if
an inode is considered to be reclaimed, all belonging pagecache page are
stripped, no matter how many of them are there. So, if a huge
multi-gigabyte file is cached in the memory, and the goal is to reclaim
only few slab objects (unused inodes), we still can eventually evict all
gigabytes of the pagecache at once.
The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.
To solve the problem let's postpone the reclaim of inodes, which have
more than 1 attached page. Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.
Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Spock <dairinin@gmail.com>
Tested-by: Spock <dairinin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [4.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page state checks are racy. Under a heavy memory workload (e.g. stress
-m 200 -t 2h) it is quite easy to hit a race window when the page is
allocated but its state is not fully populated yet. A debugging patch to
dump the struct page state shows
has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)
Note that the state has been checked for both PageLRU and PageSwapBacked
already. Closing this race completely would require some sort of retry
logic. This can be tricky and error prone (think of potential endless
or long taking loops).
Workaround this problem for movable zones at least. Such a zone should
only contain movable pages. Commit 15c30bc090 ("mm, memory_hotplug:
make has_unmovable_pages more robust") has told us that this is not
strictly true though. Bootmem pages should be marked reserved though so
we can move the original check after the PageReserved check. Pages from
other zones are still prone to races but we even do not pretend that
memory hotremove works for those so pre-mature failure doesn't hurt that
much.
Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
Fixes: 15c30bc090 ("mm, memory_hotplug: make has_unmovable_pages more robust")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Baoquan He <bhe@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a2468cc9bf ("swap: choose swap device according to numa node")
changed 'avail_lists' field of 'struct swap_info_struct' to an array.
In popular linux distros it increased size of swap_info_struct up to 40
Kbytes and now swap_info_struct allocation requires order-4 page.
Switch to kvzmalloc allows to avoid unexpected allocation failures.
Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com
Fixes: a2468cc9bf ("swap: choose swap device according to numa node")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jarkko's e-mail address hasn't worked for a long time. We still want to
keep this driver working as it is critical for some of the OMAP boards.
I use and test this driver frequently, so change myself as a maintainer
with "Odd Fixes" status.
Link: http://lkml.kernel.org/r/20181106222750.12939-1-aaro.koskinen@iki.fi
Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This bug has been experienced several times by the Oracle DB team. The
BUG is in remove_inode_hugepages() as follows:
/*
* If page is mapped, it was faulted in after being
* unmapped in caller. Unmap (again) now after taking
* the fault mutex. The mutex will prevent faults
* until we finish removing the page.
*
* This race can only happen in the hole punch case.
* Getting here in a truncate operation is a bug.
*/
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);
In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code. Consider the following:
- Process A maps a hugetlbfs file of sufficient size and alignment
(PUD_SIZE) that a pmd page could be shared.
- Process B maps the same hugetlbfs file with the same size and
alignment such that a pmd page is shared.
- Process B then calls mprotect() to change protections for the mapping
with the shared pmd. As a result, the pmd is 'unshared'.
- Process B then calls mprotect() again to chage protections for the
mapping back to their original value. pmd remains unshared.
- Process B then forks and process C is created. During the fork
process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
tables. Copying page tables for hugetlb mappings is done in the
routine copy_hugetlb_page_range.
In copy_hugetlb_page_range(), the destination pte is obtained by:
dst_pte = huge_pte_alloc(dst, addr, sz);
If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table. In the situation above, process C could share with
either process A or process B. Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.
However, the check for pmd sharing in copy_hugetlb_page_range is:
/* If the pagetables are shared don't copy or take references */
if (dst_pte == src_pte)
continue;
Since process C is sharing with process A instead of process B, the
above test fails. The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte. It copies the pte entry
from src_pte to dst_pte and increments this map count of the associated
page. This is how we end up with an elevated map count.
To solve, check the dst_pte entry for huge_pte_none. If !none, this
implies PMD sharing so do not copy.
Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
Fixes: c5c99429fa ("fix hugepages leak due to pagetable page sharing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized. Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.
Warning was:
kernel/sched/psi.c: In function `cgroup_move_task':
kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]
Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
Fixes: 2ce7135adc ("psi: cgroup support")
Signed-off-by: Olof Johansson <olof@lixom.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>