When vcpu sets DEBUGCTLMSR_LBR in the MSR_IA32_DEBUGCTLMSR, the KVM handler
would create a guest LBR event which enables the callstack mode and none of
hardware counter is assigned. The host perf would schedule and enable this
event as usual but in an exclusive way.
The guest LBR event will be released when the vPMU is reset but soon,
the lazy release mechanism would be applied to this event like a vPMC.
Suggested-by: Andi Kleen <ak@linux.intel.com>
Co-developed-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210201051039.255478-6-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Usespace could set the bits [0, 5] of the IA32_PERF_CAPABILITIES
MSR which tells about the record format stored in the LBR records.
The LBR will be enabled on the guest if host perf supports LBR
(checked via x86_perf_get_lbr()) and the vcpu model is compatible
with the host one.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210201051039.255478-4-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Usespace could set the bits [0, 5] of the IA32_PERF_CAPABILITIES
MSR which tells about the record format stored in the LBR records.
The LBR will be enabled on the guest if host perf supports LBR
(checked via x86_perf_get_lbr()) and the vcpu model is compatible
with the host one.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210201051039.255478-4-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Once MSR_IA32_PERF_CAPABILITIES is changed via vmx_set_msr(), the
value should not be changed by cpuid(). To ensure that the new value
is kept, the default initialization path is moved to intel_pmu_init().
The effective value of the MSR will be 0 if PDCM is clear, however.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To make code responsibilities clear, we may resue and invoke the
vmx_set_intercept_for_msr() in other vmx-specific files (e.g. pmu_intel.c),
so expose it to passthrough LBR msrs later.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210201051039.255478-2-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SVM already has specific handlers of MSR_IA32_DEBUGCTLMSR in the
svm_get/set_msr, so the x86 common part can be safely moved to VMX.
This allows KVM to store the bits it supports in GUEST_IA32_DEBUGCTL.
Add vmx_supported_debugctl() to refactor the throwing logic of #GP.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210108013704.134985-2-like.xu@linux.intel.com>
[Merge parts of Chenyi Qiang's "KVM: X86: Expose bus lock debug exception
to guest". - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use x2apic_mode instead of x2apic_enabled() when adjusting the
destination ID during Posted Interrupt updates. This avoids the costly
RDMSR that is hidden behind x2apic_enabled().
Reported-by: luferry <luferry@163.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210115220354.434807-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Export x2apic_mode so that KVM can query whether x2APIC is active
without having to incur the RDMSR in x2apic_enabled(). When Posted
Interrupts are in use for a guest with an assigned device, KVM ends up
checking for x2APIC at least once every time a vCPU halts. KVM could
obviously snapshot x2apic_enabled() to avoid the RDMSR, but that's
rather silly given that x2apic_mode holds the exact info needed by KVM.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210115220354.434807-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Virtual Machine can exploit bus locks to degrade the performance of
system. Bus lock can be caused by split locked access to writeback(WB)
memory or by using locks on uncacheable(UC) memory. The bus lock is
typically >1000 cycles slower than an atomic operation within a cache
line. It also disrupts performance on other cores (which must wait for
the bus lock to be released before their memory operations can
complete).
To address the threat, bus lock VM exit is introduced to notify the VMM
when a bus lock was acquired, allowing it to enforce throttling or other
policy based mitigations.
A VMM can enable VM exit due to bus locks by setting a new "Bus Lock
Detection" VM-execution control(bit 30 of Secondary Processor-based VM
execution controls). If delivery of this VM exit was preempted by a
higher priority VM exit (e.g. EPT misconfiguration, EPT violation, APIC
access VM exit, APIC write VM exit, exception bitmap exiting), bit 26 of
exit reason in vmcs field is set to 1.
In current implementation, the KVM exposes this capability through
KVM_CAP_X86_BUS_LOCK_EXIT. The user can get the supported mode bitmap
(i.e. off and exit) and enable it explicitly (disabled by default). If
bus locks in guest are detected by KVM, exit to user space even when
current exit reason is handled by KVM internally. Set a new field
KVM_RUN_BUS_LOCK in vcpu->run->flags to inform the user space that there
is a bus lock detected in guest.
Document for Bus Lock VM exit is now available at the latest "Intel
Architecture Instruction Set Extensions Programming Reference".
Document Link:
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20201106090315.18606-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reset the vcpu->run->flags at the beginning of kvm_arch_vcpu_ioctl_run.
It can avoid every thunk of code that needs to set the flag clear it,
which increases the odds of missing a case and ending up with a flag in
an undefined state.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20201106090315.18606-3-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert vcpu_vmx.exit_reason from a u32 to a union (of size u32). The
full VM_EXIT_REASON field is comprised of a 16-bit basic exit reason in
bits 15:0, and single-bit modifiers in bits 31:16.
Historically, KVM has only had to worry about handling the "failed
VM-Entry" modifier, which could only be set in very specific flows and
required dedicated handling. I.e. manually stripping the FAILED_VMENTRY
bit was a somewhat viable approach. But even with only a single bit to
worry about, KVM has had several bugs related to comparing a basic exit
reason against the full exit reason store in vcpu_vmx.
Upcoming Intel features, e.g. SGX, will add new modifier bits that can
be set on more or less any VM-Exit, as opposed to the significantly more
restricted FAILED_VMENTRY, i.e. correctly handling everything in one-off
flows isn't scalable. Tracking exit reason in a union forces code to
explicitly choose between consuming the full exit reason and the basic
exit, and is a convenient way to document and access the modifiers.
No functional change intended.
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20201106090315.18606-2-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The SEV FW version >= 0.23 added a new command that can be used to query
the attestation report containing the SHA-256 digest of the guest memory
encrypted through the KVM_SEV_LAUNCH_UPDATE_{DATA, VMSA} commands and
sign the report with the Platform Endorsement Key (PEK).
See the SEV FW API spec section 6.8 for more details.
Note there already exist a command (KVM_SEV_LAUNCH_MEASURE) that can be
used to get the SHA-256 digest. The main difference between the
KVM_SEV_LAUNCH_MEASURE and KVM_SEV_ATTESTATION_REPORT is that the latter
can be called while the guest is running and the measurement value is
signed with PEK.
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: Tom Lendacky <Thomas.Lendacky@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: John Allen <john.allen@amd.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: linux-crypto@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: James Bottomley <jejb@linux.ibm.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Message-Id: <20210104151749.30248-1-brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the update_pte() shadow paging logic, which was obsoleted by
commit 4731d4c7a0 ("KVM: MMU: out of sync shadow core"), but never
removed. As pointed out by Yu, KVM never write protects leaf page
tables for the purposes of shadow paging, and instead marks their
associated shadow page as unsync so that the guest can write PTEs at
will.
The update_pte() path, which predates the unsync logic, optimizes COW
scenarios by refreshing leaf SPTEs when they are written, as opposed to
zapping the SPTE, restarting the guest, and installing the new SPTE on
the subsequent fault. Since KVM no longer write-protects leaf page
tables, update_pte() is unreachable and can be dropped.
Reported-by: Yu Zhang <yu.c.zhang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210115004051.4099250-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose AVX (VEX-encoded) versions of the Vector Neural Network
Instructions to guest.
The bit definition:
CPUID.(EAX=7,ECX=1):EAX[bit 4] AVX_VNNI
The following instructions are available when this feature is
present in the guest.
1. VPDPBUS: Multiply and Add Unsigned and Signed Bytes
2. VPDPBUSDS: Multiply and Add Unsigned and Signed Bytes with Saturation
3. VPDPWSSD: Multiply and Add Signed Word Integers
4. VPDPWSSDS: Multiply and Add Signed Integers with Saturation
This instruction is currently documented in the latest "extensions"
manual (ISE). It will appear in the "main" manual (SDM) in the future.
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20210105004909.42000-3-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add AVX version of the Vector Neural Network (VNNI) Instructions.
A processor supports AVX VNNI instructions if CPUID.0x07.0x1:EAX[4] is
present. The following instructions are available when this feature is
present.
1. VPDPBUS: Multiply and Add Unsigned and Signed Bytes
2. VPDPBUSDS: Multiply and Add Unsigned and Signed Bytes with Saturation
3. VPDPWSSD: Multiply and Add Signed Word Integers
4. VPDPWSSDS: Multiply and Add Signed Integers with Saturation
The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx_vnni" in /proc/cpuinfo.
This instruction is currently documented in the latest "extensions"
manual (ISE). It will appear in the "main" manual (SDM) in the future.
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20210105004909.42000-2-yang.zhong@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix the following coccicheck warning:
./arch/x86/kvm/x86.c:8012:5-48: WARNING: Comparison to bool
Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Message-Id: <1610357578-66081-1-git-send-email-abaci-bugfix@linux.alibaba.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Walk the list of MMU pages in reverse in kvm_mmu_zap_oldest_mmu_pages().
The list is FIFO, meaning new pages are inserted at the head and thus
the oldest pages are at the tail. Using a "forward" iterator causes KVM
to zap MMU pages that were just added, which obliterates guest
performance once the max number of shadow MMU pages is reached.
Fixes: 6b82ef2c9c ("KVM: x86/mmu: Batch zap MMU pages when recycling oldest pages")
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210113205030.3481307-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return a 'bool' instead of an 'int' for various PTE accessors that are
boolean in nature, e.g. is_shadow_present_pte(). Returning an int is
goofy and potentially dangerous, e.g. if a flag being checked is moved
into the upper 32 bits of a SPTE, then the compiler may silently squash
the entire check since casting to an int is guaranteed to yield a
return value of '0'.
Opportunistically refactor is_last_spte() so that it naturally returns
a bool value instead of letting it implicitly cast 0/1 to false/true.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210123003003.3137525-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enter a SRCU critical section for a memslots lookup during steal time
update if and only if a steal time update is actually needed. Taking
the lock can be avoided if steal time is disabled by the guest, or if
KVM knows it has already flagged the vCPU as being preempted.
Reword the comment to be more precise as to exactly why memslots will
be queried.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210123000334.3123628-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the disabling of page faults across kvm_steal_time_set_preempted()
as KVM now accesses the steal time struct (shared with the guest) via a
cached mapping (see commit b043138246, "x86/KVM: Make sure
KVM_VCPU_FLUSH_TLB flag is not missed".) The cache lookup is flagged as
atomic, thus it would be a bug if KVM tried to resolve a new pfn, i.e.
we want the splat that would be reached via might_fault().
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210123000334.3123628-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is a bug in the TDP MMU function to zap SPTEs which could be
replaced with a larger mapping which prevents the function from doing
anything. Fix this by correctly zapping the last level SPTEs.
Cc: stable@vger.kernel.org
Fixes: 1488199856 ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU")
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210202185734.1680553-11-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If not in long mode, the low bits of CR3 are reserved but not enforced to
be zero, so remove those checks. If in long mode, however, the MBZ bits
extend down to the highest physical address bit of the guest, excluding
the encryption bit.
Make the checks consistent with the above, and match them between
nested_vmcb_checks and KVM_SET_SREGS.
Cc: stable@vger.kernel.org
Fixes: 761e416934 ("KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guests")
Fixes: a780a3ea62 ("KVM: X86: Fix reserved bits check for MOV to CR3")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't let KVM load when running as an SEV guest, regardless of what
CPUID says. Memory is encrypted with a key that is not accessible to
the host (L0), thus it's impossible for L0 to emulate SVM, e.g. it'll
see garbage when reading the VMCB.
Technically, KVM could decrypt all memory that needs to be accessible to
the L0 and use shadow paging so that L0 does not need to shadow NPT, but
exposing such information to L0 largely defeats the purpose of running as
an SEV guest. This can always be revisited if someone comes up with a
use case for running VMs inside SEV guests.
Note, VMLOAD, VMRUN, etc... will also #GP on GPAs with C-bit set, i.e. KVM
is doomed even if the SEV guest is debuggable and the hypervisor is willing
to decrypt the VMCB. This may or may not be fixed on CPUs that have the
SVME_ADDR_CHK fix.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210202212017.2486595-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set the emulator context to PROT64 if SYSENTER transitions from 32-bit
userspace (compat mode) to a 64-bit kernel, otherwise the RIP update at
the end of x86_emulate_insn() will incorrectly truncate the new RIP.
Note, this bug is mostly limited to running an Intel virtual CPU model on
an AMD physical CPU, as other combinations of virtual and physical CPUs
do not trigger full emulation. On Intel CPUs, SYSENTER in compatibility
mode is legal, and unconditionally transitions to 64-bit mode. On AMD
CPUs, SYSENTER is illegal in compatibility mode and #UDs. If the vCPU is
AMD, KVM injects a #UD on SYSENTER in compat mode. If the pCPU is Intel,
SYSENTER will execute natively and not trigger #UD->VM-Exit (ignoring
guest TLB shenanigans).
Fixes: fede8076aa ("KVM: x86: handle wrap around 32-bit address space")
Cc: stable@vger.kernel.org
Signed-off-by: Jonny Barker <jonny@jonnybarker.com>
[sean: wrote changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210202165546.2390296-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Under the GHCB specification, SEV-ES guests can support string I/O.
The current #VC handler contains this support, so remove the need to
unroll kernel string I/O operations. This will reduce the number of #VC
exceptions generated as well as the number VM exits for the guest.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/3de04b5b638546ac75d42ba52307fe1a922173d3.1612203987.git.thomas.lendacky@amd.com
Add Alder Lake mobile processor to CPU list to enumerate and enable the
split lock feature.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210201190007.4031869-1-fenghua.yu@intel.com
Commit 7a873e4555 ("KVM: selftests: Verify supported CR4 bits can be set
before KVM_SET_CPUID2") reveals that KVM allows to set X86_CR4_PCIDE even
when PCID support is missing:
==== Test Assertion Failure ====
x86_64/set_sregs_test.c:41: rc
pid=6956 tid=6956 - Invalid argument
1 0x000000000040177d: test_cr4_feature_bit at set_sregs_test.c:41
2 0x00000000004014fc: main at set_sregs_test.c:119
3 0x00007f2d9346d041: ?? ??:0
4 0x000000000040164d: _start at ??:?
KVM allowed unsupported CR4 bit (0x20000)
Add X86_FEATURE_PCID feature check to __cr4_reserved_bits() to make
kvm_is_valid_cr4() fail.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210201142843.108190-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Userspace that does not know about KVM_GET_MSR_FEATURE_INDEX_LIST
will generally use the default value for MSR_IA32_ARCH_CAPABILITIES.
When this happens and the host has tsx=on, it is possible to end up with
virtual machines that have HLE and RTM disabled, but TSX_CTRL available.
If the fleet is then switched to tsx=off, kvm_get_arch_capabilities()
will clear the ARCH_CAP_TSX_CTRL_MSR bit and it will not be possible to
use the tsx=off hosts as migration destinations, even though the guests
do not have TSX enabled.
To allow this migration, allow guests to write to their TSX_CTRL MSR,
while keeping the host MSR unchanged for the entire life of the guests.
This ensures that TSX remains disabled and also saves MSR reads and
writes, and it's okay to do because with tsx=off we know that guests will
not have the HLE and RTM features in their CPUID. (If userspace sets
bogus CPUID data, we do not expect HLE and RTM to work in guests anyway).
Cc: stable@vger.kernel.org
Fixes: cbbaa2727a ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tom reported that one of the GDB test-cases failed, and Boris bisected
it to commit:
d53d9bc0cf ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")
The debugging session led us to commit:
6c0aca288e ("x86: Ignore trap bits on single step exceptions")
It turns out that TF and data breakpoints are both traps and will be
merged, while instruction breakpoints are faults and will not be merged.
This means 6c0aca288e is wrong, only TF and instruction breakpoints
need to be excluded while TF and data breakpoints can be merged.
[ bp: Massage commit message. ]
Fixes: d53d9bc0cf ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")
Fixes: 6c0aca288e ("x86: Ignore trap bits on single step exceptions")
Reported-by: Tom de Vries <tdevries@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/YBMAbQGACujjfz%2Bi@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20210128211627.GB4348@worktop.programming.kicks-ass.net
With Architectural Performance Monitoring Version 5, CPUID 10.ECX cpu
leaf indicates the fixed counter enumeration. This extends the previous
count to a bitmap which allows disabling even lower fixed counters.
It could be used by a Hypervisor.
The existing intel_ctrl variable is used to remember the bitmask of the
counters. All code that reads all counters is fixed to check this extra
bitmask.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Originally-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1611873611-156687-6-git-send-email-kan.liang@linux.intel.com
Add perf core PMU support for the Intel Sapphire Rapids server, which is
the successor of the Intel Ice Lake server. The enabling code is based
on Ice Lake, but there are several new features introduced.
The event encoding is changed and simplified, e.g., the event codes
which are below 0x90 are restricted to counters 0-3. The event codes
which above 0x90 are likely to have no restrictions. The event
constraints, extra_regs(), and hardware cache events table are changed
accordingly.
A new Precise Distribution (PDist) facility is introduced, which
further minimizes the skid when a precise event is programmed on the GP
counter 0. Enable the Precise Distribution (PDist) facility with :ppp
event. For this facility to work, the period must be initialized with a
value larger than 127. Add spr_limit_period() to apply the limit for
:ppp event.
Two new data source fields, data block & address block, are added in the
PEBS Memory Info Record for the load latency event. To enable the
feature,
- An auxiliary event has to be enabled together with the load latency
event on Sapphire Rapids. A new flag PMU_FL_MEM_LOADS_AUX is
introduced to indicate the case. A new event, mem-loads-aux, is
exposed to sysfs for the user tool.
Add a check in hw_config(). If the auxiliary event is not detected,
return an unique error -ENODATA.
- The union perf_mem_data_src is extended to support the new fields.
- Ice Lake and earlier models do not support block information, but the
fields may be set by HW on some machines. Add pebs_no_block to
explicitly indicate the previous platforms which don't support the new
block fields. Accessing the new block fields are ignored on those
platforms.
A new store Latency facility is introduced, which leverages the PEBS
facility where it can provide additional information about sampled
stores. The additional information includes the data address, memory
auxiliary info (e.g. Data Source, STLB miss) and the latency of the
store access. To enable the facility, the new event (0x02cd) has to be
programed on the GP counter 0. A new flag PERF_X86_EVENT_PEBS_STLAT is
introduced to indicate the event. The store_latency_data() is introduced
to parse the memory auxiliary info.
The layout of access latency field of PEBS Memory Info Record has been
changed. Two latency, instruction latency (bit 15:0) and cache access
latency (bit 47:32) are recorded.
- The cache access latency is similar to previous memory access latency.
For loads, the latency starts by the actual cache access until the
data is returned by the memory subsystem.
For stores, the latency starts when the demand write accesses the L1
data cache and lasts until the cacheline write is completed in the
memory subsystem.
The cache access latency is stored in low 32bits of the sample type
PERF_SAMPLE_WEIGHT_STRUCT.
- The instruction latency starts by the dispatch of the load operation
for execution and lasts until completion of the instruction it belongs
to.
Add a new flag PMU_FL_INSTR_LATENCY to indicate the instruction
latency support. The instruction latency is stored in the bit 47:32
of the sample type PERF_SAMPLE_WEIGHT_STRUCT.
Extends the PERF_METRICS MSR to feature TMA method level 2 metrics. The
lower half of the register is the TMA level 1 metrics (legacy). The
upper half is also divided into four 8-bit fields for the new level 2
metrics. Expose all eight Topdown metrics events to user space.
The full description for the SPR features can be found at Intel
Architecture Instruction Set Extensions and Future Features
Programming Reference, 319433-041.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1611873611-156687-5-git-send-email-kan.liang@linux.intel.com
Intel Sapphire Rapids server will introduce 8 metrics events. Intel
Ice Lake only supports 4 metrics events. A perf tool user may mistakenly
use the unsupported events via RAW format on Ice Lake. The user can
still get a value from the unsupported Topdown metrics event once the
following Sapphire Rapids enabling patch is applied.
To enable the 8 metrics events on Intel Sapphire Rapids, the
INTEL_TD_METRIC_MAX has to be updated, which impacts the
is_metric_event(). The is_metric_event() is a generic function.
On Ice Lake, the newly added SPR metrics events will be mistakenly
accepted as metric events on creation. At runtime, the unsupported
Topdown metrics events will be updated.
Add a variable num_topdown_events in x86_pmu to indicate the available
number of the Topdown metrics event on the platform. Apply the number
into is_metric_event(). Only the supported Topdown metrics events
should be created as metrics events.
Apply the num_topdown_events in icl_update_topdown_event() as well. The
function can be reused by the following patch.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1611873611-156687-4-git-send-email-kan.liang@linux.intel.com
Similar to Ice Lake, Intel Sapphire Rapids server also supports the
topdown performance metrics feature. The difference is that Intel
Sapphire Rapids server extends the PERF_METRICS MSR to feature TMA
method level two metrics, which will introduce 8 metrics events. Current
icl_update_topdown_event() only check 4 level one metrics events.
Factor out intel_update_topdown_event() to facilitate the code sharing
between Ice Lake and Sapphire Rapids.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1611873611-156687-3-git-send-email-kan.liang@linux.intel.com
Current PERF_SAMPLE_WEIGHT sample type is very useful to expresses the
cost of an action represented by the sample. This allows the profiler
to scale the samples to be more informative to the programmer. It could
also help to locate a hotspot, e.g., when profiling by memory latencies,
the expensive load appear higher up in the histograms. But current
PERF_SAMPLE_WEIGHT sample type is solely determined by one factor. This
could be a problem, if users want two or more factors to contribute to
the weight. For example, Golden Cove core PMU can provide both the
instruction latency and the cache Latency information as factors for the
memory profiling.
For current X86 platforms, although meminfo::latency is defined as a
u64, only the lower 32 bits include the valid data in practice (No
memory access could last than 4G cycles). The higher 32 bits can be used
to store new factors.
Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new
sample weight structure. It shares the same space as the
PERF_SAMPLE_WEIGHT sample type.
Users can apply either the PERF_SAMPLE_WEIGHT sample type or the
PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but
they cannot apply both sample types simultaneously.
Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type.
- For PowerPC, there is nothing changed for the PERF_SAMPLE_WEIGHT
sample type. There is no effect for the new PERF_SAMPLE_WEIGHT_STRUCT
sample type. PowerPC can re-struct the weight field similarly later.
- For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT
sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now.
The following patches will apply the new factors for the
PERF_SAMPLE_WEIGHT_STRUCT sample type.
The field in the union perf_sample_weight should be shared among
different architectures. A generic name is required, but it's hard to
abstract a name that applies to all architectures. For example, on X86,
the fields are to store all kinds of latency. While on PowerPC, it
stores MMCRA[TECX/TECM], which should not be latency. So a general name
prefix 'var$NUM' is used here.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1611873611-156687-2-git-send-email-kan.liang@linux.intel.com
toolchains which do not generate section symbols anymore. And a cleanup
ontop.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmAWi8EACgkQEsHwGGHe
VUrZhBAAmbaEBdU05+ah24r+XHLsCJBQwlwHAr71YfNnEpq/KRHXLtc3QJLAdOVf
ku0536gDJvMUX7d7ap2ooSVAD9Ed1h4N7BvOn8eLFiaPc6NG9Tw6FZc/X6OKtyLd
DyFOsNAa9JnjjeqT7TTYdqbcJUzPSqd3Ufg5V4UZcVwvGTkbc+k1TktnteTMXWUI
t99wXCOfw2accdUrr3MIkdvSGNo099VZa/DBZQVmpjCcMSOfe/0KQIoeVagEpAew
T0WxONdM62Nz4Tv03N6m6EqVpIOc8BueRuOWlX/c5XVCmYx8BDSdFb6EY9sEh10i
hLU1U36BCUT1uAA/ZAuw/I22fy5MXqbrGvWJrcW8Wav1fQfaDYkDyGNE+aBjXysQ
uZGTzbfAdAS2B8XTElzYJZwh1WW7Je7b2pZhL5/6kwoa8E82NsR7a2inl6pdkKin
LcrLlxrSZYbAjhYuA3Da4iErvtu/UloQwfDhga7NasGdVQzlwUQBX67Tgt1PA9B2
JWoeY1NKBGboNEQa3NWq37yCtfcpx2hL4wWgyUbj0TMOXO06V/ZhrPzIQDrMmVGx
g52NrYnH/CujrKgWH3+Q+kBWA/BSVP5p3UnhLCDM1X7dyZiimuLOJNDUQ9WldENV
rsGgKyW3/6F4UzmqLr0oOB6X9/2v15LSktN9BJtv3UWUCl/PfXU=
=VV7d
-----END PGP SIGNATURE-----
Merge tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"A single fix for objtool to generate proper unwind info for newer
toolchains which do not generate section symbols anymore. And a
cleanup ontop.
This was originally going to go during the next merge window but
people can already trigger a build error with binutils-2.36 which
doesn't emit section symbols - something which objtool relies on - so
let's expedite it"
* tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Remove put_ret_addr_in_rdi THUNK macro argument
x86/entry: Emit a symbol for register restoring thunk
free_ldt_pgtables() uses the MMU gather API for batching TLB flushes
over the call to free_pgd_range(). However, tlb_gather_mmu() expects
to operate on user addresses and so passing LDT_{BASE,END}_ADDR will
confuse the range setting logic in __tlb_adjust_range(), causing the
gather to identify a range starting at TASK_SIZE. Such a large range
will be converted into a 'fullmm' flush by the low-level invalidation
code, so change the caller to invoke tlb_gather_mmu_fullmm() directly.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-7-will@kernel.org
If __tlb_remove_tlb_entry() is not defined by the architecture then
we provide an empty definition in asm-generic/tlb.h.
Remove the redundant empty definitions for sparc64 and x86.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-6-will@kernel.org
The 'start' and 'end' arguments to tlb_gather_mmu() are no longer
needed now that there is a separate function for 'fullmm' flushing.
Remove the unused arguments and update all callers.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com
Since commit 7a30df49f6 ("mm: mmu_gather: remove __tlb_reset_range()
for force flush"), the 'start' and 'end' arguments to tlb_finish_mmu()
are no longer used, since we flush the whole mm in case of a nested
invalidation.
Remove the unused arguments and update all callers.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org
With retpolines disabled, some configurations of GCC, and specifically
the GCC versions 9 and 10 in Ubuntu will add Intel CET instrumentation
to the kernel by default. That breaks certain tracing scenarios by
adding a superfluous ENDBR64 instruction before the fentry call, for
functions which can be called indirectly.
CET instrumentation isn't currently necessary in the kernel, as CET is
only supported in user space. Disable it unconditionally and move it
into the x86's Makefile as CET/CFI... enablement should be a per-arch
decision anyway.
[ bp: Massage and extend commit message. ]
Fixes: 29be86d7f9 ("kbuild: add -fcf-protection=none when using retpoline flags")
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Cc: <stable@vger.kernel.org>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Link: https://lkml.kernel.org/r/20210128215219.6kct3h2eiustncws@treble
Use sizeof() instead of a constant in fpstate_sanitize_xstate().
Remove use of the address of the 0th array element of ->st_space and
->xmm_space which is equivalent to the array address itself:
No code changed:
# arch/x86/kernel/fpu/xstate.o:
text data bss dec hex filename
9694 899 4 10597 2965 xstate.o.before
9694 899 4 10597 2965 xstate.o.after
md5:
5a43fc70bad8e2a1784f67f01b71aabb xstate.o.before.asm
5a43fc70bad8e2a1784f67f01b71aabb xstate.o.after.asm
[ bp: Massage commit message. ]
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210122071925.41285-1-yejune.deng@gmail.com
The remaining callers of kernel_fpu_begin() in 64-bit kernels don't use 387
instructions, so there's no need to sanitize the FPU state. Skip it to get
most of the performance we lost back.
Reported-by: Krzysztof Olędzki <ole@ans.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/57f8841ccbf9f3c25a23196c888f5f6ec5887577.1611205691.git.luto@kernel.org
EFI uses kernel_fpu_begin() to conform to the UEFI calling convention.
This specifically requires initializing FCW (FPU Control Word), whereas
no sane 64-bit kernel code should use legacy 387 operations that
reference FCW.
This should allow to safely change the default semantics of
kernel_fpu_begin() to stop initializing FCW on 64-bit kernels.
[ bp: Massage commit message a little. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/25d392fff64680e0f4bb8cf0b1003314dc29eafe.1611205691.git.luto@kernel.org
The "oprofile" user-space tools don't use the kernel OPROFILE support
any more, and haven't in a long time. User-space has been converted to
the perf interfaces.
Remove the old oprofile's architecture specific support.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Robert Richter <rric@kernel.org>
Acked-by: William Cohen <wcohen@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
drivers/net/can/dev.c
b552766c87 ("can: dev: prevent potential information leak in can_fill_info()")
3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")
0a042c6ec9 ("can: dev: move netlink related code into seperate file")
Code move.
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
57ac4a31c4 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down")
214baf2287 ("net/mlx5e: Support HTB offload")
Adjacent code changes
net/switchdev/switchdev.c
20776b465c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP")
ffb68fc58e ("net: switchdev: remove the transaction structure from port object notifiers")
bae33f2b5a ("net: switchdev: remove the transaction structure from port attributes")
Transaction parameter gets dropped otherwise keep the fix.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYBLX8QAKCRCAXGG7T9hj
vr2TAP4ylwxPVbf1l1V8zYCtCoNg087+Ubolr5kXXJkesG/nkgD6A2ix2oN1sC0Z
kbFBeZHqgP4AbVl7IhBALVFa1GPxWQg=
=NPGM
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.11-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- A fix for a regression introduced in 5.11 resulting in Xen dom0
having problems to correctly initialize Xenstore.
- A fix for avoiding WARN splats when booting as Xen dom0 with
CONFIG_AMD_MEM_ENCRYPT enabled due to a missing trap handler for the
#VC exception (even if the handler should never be called).
- A fix for the Xen bklfront driver adapting to the correct but
unexpected behavior of new qemu.
* tag 'for-linus-5.11-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled
xen: Fix XenStore initialisation for XS_LOCAL
xen-blkfront: allow discard-* nodes to be optional
Grab kvm->lock before pinning memory when registering an encrypted
region; sev_pin_memory() relies on kvm->lock being held to ensure
correctness when checking and updating the number of pinned pages.
Add a lockdep assertion to help prevent future regressions.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: 1e80fdc09d ("KVM: SVM: Pin guest memory when SEV is active")
Signed-off-by: Peter Gonda <pgonda@google.com>
V2
- Fix up patch description
- Correct file paths svm.c -> sev.c
- Add unlock of kvm->lock on sev_pin_memory error
V1
- https://lore.kernel.org/kvm/20210126185431.1824530-1-pgonda@google.com/
Message-Id: <20210127161524.2832400-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Collect the scattered SME/SEV related feature flags into a dedicated
word. There are now five recognized features in CPUID.0x8000001F.EAX,
with at least one more on the horizon (SEV-SNP). Using a dedicated word
allows KVM to use its automagic CPUID adjustment logic when reporting
the set of supported features to userspace.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Link: https://lkml.kernel.org/r/20210122204047.2860075-2-seanjc@google.com
This is similar to commit
b21ebf2fb4 ("x86: Treat R_X86_64_PLT32 as R_X86_64_PC32")
but for i386. As far as the kernel is concerned, R_386_PLT32 can be
treated the same as R_386_PC32.
R_386_PLT32/R_X86_64_PLT32 are PC-relative relocation types which
can only be used by branches. If the referenced symbol is defined
externally, a PLT will be used.
R_386_PC32/R_X86_64_PC32 are PC-relative relocation types which can be
used by address taking operations and branches. If the referenced symbol
is defined externally, a copy relocation/canonical PLT entry will be
created in the executable.
On x86-64, there is no PIC vs non-PIC PLT distinction and an
R_X86_64_PLT32 relocation is produced for both `call/jmp foo` and
`call/jmp foo@PLT` with newer (2018) GNU as/LLVM integrated assembler.
This avoids canonical PLT entries (st_shndx=0, st_value!=0).
On i386, there are 2 types of PLTs, PIC and non-PIC. Currently,
the GCC/GNU as convention is to use R_386_PC32 for non-PIC PLT and
R_386_PLT32 for PIC PLT. Copy relocations/canonical PLT entries
are possible ABI issues but GCC/GNU as will likely keep the status
quo because (1) the ABI is legacy (2) the change will drop a GNU
ld diagnostic for non-default visibility ifunc in shared objects.
clang-12 -fno-pic (since [1]) can emit R_386_PLT32 for compiler
generated function declarations, because preventing canonical PLT
entries is weighed over the rare ifunc diagnostic.
Further info for the more interested:
https://github.com/ClangBuiltLinux/linux/issues/1210https://sourceware.org/bugzilla/show_bug.cgi?id=27169a084c0388e [1]
[ bp: Massage commit message. ]
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Link: https://lkml.kernel.org/r/20210127205600.1227437-1-maskray@google.com
Recent commit 255cbecfe0 modified struct kvm_vcpu_arch to make
'cpuid_entries' a pointer to an array of kvm_cpuid_entry2 entries
rather than embedding the array in the struct. KVM_SET_CPUID and
KVM_SET_CPUID2 were updated accordingly, but KVM_GET_CPUID2 was missed.
As a result, KVM_GET_CPUID2 currently returns random fields from struct
kvm_vcpu_arch to userspace rather than the expected CPUID values. Fix
this by treating 'cpuid_entries' as a pointer when copying its
contents to userspace buffer.
Fixes: 255cbecfe0 ("KVM: x86: allocate vcpu->arch.cpuid_entries dynamically")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com.com>
Message-Id: <20210128024451.1816770-1-michael.roth@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit
a7e1f67ed2 ("x86/msr: Filter MSR writes")
introduced a module parameter to disable writing to the MSR device file
and tainted the kernel upon writing. As MSR registers can be written by
the X86_IOC_WRMSR_REGS ioctl too, the same filtering and tainting should
be applied to the ioctl as well.
[ bp: Massage commit message and space out statements. ]
Fixes: a7e1f67ed2 ("x86/msr: Filter MSR writes")
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210127122456.13939-1-misono.tomohiro@jp.fujitsu.com
Perfmon-v4 counter freezing is fundamentally broken; remove this default
disabled code to make sure nobody uses it.
The feature is called Freeze-on-PMI in the SDM, and if it would do that,
there wouldn't actually be a problem, *however* it does something subtly
different. It globally disables the whole PMU when it raises the PMI,
not when the PMI hits.
This means there's a window between the PMI getting raised and the PMI
actually getting served where we loose events and this violates the
perf counter independence. That is, a counting event should not result
in a different event count when there is a sampling event co-scheduled.
This is known to break existing software (RR).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Clean up that CONFIG_RETPOLINE crud and replace the
indirect call x86_pmu.guest_get_msrs with static_call().
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210125121458.181635-1-like.xu@linux.intel.com
* Documentation fixes
* Avoid performance regression due to SEV-ES patches
ARM:
- Don't allow tagged pointers to point to memslots
- Filter out ARMv8.1+ PMU events on v8.0 hardware
- Hide PMU registers from userspace when no PMU is configured
- More PMU cleanups
- Don't try to handle broken PSCI firmware
- More sys_reg() to reg_to_encoding() conversions
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmAQSWEUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMgBQgAlKke5b6SljZs19uYbzwpOClI5V9M
x5Sl9doPI/8TgM1/Oe/T3THb4iulVAFOFulFt+gAGeEJPtQN42bQA0iSqqKOXsox
ordveCZ90hgo8SE+F3vcL4k7DAZw7eoHyQxJJ7L5UmMKosQy+ujouVYyxMeBsmvK
mZWtqPrAK+1Uot6YQu52SPCqqUuIAVd/loGR6s4TZgnaiOykIKWsLOBHLx0mfAI1
gCXdbJ6kMSH6+y+WCmETnqoo6n86a9bcbXEfs36vDpguolw+6I9pqfFUC01StjYa
bm+Ffxf9QYtuxXV51Z2VWHWtbCmYdSDxx4oeXKduLDiw65vWovcY+5KzbA==
=HJUK
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- x86 bugfixes
- Documentation fixes
- Avoid performance regression due to SEV-ES patches
- ARM:
- Don't allow tagged pointers to point to memslots
- Filter out ARMv8.1+ PMU events on v8.0 hardware
- Hide PMU registers from userspace when no PMU is configured
- More PMU cleanups
- Don't try to handle broken PSCI firmware
- More sys_reg() to reg_to_encoding() conversions
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX
KVM: x86: Revert "KVM: x86: Mark GPRs dirty when written"
KVM: SVM: Unconditionally sync GPRs to GHCB on VMRUN of SEV-ES guest
KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration
kvm: tracing: Fix unmatched kvm_entry and kvm_exit events
KVM: Documentation: Update description of KVM_{GET,CLEAR}_DIRTY_LOG
KVM: x86: get smi pending status correctly
KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[]
KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()
KVM: x86: Add more protection against undefined behavior in rsvd_bits()
KVM: Documentation: Fix spec for KVM_CAP_ENABLE_CAP_VM
KVM: Forbid the use of tagged userspace addresses for memslots
KVM: arm64: Filter out v8.1+ events on v8.0 HW
KVM: arm64: Compute TPIDR_EL2 ignoring MTE tag
KVM: arm64: Use the reg_to_encoding() macro instead of sys_reg()
KVM: arm64: Allow PSCI SYSTEM_OFF/RESET to return
KVM: arm64: Simplify handling of absent PMU system registers
KVM: arm64: Hide PMU registers from userspace when not available
The OBJECT_FILES_NON_STANDARD annotation is used to tell objtool to
ignore a file. File-level ignores won't work when validating vmlinux.o.
Instead, convert restore_image() and core_restore_code() to be ELF
functions. Their code is conventional enough for objtool to be able to
understand them.
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/974f8ceb5385e470f72e93974c70ab5c894bb0dc.1611263462.git.jpoimboe@redhat.com
Because restore_registers() is page-aligned, the assembler inexplicably
adds an unreachable jump from after the end of the previous function to
the beginning of restore_registers().
That confuses objtool, understandably. It also creates significant text
fragmentation. As a result, most of the object file is wasted text
(nops).
Move restore_registers() to the beginning of the file to both prevent
the text fragmentation and avoid the dead jump instruction.
$ size /tmp/hibernate_asm_64.before.o /tmp/hibernate_asm_64.after.o
text data bss dec hex filename
4415 0 0 4415 113f /tmp/hibernate_asm_64.before.o
524 0 0 524 20c /tmp/hibernate_asm_64.after.o
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/8c7f634201d26453d73fe55032cbbdc05d004387.1611263462.git.jpoimboe@redhat.com
The OBJECT_FILES_NON_STANDARD annotation is used to tell objtool to
ignore a file. File-level ignores won't work when validating vmlinux.o.
Instead, tell objtool to ignore do_suspend_lowlevel() directly with the
STACK_FRAME_NON_STANDARD annotation.
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/269eda576c53bc9ecc8167c211989111013a67aa.1611263462.git.jpoimboe@redhat.com
With objtool vmlinux.o validation of return_to_handler(), now that
objtool has visibility inside the retpoline, jumping from EMPTY state to
a proper function state results in a stack state mismatch.
return_to_handler() is actually quite normal despite the underlying
magic. Just annotate it as a normal function.
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/14f48e623f61dbdcd84cf27a56ed8ccae73199ef.1611263462.git.jpoimboe@redhat.com
The Xen hypercall page is filled with zeros, causing objtool to fall
through all the empty hypercall functions until it reaches a real
function, resulting in a stack state mismatch.
The build-time contents of the hypercall page don't matter because the
page gets rewritten by the hypervisor. Make it more palatable to
objtool by making each hypervisor function a true empty function, with
nops and a return.
Cc: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/0883bde1d7a1fb3b6a4c952bc0200e873752f609.1611263462.git.jpoimboe@redhat.com
The OBJECT_FILES_NON_STANDARD annotation is used to tell objtool to
ignore a file. File-level ignores won't work when validating vmlinux.o.
Tweak the ELF metadata and unwind hints to allow objtool to follow the
code.
Cc: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/8b042a09c69e8645f3b133ef6653ba28f896807d.1611263462.git.jpoimboe@redhat.com
The ORC metadata generated for UNWIND_HINT_FUNC isn't actually very
func-like. With certain usages it can cause stack state mismatches
because it doesn't set the return address (CFI_RA).
Also, users of UNWIND_HINT_RET_OFFSET no longer need to set a custom
return stack offset. Instead they just need to specify a func-like
situation, so the current ret_offset code is hacky for no good reason.
Solve both problems by simplifying the RET_OFFSET handling and
converting it into a more useful UNWIND_HINT_FUNC.
If we end up needing the old 'ret_offset' functionality again in the
future, we should be able to support it pretty easily with the addition
of a custom 'sp_offset' in UNWIND_HINT_FUNC.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/db9d1f5d79dddfbb3725ef6d8ec3477ad199948d.1611263462.git.jpoimboe@redhat.com
VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS,
which may need to be loaded outside guest mode. Therefore we cannot
WARN in that case.
However, that part of nested_get_vmcs12_pages is _not_ needed at
vmentry time. Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling,
so that both vmentry and migration (and in the latter case, independent
of is_guest_mode) do the parts that are needed.
Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3ba: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
Cc: <stable@vger.kernel.org> # 5.10.x
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Revert the dirty/available tracking of GPRs now that KVM copies the GPRs
to the GHCB on any post-VMGEXIT VMRUN, even if a GPR is not dirty. Per
commit de3cd117ed ("KVM: x86: Omit caching logic for always-available
GPRs"), tracking for GPRs noticeably impacts KVM's code footprint.
This reverts commit 1c04d8c986.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210122235049.3107620-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the per-GPR dirty checks when synchronizing GPRs to the GHCB, the
GRPs' dirty bits are set from time zero and never cleared, i.e. will
always be seen as dirty. The obvious alternative would be to clear
the dirty bits when appropriate, but removing the dirty checks is
desirable as it allows reverting GPR dirty+available tracking, which
adds overhead to all flavors of x86 VMs.
Note, unconditionally writing the GPRs in the GHCB is tacitly allowed
by the GHCB spec, which allows the hypervisor (or guest) to provide
unnecessary info; it's the guest's responsibility to consume only what
it needs (the hypervisor is untrusted after all).
The guest and hypervisor can supply additional state if desired but
must not rely on that additional state being provided.
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210122235049.3107620-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Even when we are outside the nested guest, some vmcs02 fields
may not be in sync vs vmcs12. This is intentional, even across
nested VM-exit, because the sync can be delayed until the nested
hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those
rarely accessed fields.
However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to
be able to restore it. To fix that, call copy_vmcs02_to_vmcs12_rare()
before the vmcs12 contents are copied to userspace.
Fixes: 7952d769c2 ("KVM: nVMX: Sync rarely accessed guest fields only when needed")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On VMX, if we exit and then re-enter immediately without leaving
the vmx_vcpu_run() function, the kvm_entry event is not logged.
That means we will see one (or more) kvm_exit, without its (their)
corresponding kvm_entry, as shown here:
CPU-1979 [002] 89.871187: kvm_entry: vcpu 1
CPU-1979 [002] 89.871218: kvm_exit: reason MSR_WRITE
CPU-1979 [002] 89.871259: kvm_exit: reason MSR_WRITE
It also seems possible for a kvm_entry event to be logged, but then
we leave vmx_vcpu_run() right away (if vmx->emulation_required is
true). In this case, we will have a spurious kvm_entry event in the
trace.
Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run()
(where trace_kvm_exit() already is).
A trace obtained with this patch applied looks like this:
CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0
CPU-14295 [000] 8388.395392: kvm_exit: reason MSR_WRITE
CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0
CPU-14295 [000] 8388.395503: kvm_exit: reason EXTERNAL_INTERRUPT
Of course, not calling trace_kvm_entry() in common x86 code any
longer means that we need to adjust the SVM side of things too.
Signed-off-by: Lorenzo Brescia <lorenzo.brescia@edu.unito.it>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The injection process of smi has two steps:
Qemu KVM
Step1:
cpu->interrupt_request &= \
~CPU_INTERRUPT_SMI;
kvm_vcpu_ioctl(cpu, KVM_SMI)
call kvm_vcpu_ioctl_smi() and
kvm_make_request(KVM_REQ_SMI, vcpu);
Step2:
kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
call process_smi() if
kvm_check_request(KVM_REQ_SMI, vcpu) is
true, mark vcpu->arch.smi_pending = true;
The vcpu->arch.smi_pending will be set true in step2, unfortunately if
vcpu paused between step1 and step2, the kvm_run->immediate_exit will be
set and vcpu has to exit to Qemu immediately during step2 before mark
vcpu->arch.smi_pending true.
During VM migration, Qemu will get the smi pending status from KVM using
KVM_GET_VCPU_EVENTS ioctl at the downtime, then the smi pending status
will be lost.
Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
Signed-off-by: Shengen Zhuang <zhuangshengen@huawei.com>
Message-Id: <20210118084720.1585-1-jianjay.zhou@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The HW_REF_CPU_CYCLES event on the fixed counter 2 is pseudo-encoded as
0x0300 in the intel_perfmon_event_map[]. Correct its usage.
Fixes: 62079d8a43 ("KVM: PMU: add proper support for fixed counter 2")
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20201230081916.63417-1-like.xu@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since we know vPMU will not work properly when (1) the guest bit_width(s)
of the [gp|fixed] counters are greater than the host ones, or (2) guest
requested architectural events exceeds the range supported by the host, so
we can setup a smaller left shift value and refresh the guest cpuid entry,
thus fixing the following UBSAN shift-out-of-bounds warning:
shift exponent 197 is too large for 64-bit type 'long long unsigned int'
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
intel_pmu_refresh.cold+0x75/0x99 arch/x86/kvm/vmx/pmu_intel.c:348
kvm_vcpu_after_set_cpuid+0x65a/0xf80 arch/x86/kvm/cpuid.c:177
kvm_vcpu_ioctl_set_cpuid2+0x160/0x440 arch/x86/kvm/cpuid.c:308
kvm_arch_vcpu_ioctl+0x11b6/0x2d70 arch/x86/kvm/x86.c:4709
kvm_vcpu_ioctl+0x7b9/0xdb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3386
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported-by: syzbot+ae488dc136a4cc6ba32b@syzkaller.appspotmail.com
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210118025800.34620-1-like.xu@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add compile-time asserts in rsvd_bits() to guard against KVM passing in
garbage hardcoded values, and cap the upper bound at '63' for dynamic
values to prevent generating a mask that would overflow a u64.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210113204515.3473079-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The ACPI-enabled Intel MID platforms neither have WDAT table nor proper IDs
to instantiate watchdog device. In order to keep them working move the board
code from arch/x86 to drivers/platform/x86.
Note, the complete SFI support is going to be removed, that's why PDx86
has been chosen as a new home for it. This is the only device which needs
additional code so far.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Merge misc fixes from Andrew Morton:
"18 patches.
Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
memory-failure, and highmem), ubsan, proc, and MAINTAINERS"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add a couple more files to the Clang/LLVM section
proc_sysctl: fix oops caused by incorrect command parameters
powerpc/mm/highmem: use __set_pte_at() for kmap_local()
mips/mm/highmem: use set_pte() for kmap_local()
mm/highmem: prepare for overriding set_pte_at()
sparc/mm/highmem: flush cache and TLB
mm: fix page reference leak in soft_offline_page()
ubsan: disable unsigned-overflow check for i386
kasan, mm: fix resetting page_alloc tags for HW_TAGS
kasan, mm: fix conflicts with init_on_alloc/free
kasan: fix HW_TAGS boot parameters
kasan: fix incorrect arguments passing in kasan_add_zero_shadow
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
mm: fix numa stats for thp migration
mm: memcg: fix memcg file_dirty numa stat
mm: memcg/slab: optimize objcg stock draining
mm: fix initialization of struct page for holes in memory layout
x86/setup: don't remove E820_TYPE_RAM for pfn 0
single CPU vs such which are affine to only one CPU, mark per-cpu workqueue
threads as such and make sure that marking "survives" CPU hotplug. Fix CPU
hotplug issues with such kthreads.
- A fix to not push away tasks on CPUs coming online.
- Have workqueue CPU hotplug code use cpu_possible_mask when breaking affinity
on CPU offlining so that pending workers can finish on newly arrived onlined
CPUs too.
- Dump tasks which haven't vacated a CPU which is currently being unplugged.
- Register a special scale invariance callback which gets called on resume
from RAM to read out APERF/MPERF after resume and thus make the schedutil
scaling governor more precise.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmANYCAACgkQEsHwGGHe
VUo+OBAAjfqkijDlXiGX6lrT5gRx5NZICpeMgbWa7J13XHT1ysD/b0fMGFIUyF6k
aszDLTl8U/S1/qGAYlzTSPAFcdZ+ENiFqQ48ozMk4jZC3p0quHTjs/PdiSG6kYBi
+e4smht+bSyLKxsG8hN0kJ+mLEd+uIQ13kP4YkxPgWbJ9WNP/U6HHGBo0rBchtSe
Kn6bdd8CfwmC6rSazp7kdQoFoWeQaoMI1ODX3VphK1GtL1wq8WSICzRhpg3caeyG
3lCIddoNW9mCA9Nkc6R6HeV3uW9JGkPAjnmtTIEHDbg9pib7xNT978ieTQuqNDCi
DlAHDGumzoaiVJZhD/1fj/RXMJr2YUHxtrXWNsXpiKJ9g8Tn+WC0UW/4+Mx2L/km
0RSoXJlMs1fGopS2I/fObZ6RPhmg4D+gJsMCdaHQzX4NgxZAGhNNPxMckZ0IM8A0
2NNXSHUZHVTHeJEW0E/glOcpWb5hG+vDwiBMNEWfTwYpTfrw2EEOZaKniZE7WlSL
4ItM9rkLGl1KToJzAH4A0oUtSy3vtSCo8B1noGlc09Lj+oCIBlr81z9+C79a2oxG
qE7Xd4X7y7Qs3JeCbRZWQa7/2Kf1v4XnjELrJJeCZC85r0ZqJDwRX8w7lkmW2XPU
m4J2prr/DDZSqrRh23/xC1fsU+vcBKSfKUFKAH4Lg2VIaUfSUEk=
=2DAF
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Correct the marking of kthreads which are supposed to run on a
specific, single CPU vs such which are affine to only one CPU, mark
per-cpu workqueue threads as such and make sure that marking
"survives" CPU hotplug. Fix CPU hotplug issues with such kthreads.
- A fix to not push away tasks on CPUs coming online.
- Have workqueue CPU hotplug code use cpu_possible_mask when breaking
affinity on CPU offlining so that pending workers can finish on newly
arrived onlined CPUs too.
- Dump tasks which haven't vacated a CPU which is currently being
unplugged.
- Register a special scale invariance callback which gets called on
resume from RAM to read out APERF/MPERF after resume and thus make
the schedutil scaling governor more precise.
* tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Relax the set_cpus_allowed_ptr() semantics
sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
sched: Prepare to use balance_push in ttwu()
workqueue: Restrict affinity change to rescuer
workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
kthread: Extract KTHREAD_IS_PER_CPU
sched: Don't run cpu-online with balance_push() enabled
workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
sched/core: Print out straggler tasks in sched_cpu_dying()
x86: PM: Register syscore_ops for scale invariance
- Differentiate which aspects of the FPU state get saved/restored when the FPU
is used in-kernel and fix a boot crash on K7 due to early MXCSR access before
CR4.OSFXSR is even set.
- A couple of noinstr annotation fixes
- Correct die ID setting on AMD for users of topology information which need
the correct die ID
- A SEV-ES fix to handle string port IO to/from kernel memory properly
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmANUr0ACgkQEsHwGGHe
VUos4hAAlBik/z+y+DaZGJyxtpST2YQaEbwbW3UMqyLsdVnLTTRnKzC1T+fEfD2Q
SxtCPYH5iuPbCgOOoQboWt6Aa53JlX9bRBZ/87Ub/ELJ9NgMxMQFXAiaDZAAY6Zy
L2B13KpoGOifPjrGDgksnafyqYv1CYesiArfOffHgvC3/0j7ONdda2SRDQ697TBw
FSV/WfUjCo0+JdXRRaP6YH5t9MxFerHxVH38xTDFwXikS9CVyddosLo5EP2wAQvi
5+160i2jB25vyMEsFBr5wE0xDpWLUdClVpzHXXPG2i0P+NHATiBcreTMPzeYOUXu
Hfc/y4ukOVDoMGlHLNKHq89alI87soMJIEjm2sAG1ZIypKyMJw7YUXQNRR3TcP0U
c7/C3W1mCWD1+8nLtlIMM0Z20DacQOf9YWko95+uh08+S52KpTOgnx+mpoZjK1PQ
Wv9HxPJKycrgRNhfverN5FSiOEW/DdvqNfVHTjuuzNLyKdM1NoZ/YTIyABk4RfFq
USUnC5rk4GqvCYdaLTEKkAJvLCmRKgVYd75Rc4/pPKILS6kv82vpj3BjClBaH0h1
yrvpafvXzOhwKP/J5q0vm57NJdqPZwuW4Ah+74tptmQL4rga84U4FOs3JpNJq0uu
1mj6xSFD8ZyI11BSkYbZAHTy1eNERze+azftCSPq/6EifYvqnsE=
=3rZM
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add a new Intel model number for Alder Lake
- Differentiate which aspects of the FPU state get saved/restored when
the FPU is used in-kernel and fix a boot crash on K7 due to early
MXCSR access before CR4.OSFXSR is even set.
- A couple of noinstr annotation fixes
- Correct die ID setting on AMD for users of topology information which
need the correct die ID
- A SEV-ES fix to handle string port IO to/from kernel memory properly
* tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add another Alder Lake CPU to the Intel family
x86/mmx: Use KFPU_387 for MMX string operations
x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
x86/topology: Make __max_die_per_package available unconditionally
x86: __always_inline __{rd,wr}msr()
x86/mce: Remove explicit/superfluous tracing
locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP
locking/lockdep: Cure noinstr fail
x86/sev: Fix nonistr violation
x86/entry: Fix noinstr fail
x86/cpu/amd: Set __max_die_per_package on AMD
x86/sev-es: Handle string port IO to kernel memory properly
Patch series "mm: fix initialization of struct page for holes in memory layout", v3.
Commit 73a6e474cb ("mm: memmap_init: iterate over memblock regions
rather that check each PFN") exposed several issues with the memory map
initialization and these patches fix those issues.
Initially there were crashes during compaction that Qian Cai reported
back in April [1]. It seemed back then that the problem was fixed, but
a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an
additional discussion at [3].
[1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
[2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com
[3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org
This patch (of 2):
The first 4Kb of memory is a BIOS owned area and to avoid its allocation
for the kernel it was not listed in e820 tables as memory. As the result,
pfn 0 was never recognised by the generic memory management and it is not
a part of neither node 0 nor ZONE_DMA.
If set_pfnblock_flags_mask() would be ever called for the pageblock
corresponding to the first 2Mbytes of memory, having pfn 0 outside of
ZONE_DMA would trigger
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
Along with reserving the first 4Kb in e820 tables, several first pages are
reserved with memblock in several places during setup_arch(). These
reservations are enough to ensure the kernel does not touch the BIOS area
and it is not necessary to remove E820_TYPE_RAM for pfn 0.
Remove the update of e820 table that changes the type of pfn 0 and move
the comment describing why it was done to trim_low_memory_range() that
reserves the beginning of the memory.
Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This implements the missing mount_setattr() syscall. While the new mount
api allows to change the properties of a superblock there is currently
no way to change the properties of a mount or a mount tree using file
descriptors which the new mount api is based on. In addition the old
mount api has the restriction that mount options cannot be applied
recursively. This hasn't changed since changing mount options on a
per-mount basis was implemented in [1] and has been a frequent request
not just for convenience but also for security reasons. The legacy
mount syscall is unable to accommodate this behavior without introducing
a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
mount. Changing MS_REC to apply to the whole mount tree would mean
introducing a significant uapi change and would likely cause significant
regressions.
The new mount_setattr() syscall allows to recursively clear and set
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:
int mount_setattr(int dfd, const char *path, unsigned flags,
struct mount_attr *uattr, size_t usize);
Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.
The mount_setattr() syscall can be expected to grow over time and is
designed with extensibility in mind. It follows the extensible syscall
pattern we have used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:
struct mount_attr {
__u64 attr_set;
__u64 attr_clr;
__u64 propagation;
__u64 userns_fd;
};
The @attr_set and @attr_clr members are used to clear and set mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.
Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
not a bitmap, users wanting to transition to a different atime setting
cannot simply specify the atime setting in @attr_set, but must also
specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
@attr_clr.
The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.
The @userns_fd field lets user specify a user namespace whose idmapping
becomes the idmapping of the mount. This is implemented and explained in
detail in the next patch.
[1]: commit 2e4b7fcd92 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Taking ownership of the FPU in kernel mode disables preemption, and
this may result in excessive scheduling blackouts if the size of the
data being processed on the FPU is unbounded.
Given that taking and releasing the FPU is cheap these days on x86, we
can limit the impact of this issue easily for skcipher implementations,
by moving the FPU begin/end calls inside the skcipher walk processing
loop. Considering that skcipher walks operate on at most one page at a
time, doing so fully mitigates this issue.
This also permits the skcipher walk logic to use non-atomic kmalloc()
calls etc so we can change the 'atomic' bool argument in the calls to
skcipher_walk_virt() to false as well.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Indirect calls are very expensive on x86, so use a static call to set
the system-wide AES-NI/CTR asm helper.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add Alder Lake mobile CPU model number to Intel family.
Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210121215004.11618-1-tony.luck@intel.com
The implementation was rather buggy. It unconditionally marked PTEs
read-only, even for VM_SHARED mappings. I'm not sure whether this is
actually a problem, but it certainly seems unwise. More importantly, it
released the mmap lock before flushing the TLB, which could allow a racing
CoW operation to falsely believe that the underlying memory was not
writable.
I can't find any users at all of this mechanism, so just remove it.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Stas Sergeev <stsp2@yandex.ru>
Link: https://lkml.kernel.org/r/f3086de0babcab36f69949b5780bde851f719bc8.1611078018.git.luto@kernel.org
device_initcall() expects a function of type initcall_t, which returns
an integer. Change the signature of sgx_init() to match.
Fixes: e7e0545299 ("x86/sgx: Initialize metadata for Enclave Page Cache (EPC) sections")
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20210113232311.277302-1-samitolvanen@google.com
The default kernel_fpu_begin() doesn't work on systems that support XMM but
haven't yet enabled CR4.OSFXSR. This causes crashes when _mmx_memcpy() is
called too early because LDMXCSR generates #UD when the aforementioned bit
is clear.
Fix it by using kernel_fpu_begin_mask(KFPU_387) explicitly.
Fixes: 7ad816762f ("x86/fpu: Reset MXCSR to default in kernel_fpu_begin()")
Reported-by: Krzysztof Mazur <krzysiek@podlesie.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Krzysztof Piotr Olędzki <ole@ans.pl>
Tested-by: Krzysztof Mazur <krzysiek@podlesie.net>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/e7bf21855fe99e5f3baa27446e32623358f69e8d.1611205691.git.luto@kernel.org
Currently, requesting kernel FPU access doesn't distinguish which parts of
the extended ("FPU") state are needed. This is nice for simplicity, but
there are a few cases in which it's suboptimal:
- The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do
not use legacy 387 state. These users want MXCSR initialized but don't
care about the FPU control word. Skipping FNINIT would save time.
(Empirically, FNINIT is several times slower than LDMXCSR.)
- Code that wants MMX doesn't want or need MXCSR initialized.
_mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and
initializing MXCSR will fail because LDMXCSR generates an #UD when the
aforementioned CR4 bit is not set.
- Any future in-kernel users of XFD (eXtended Feature Disable)-capable
dynamic states will need special handling.
Add a more specific API that allows callers to specify exactly what they
want.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Krzysztof Piotr Olędzki <ole@ans.pl>
Link: https://lkml.kernel.org/r/aff1cac8b8fc7ee900cf73e8f2369966621b053f.1611205691.git.luto@kernel.org
The x64 bpf jit expects bpf images converge within the given passes, but
it could fail to do so with some corner cases. For example:
l0: ja 40
l1: ja 40
[... repeated ja 40 ]
l39: ja 40
l40: ret #0
This bpf program contains 40 "ja 40" instructions which are effectively
NOPs and designed to be replaced with valid code dynamically. Ideally,
bpf jit should optimize those "ja 40" instructions out when translating
the bpf instructions into x64 machine code. However, do_jit() can only
remove one "ja 40" for offset==0 on each pass, so it requires at least
40 runs to eliminate those JMPs and exceeds the current limit of
passes(20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
is set even though it's legit as a classic socket filter.
To make bpf images more likely converge within 20 passes, this commit
pads some instructions with NOPs in the last 5 passes:
1. conditional jumps
A possible size variance comes from the adoption of imm8 JMP. If the
offset is imm8, we calculate the size difference of this BPF instruction
between the previous and the current pass and fill the gap with NOPs.
To avoid the recalculation of jump offset, those NOPs are inserted before
the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
calculating the NOP number.
2. BPF_JA
There are two conditions for BPF_JA.
a.) nop jumps
If this instruction is not optimized out in the previous pass,
instead of removing it, we insert the equivalent size of NOPs.
b.) label jumps
Similar to condition jumps, we prepend NOPs right before the JMP
code.
To make the code concise, emit_nops() is modified to use the signed len and
return the number of inserted NOPs.
For bpf-to-bpf, we always enable padding for the extra pass since there
is only one extra run and the jump padding doesn't affected the images
that converge without padding.
After applying this patch, the corner case was loaded with the following
jit code:
flen=45 proglen=77 pass=17 image=ffffffffc03367d4 from=jump pid=10097
JIT code: 00000000: 0f 1f 44 00 00 55 48 89 e5 53 41 55 31 c0 45 31
JIT code: 00000010: ed 48 89 fb eb 30 eb 2e eb 2c eb 2a eb 28 eb 26
JIT code: 00000020: eb 24 eb 22 eb 20 eb 1e eb 1c eb 1a eb 18 eb 16
JIT code: 00000030: eb 14 eb 12 eb 10 eb 0e eb 0c eb 0a eb 08 eb 06
JIT code: 00000040: eb 04 eb 02 66 90 31 c0 41 5d 5b c9 c3
0: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5: 55 push rbp
6: 48 89 e5 mov rbp,rsp
9: 53 push rbx
a: 41 55 push r13
c: 31 c0 xor eax,eax
e: 45 31 ed xor r13d,r13d
11: 48 89 fb mov rbx,rdi
14: eb 30 jmp 0x46
16: eb 2e jmp 0x46
...
3e: eb 06 jmp 0x46
40: eb 04 jmp 0x46
42: eb 02 jmp 0x46
44: 66 90 xchg ax,ax
46: 31 c0 xor eax,eax
48: 41 5d pop r13
4a: 5b pop rbx
4b: c9 leave
4c: c3 ret
At the 16th pass, 15 jumps were already optimized out, and one jump was
replaced with NOPs at 44 and the image converged at the 17th pass.
v4:
- Add the detailed comments about the possible padding bytes
v3:
- Copy the instructions of prologue separately or the size calculation
of the first BPF instruction would include the prologue.
- Replace WARN_ONCE() with pr_err() and EFAULT
- Use MAX_PASSES in the for loop condition check
- Remove the "padded" flag from x64_jit_data. For the extra pass of
subprogs, padding is always enabled since it won't hurt the images
that converge without padding.
v2:
- Simplify the sample code in the description and provide the jit code
- Check the expected padding bytes with WARN_ONCE
- Move the 'padded' flag to 'struct x64_jit_data'
Signed-off-by: Gary Lin <glin@suse.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210119102501.511-2-glin@suse.com
Conflicts:
drivers/net/can/dev.c
commit 03f16c5075 ("can: dev: can_restart: fix use after free bug")
commit 3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")
Code move.
drivers/net/dsa/b53/b53_common.c
commit 8e4052c32d ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
commit b7a9e0da2d ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")
Field rename.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYAhFXwAKCRCAXGG7T9hj
vjCeAQCR74dTRa34VJNDWdDME1NMSlFln8bIL+lvrC3/2p2E5QD/eLcgVFdDpeZ9
dMjKzcddPL1mI99SurwNmyai4U9d6A4=
=PbJO
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fix from Juergen Gross:
"A fix for build failure showing up in some configurations"
* tag 'for-linus-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: fix 'nopvspin' build error
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmAGxt0THHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXjMhCACtIFiJERxNFqKLVn1s+2IyA3+Lv0UI
Suyl+tUTgiN3WzPNAbYfd6cYKk/JvAF0KvCtvuFfRS+oRJrT8+ONBBRxSzpDixGO
aZfvzObOstWUQ0XsP2Q+S4YZlsLHtG76sUSsWCB8wdsnG0hujbrQ43M1Vj15EQYj
EKFmsJfxeMSjekaI7QRkU7LR9gZgacCWthCbn4KFgtXCREA1sI5OUnW9NAei1MSG
M/adfAP4xqoBpNjXq+6M2Ha4H5SmtZRlOQ4bk6ZgebhG0T9LT7UBPqaHHed8CcOj
PzreKfYYYGExhJwLKuCji7r6kSGZcZ6kiIMdFm+vliSK/QJeFCFclhdX
=c3XM
-----END PGP SIGNATURE-----
Merge tag 'hyperv-fixes-signed-20210119' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fix from Wei Liu:
"One patch from Dexuan to fix clockevent initialization"
* tag 'hyperv-fixes-signed-20210119' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Initialize clockevents after LAPIC is initialized
EFI on x86_64 keeps track of the process's MM pointer by storing it
in a global struct called 'efi_scratch', which also used to contain
the mixed mode stack pointer. Let's clean this up a little bit, by
getting rid of the struct, and pushing the mm handling into the
callees entirely.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
As a first step to removing the awkward 'struct efi_scratch' definition
that conveniently combines the storage of the mixed mode stack pointer
with the MM pointer variable that records the task's MM pointer while it
is being replaced with the EFI MM one, move the mixed mode stack pointer
into a separate variable.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
On x86 scale invariace tends to be disabled during resume from
suspend-to-RAM, because the MPERF or APERF MSR values are not as
expected then due to updates taking place after the platform
firmware has been invoked to complete the suspend transition.
That, of course, is not desirable, especially if the schedutil
scaling governor is in use, because the lack of scale invariance
causes it to be less reliable.
To counter that effect, modify init_freq_invariance() to register
a syscore_ops object for scale invariance with the ->resume callback
pointing to init_counter_refs() which will run on the CPU starting
the resume transition (the other CPUs will be taken care of the
"online" operations taking place later).
Fixes: e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lkml.kernel.org/r/1803209.Mvru99baaF@kreacher
Fix build error in x86/xen/ when PARAVIRT_SPINLOCKS is not enabled.
Fixes this build error:
../arch/x86/xen/smp_hvm.c: In function ‘xen_hvm_smp_init’:
../arch/x86/xen/smp_hvm.c:77:3: error: ‘nopvspin’ undeclared (first use in this function)
nopvspin = true;
Fixes: 3d7746bea9 ("x86/xen: Fix xen_hvm_smp_init() when vector callback not available")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20210115191123.27572-1-rdunlap@infradead.org
Signed-off-by: Juergen Gross <jgross@suse.com>
With commit 4df4cb9e99, the Hyper-V direct-mode STIMER is actually
initialized before LAPIC is initialized: see
apic_intr_mode_init()
x86_platform.apic_post_init()
hyperv_init()
hv_stimer_alloc()
apic_bsp_setup()
setup_local_APIC()
setup_local_APIC() temporarily disables LAPIC, initializes it and
re-eanble it. The direct-mode STIMER depends on LAPIC, and when it's
registered, it can be programmed immediately and the timer can fire
very soon:
hv_stimer_init
clockevents_config_and_register
clockevents_register_device
tick_check_new_device
tick_setup_device
tick_setup_periodic(), tick_setup_oneshot()
clockevents_program_event
When the timer fires in the hypervisor, if the LAPIC is in the
disabled state, new versions of Hyper-V ignore the event and don't inject
the timer interrupt into the VM, and hence the VM hangs when it boots.
Note: when the VM starts/reboots, the LAPIC is pre-enabled by the
firmware, so the window of LAPIC being temporarily disabled is pretty
small, and the issue can only happen once out of 100~200 reboots for
a 40-vCPU VM on one dev host, and on another host the issue doesn't
reproduce after 2000 reboots.
The issue is more noticeable for kdump/kexec, because the LAPIC is
disabled by the first kernel, and stays disabled until the kdump/kexec
kernel enables it. This is especially an issue to a Generation-2 VM
(for which Hyper-V doesn't emulate the PIT timer) when CONFIG_HZ=1000
(rather than CONFIG_HZ=250) is used.
Fix the issue by moving hv_stimer_alloc() to a later place where the
LAPIC timer is initialized.
Fixes: 4df4cb9e99 ("x86/hyperv: Initialize clockevents earlier in CPU onlining")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20210116223136.13892-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-01-16
1) Extend atomic operations to the BPF instruction set along with x86-64 JIT support,
that is, atomic{,64}_{xchg,cmpxchg,fetch_{add,and,or,xor}}, from Brendan Jackman.
2) Add support for using kernel module global variables (__ksym externs in BPF
programs) retrieved via module's BTF, from Andrii Nakryiko.
3) Generalize BPF stackmap's buildid retrieval and add support to have buildid
stored in mmap2 event for perf, from Jiri Olsa.
4) Various fixes for cross-building BPF sefltests out-of-tree which then will
unblock wider automated testing on ARM hardware, from Jean-Philippe Brucker.
5) Allow to retrieve SOL_SOCKET opts from sock_addr progs, from Daniel Borkmann.
6) Clean up driver's XDP buffer init and split into two helpers to init per-
descriptor and non-changing fields during processing, from Lorenzo Bianconi.
7) Minor misc improvements to libbpf & bpftool, from Ian Rogers.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
perf: Add build id data in mmap2 event
bpf: Add size arg to build_id_parse function
bpf: Move stack_map_get_build_id into lib
bpf: Document new atomic instructions
bpf: Add tests for new BPF atomic operations
bpf: Add bitwise atomic instructions
bpf: Pull out a macro for interpreting atomic ALU operations
bpf: Add instructions for atomic_[cmp]xchg
bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
bpf: Move BPF_STX reserved field check into BPF_STX verifier code
bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
bpf: x86: Factor out a lookup table for some ALU opcodes
bpf: x86: Factor out emission of REX byte
bpf: x86: Factor out emission of ModR/M for *(reg + off)
tools/bpftool: Add -Wall when building BPF programs
bpf, libbpf: Avoid unused function warning on bpf_tail_call_static
selftests/bpf: Install btf_dump test cases
selftests/bpf: Fix installation of urandom_read
selftests/bpf: Move generated test files to $(TEST_GEN_FILES)
selftests/bpf: Fix out-of-tree build
...
====================
Link: https://lore.kernel.org/r/20210116012922.17823-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYAGllQAKCRCAXGG7T9hj
vqEtAP9uws/W/JPcnsohK76hMcFAVxZCVdX7C3HvfW5tp6hqMgEAg9ic8sYiuHhn
6FouRu/ZXHJEg3PpS5W66yKNIYPvGgw=
=rR+L
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- A series to fix a regression when running as a fully virtualized
guest on an old Xen hypervisor not supporting PV interrupt callbacks
for HVM guests.
- A patch to add support to query Xen resource sizes (setting was
possible already) from user mode.
* tag 'for-linus-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: Fix xen_hvm_smp_init() when vector callback not available
x86/xen: Don't register Xen IPIs when they aren't going to be used
x86/xen: Add xen_no_vector_callback option to test PCI INTX delivery
xen: Set platform PCI device INTX affinity to CPU0
xen: Fix event channel callback via INTX/GSI
xen/privcmd: allow fetching resource sizes
Look up the LED from a GPIO machine descriptor table. The Geode LEDs
should be on the CS5535 companion chip.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lkml.kernel.org/r/20201124085339.6181-3-linus.walleij@linaro.org
Look up the LED from a GPIO machine descriptor table. The Geode LEDs
should be on the CS5535 companion chip.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lkml.kernel.org/r/20201124085339.6181-2-linus.walleij@linaro.org
Look up the LED from a GPIO machine descriptor table. The Geode LEDs
should be on the CS5535 companion chip.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lkml.kernel.org/r/20201124085339.6181-1-linus.walleij@linaro.org
Defining DEBUG should only be done in development. So remove it.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20210114212827.47584-1-trix@redhat.com
This adds instructions for
atomic[64]_[fetch_]and
atomic[64]_[fetch_]or
atomic[64]_[fetch_]xor
All these operations are isomorphic enough to implement with the same
verifier, interpreter, and x86 JIT code, hence being a single commit.
The main interesting thing here is that x86 doesn't directly support
the fetch_ version these operations, so we need to generate a CMPXCHG
loop in the JIT. This requires the use of two temporary registers,
IIUC it's safe to use BPF_REG_AX and x86's AUX_REG for this purpose.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-10-jackmanb@google.com
This adds two atomic opcodes, both of which include the BPF_FETCH
flag. XCHG without the BPF_FETCH flag would naturally encode
atomic_set. This is not supported because it would be of limited
value to userspace (it doesn't imply any barriers). CMPXCHG without
BPF_FETCH woulud be an atomic compare-and-write. We don't have such
an operation in the kernel so it isn't provided to BPF either.
There are two significant design decisions made for the CMPXCHG
instruction:
- To solve the issue that this operation fundamentally has 3
operands, but we only have two register fields. Therefore the
operand we compare against (the kernel's API calls it 'old') is
hard-coded to be R0. x86 has similar design (and A64 doesn't
have this problem).
A potential alternative might be to encode the other operand's
register number in the immediate field.
- The kernel's atomic_cmpxchg returns the old value, while the C11
userspace APIs return a boolean indicating the comparison
result. Which should BPF do? A64 returns the old value. x86 returns
the old value in the hard-coded register (and also sets a
flag). That means return-old-value is easier to JIT, so that's
what we use.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-8-jackmanb@google.com
The BPF_FETCH field can be set in bpf_insn.imm, for BPF_ATOMIC
instructions, in order to have the previous value of the
atomically-modified memory location loaded into the src register
after an atomic op is carried out.
Suggested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-7-jackmanb@google.com
A subsequent patch will add additional atomic operations. These new
operations will use the same opcode field as the existing XADD, with
the immediate discriminating different operations.
In preparation, rename the instruction mode BPF_ATOMIC and start
calling the zero immediate BPF_ADD.
This is possible (doesn't break existing valid BPF progs) because the
immediate field is currently reserved MBZ and BPF_ADD is zero.
All uses are removed from the tree but the BPF_XADD definition is
kept around to avoid breaking builds for people including kernel
headers.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-5-jackmanb@google.com
A later commit will need to lookup a subset of these opcodes. To
avoid duplicating code, pull out a table.
The shift opcodes won't be needed by that later commit, but they're
already duplicated, so fold them into the table anyway.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-4-jackmanb@google.com
The JIT case for encoding atomic ops is about to get more
complicated. In order to make the review & resulting code easier,
let's factor out some shared helpers.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-3-jackmanb@google.com
The case for JITing atomics is about to get more complicated. Let's
factor out some common code to make the review and result more
readable.
NB the atomics code doesn't yet use the new helper - a subsequent
patch will add its use as a side-effect of other changes.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210114181751.768687-2-jackmanb@google.com
Arnd found a randconfig that produces the warning:
arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at
offset 0x3e
when building with LLVM_IAS=1 (Clang's integrated assembler). Josh
notes:
With the LLVM assembler not generating section symbols, objtool has no
way to reference this code when it generates ORC unwinder entries,
because this code is outside of any ELF function.
The limitation now being imposed by objtool is that all code must be
contained in an ELF symbol. And .L symbols don't create such symbols.
So basically, you can use an .L symbol *inside* a function or a code
segment, you just can't use the .L symbol to contain the code using a
SYM_*_START/END annotation pair.
Fangrui notes that this optimization is helpful for reducing image size
when compiling with -ffunction-sections and -fdata-sections. I have
observed on the order of tens of thousands of symbols for the kernel
images built with those flags.
A patch has been authored against GNU binutils to match this behavior
of not generating unused section symbols ([1]), so this will
also become a problem for users of GNU binutils once they upgrade to 2.36.
Omit the .L prefix on a label so that the assembler will emit an entry
into the symbol table for the label, with STB_LOCAL binding. This
enables objtool to generate proper unwind info here with LLVM_IAS=1 or
GNU binutils 2.36+.
[ bp: Massage commit message. ]
Reported-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Suggested-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20210112194625.4181814-1-ndesaulniers@google.com
Link: https://github.com/ClangBuiltLinux/linux/issues/1209
Link: https://reviews.llvm.org/D93783
Link: https://sourceware.org/binutils/docs/as/Symbol-Names.html
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1408485ce69f844dcd7ded093a8 [1]
Move it outside of CONFIG_SMP in order to avoid ifdeffery at the usage
sites.
Fixes: 76e2fc63ca ("x86/cpu/amd: Set __max_die_per_package on AMD")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210114111814.5346-1-bp@alien8.de
The registers used to determine which die a pci bus belongs to don't
contain enough information to uniquely specify more than 8 dies, so
when more than 8 dies are present, use NUMA information instead.
Continue to use the previous method for 8 or fewer because it
works there, and covers cases of NUMA being disabled.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lkml.kernel.org/r/20210108153549.108989-3-steve.wahl@hpe.com
The phys_id isn't really used other than to map to a logical die id.
Calculate the logical die id earlier, and store that instead of the
phys_id.
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lkml.kernel.org/r/20210108153549.108989-2-steve.wahl@hpe.com
The Camellia, Serpent and Twofish related header files only contain
declarations that are shared between different implementations of the
respective algorithms residing under arch/x86/crypto, and none of their
contents should be used elsewhere. So move the header files into the
same location, and use local #includes instead.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
All dependencies on the x86 glue helper module have been replaced by
local instantiations of the new ECB/CBC preprocessor helper macros, so
the glue helper module can be retired.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Replace the glue helper dependency with implementations of ECB and CBC
based on the new CPP macros, which avoid the need for indirect calls.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Replace the glue helper dependency with implementations of ECB and CBC
based on the new CPP macros, which avoid the need for indirect calls.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Replace the glue helper dependency with implementations of ECB and CBC
based on the new CPP macros, which avoid the need for indirect calls.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Replace the glue helper dependency with implementations of ECB and CBC
based on the new CPP macros, which avoid the need for indirect calls.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Replace the glue helper dependency with implementations of ECB and CBC
based on the new CPP macros, which avoid the need for indirect calls.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86 glue helper module is starting to show its age:
- It relies heavily on function pointers to invoke asm helper functions that
operate on fixed input sizes that are relatively small. This means the
performance is severely impacted by retpolines.
- It goes to great lengths to amortize the cost of kernel_fpu_begin()/end()
over as much work as possible, which is no longer necessary now that FPU
save/restore is done lazily, and doing so may cause unbounded scheduling
blackouts due to the fact that enabling the FPU in kernel mode disables
preemption.
- The CBC mode decryption helper makes backward strides through the input, in
order to avoid a single block size memcpy() between chunks. Consuming the
input in this manner is highly likely to defeat any hardware prefetchers,
so it is better to go through the data linearly, and perform the extra
memcpy() where needed (which is turned into direct loads and stores by the
compiler anyway). Note that benchmarks won't show this effect, given that
the memory they use is always cache hot.
- It implements blockwise XOR in terms of le128 pointers, which imply an
alignment that is not guaranteed by the API, violating the C standard.
GCC does not seem to be smart enough to elide the indirect calls when the
function pointers are passed as arguments to static inline helper routines
modeled after the existing ones. So instead, let's create some CPP macros
that encapsulate the core of the ECB and CBC processing, so we can wire
them up for existing users of the glue helper module, i.e., Camellia,
Serpent, Twofish and CAST6.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Blowfish in counter mode is never used in the kernel, so there
is no point in keeping an accelerated implementation around.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
DES or Triple DES in counter mode is never used in the kernel, so there
is no point in keeping an accelerated implementation around.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The glue helper's CTR routines are no longer used, so drop them.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Twofish in CTR mode is never used by the kernel directly, and is highly
unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop
the accelerated CTR mode implementation, and instead, rely on the CTR
template and the bare cipher.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
CAST6 in CTR mode is never used by the kernel directly, and is highly
unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop
the accelerated CTR mode implementation, and instead, rely on the CTR
template and the bare cipher.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
CAST5 in CTR mode is never used by the kernel directly, and is highly
unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop
the accelerated CTR mode implementation, and instead, rely on the CTR
template and the bare cipher.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Serpent in CTR mode is never used by the kernel directly, and is highly
unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop
the accelerated CTR mode implementation, and instead, rely on the CTR
template and the bare cipher.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Camellia in CTR mode is never used by the kernel directly, and is highly
unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop
the accelerated CTR mode implementation, and instead, rely on the CTR
template and the bare cipher.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The glue helper's XTS routines are no longer used, so drop them.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now that the XTS template can wrap accelerated ECB modes, it can be
used to implement Twofish in XTS mode as well, which turns out to
be at least as fast, and sometimes even faster
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now that the XTS template can wrap accelerated ECB modes, it can be
used to implement Serpent in XTS mode as well, which turns out to
be at least as fast, and sometimes even faster
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now that the XTS template can wrap accelerated ECB modes, it can be
used to implement CAST6 in XTS mode as well, which turns out to
be at least as fast, and sometimes even faster
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now that the XTS template can wrap accelerated ECB modes, it can be
used to implement Camellia in XTS mode as well, which turns out to
be at least as fast, and sometimes even faster.
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Replace the function pointers in the GCM implementation with static branches,
which are based on code patching, which occurs only at module load time.
This avoids the severe performance penalty caused by the use of retpolines.
In order to retain the ability to switch between different versions of the
implementation based on the input size on cores that support AVX and AVX2,
use static branches instead of static calls.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Currently, the gcm(aes-ni) driver open codes the scatterlist handling
that is encapsulated by the skcipher walk API. So let's switch to that
instead.
Also, move the handling at the end of gcmaes_crypt_by_sg() that is
dependent on whether we are encrypting or decrypting into the callers,
which always do one or the other.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The gcm(aes-ni) driver is only built for x86_64, which does not make
use of highmem. So testing for PageHighMem is pointless and can be
omitted.
While at it, replace GFP_ATOMIC with the appropriate runtime decided
value based on the context.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Drop some prototypes that are declared but never called.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The GCM mode driver uses 16 byte aligned buffers on the stack to pass
the IV to the asm helpers, but unfortunately, the x86 port does not
guarantee that the stack pointer is 16 byte aligned upon entry in the
first place. Since the compiler is not aware of this, it will not emit
the additional stack realignment sequence that is needed, and so the
alignment is not guaranteed to be more than 8 bytes.
So instead, allocate some padding on the stack, and realign the IV
pointer by hand.
Cc: <stable@vger.kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Running instruction decoder posttest on an s390 host with an x86 target
with allyesconfig shows errors. Instructions used in a couple of kernel
objects could not be correctly decoded on big endian system.
insn_decoder_test: warning: objdump says 6 bytes, but insn_get_length() says 5
insn_decoder_test: warning: Found an x86 instruction decoder bug, please report this.
insn_decoder_test: warning: ffffffff831eb4e1: 62 d1 fd 48 7f 04 24 vmovdqa64 %zmm0,(%r12)
insn_decoder_test: warning: objdump says 7 bytes, but insn_get_length() says 6
insn_decoder_test: warning: Found an x86 instruction decoder bug, please report this.
insn_decoder_test: warning: ffffffff831eb4e8: 62 51 fd 48 7f 44 24 01 vmovdqa64 %zmm8,0x40(%r12)
insn_decoder_test: warning: objdump says 8 bytes, but insn_get_length() says 6
This is because in a few places instruction field bytes are set directly
with further usage of "value". To address that introduce and use a
insn_set_byte() helper, which correctly updates "value" on big endian
systems.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Correct objtool orc generation endianness problems to enable fully
functional x86 cross-compiles on big endian hardware.
Introduce bswap_if_needed() macro, which does a byte swap if target
endianness doesn't match the host, i.e. cross-compilation for little
endian on big endian and vice versa. The macro is used for conversion
of multi-byte values which are read from / about to be written to a
target native endianness ELF file.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
The x86 instruction decoder code is shared across the kernel source and
the tools. Currently objtool seems to be the only tool from build tools
needed which breaks x86 cross-compilation on big endian systems. Make
the x86 instruction decoder build host endianness agnostic to support
x86 cross-compilation and enable objtool to implement endianness
awareness for big endian architectures support.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Co-developed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Currently the x86 instruction decoder is used from:
- the kernel itself,
- from tools like objtool and perf,
- within x86 tools, i.e. instruction decoder selftests.
The first two cases are similar, because tools headers try to mimic
kernel headers.
Instruction decoder selftests include some of the kernel headers
directly, including uapi headers. This works until headers dependencies
are kept to a minimum and tools are not cross-compiled. Since the goal
of the x86 instruction decoder selftests is not to verify uapi headers,
move it to using tools headers, like is already done for vdso2c tool,
mkpiggy and other tools in arch/x86/boot/.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Only the IPI-related functions in the smp_ops should be conditional
on the vector callback being available. The rest should still happen:
• xen_hvm_smp_prepare_boot_cpu()
This function does two things, both of which should still happen if
there is no vector callback support.
The call to xen_vcpu_setup() for vCPU0 should still happen as it just
sets up the vcpu_info for CPU0. That does happen for the secondary
vCPUs too, from xen_cpu_up_prepare_hvm().
The second thing it does is call xen_init_spinlocks(), which perhaps
counter-intuitively should *also* still be happening in the case
without vector callbacks, so that it can clear its local xen_pvspin
flag and disable the virt_spin_lock_key accordingly.
Checking xen_have_vector_callback in xen_init_spinlocks() itself
would affect PV guests, so set the global nopvspin flag in
xen_hvm_smp_init() instead, when vector callbacks aren't available.
• xen_hvm_smp_prepare_cpus()
This does some IPI-related setup by calling xen_smp_intr_init() and
xen_init_lock_cpu(), which can be made conditional. And it sets the
xen_vcpu_id to XEN_VCPU_ID_INVALID for all possible CPUS, which does
need to happen.
• xen_smp_cpus_done()
This offlines any vCPUs which doesn't fit in the global shared_info
page, if separate vcpu_info placement isn't available. That part also
needs to happen regardless of vector callback support.
• xen_hvm_cpu_die()
This doesn't actually do anything other than commin_cpu_die() right
right now in the !vector_callback case; all three teardown functions
it calls should be no-ops. But to guard against future regressions
it's useful to call it anyway, and for it to explicitly check for
xen_have_vector_callback before calling those additional functions.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210106153958.584169-6-dwmw2@infradead.org
Signed-off-by: Juergen Gross <jgross@suse.com>
In the case where xen_have_vector_callback is false, we still register
the IPI vectors in xen_smp_intr_init() for the secondary CPUs even
though they aren't going to be used. Stop doing that.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210106153958.584169-5-dwmw2@infradead.org
Signed-off-by: Juergen Gross <jgross@suse.com>
It's useful to be able to test non-vector event channel delivery, to make
sure Linux will work properly on older Xen which doesn't have it.
It's also useful for those working on Xen and Xen-compatible hypervisors,
because there are guest kernels still in active use which use PCI INTX
even when vector delivery is available.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210106153958.584169-4-dwmw2@infradead.org
Signed-off-by: Juergen Gross <jgross@suse.com>
When the compiler choses to not inline the trivial MSR helpers:
vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0xce: call to __wrmsr.constprop.14() leaves .noinstr.text section
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/X/bf3gV+BW7kGEsB@hirez.programming.kicks-ass.net
There's some explicit tracing left in exc_machine_check_kernel(),
remove it, as it's already implied by irqentry_nmi_enter().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210106144017.719310466@infradead.org
When the compiler fails to inline, it violates nonisntr:
vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0xc7: call to sev_es_wr_ghcb_msr() leaves .noinstr.text section
Fixes: 4ca68e023b ("x86/sev-es: Handle NMI State")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210106144017.532902065@infradead.org
do_IRQ() has been replaced by common_interrupt() in
fa5e5c4092 ("x86/entry: Use idtentry for interrupts")
Remove its now unused declaration.
Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210103030834.GA15432@haolee.github.io
Set the maximum DIE per package variable on AMD using the
NodesPerProcessor topology value. This will be used by RAPL, among
others, to determine the maximum number of DIEs on the system in order
to do per-DIE manipulations.
[ bp: Productize into a proper patch. ]
Fixes: 028c221ed1 ("x86/CPU/AMD: Save AMD NodeId as cpu_die_id")
Reported-by: Johnathan Smithinovic <johnathan.smithinovic@gmx.at>
Reported-by: Rafael Kitover <rkitover@gmail.com>
Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Johnathan Smithinovic <johnathan.smithinovic@gmx.at>
Tested-by: Rafael Kitover <rkitover@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=210939
Link: https://lkml.kernel.org/r/20210106112106.GE5729@zn.tnic
Link: https://lkml.kernel.org/r/20210111101455.1194-1-bp@alien8.de
Don't assume dest/source buffers are userspace addresses when manually
copying data for string I/O or MOVS MMIO, as {get,put}_user() will fail
if handed a kernel address and ultimately lead to a kernel panic.
When invoking INSB/OUTSB instructions in kernel space in a
SEV-ES-enabled VM, the kernel crashes with the following message:
"SEV-ES: Unsupported exception in #VC instruction emulation - can't continue"
Handle that case properly.
[ bp: Massage commit message. ]
Fixes: f980f9c31a ("x86/sev-es: Compile early handler code into kernel image")
Signed-off-by: Hyunwook (Wooky) Baek <baekhw@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Link: https://lkml.kernel.org/r/20210110071102.2576186-1-baekhw@google.com
A CPU's current task can have its {closid, rmid} fields read locally
while they are being concurrently written to from another CPU.
This can happen anytime __resctrl_sched_in() races with either
__rdtgroup_move_task() or rdt_move_group_tasks().
Prevent load / store tearing for those accesses by giving them the
READ_ONCE() / WRITE_ONCE() treatment.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/9921fda88ad81afb9885b517fbe864a2bc7c35a9.1608243147.git.reinette.chatre@intel.com
James reported in [1] that there could be two tasks running on the same CPU
with task_struct->on_cpu set. Using task_struct->on_cpu as a test if a task
is running on a CPU may thus match the old task for a CPU while the
scheduler is running and IPI it unnecessarily.
task_curr() is the correct helper to use. While doing so move the #ifdef
check of the CONFIG_SMP symbol to be a C conditional used to determine
if this helper should be used to ensure the code is always checked for
correctness by the compiler.
[1] https://lore.kernel.org/lkml/a782d2f3-d2f6-795f-f4b1-9462205fd581@arm.com
Reported-by: James Morse <james.morse@arm.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/e9e68ce1441a73401e08b641cc3b9a3cf13fe6d4.1608243147.git.reinette.chatre@intel.com
Mark the function with the __printf attribute to allow the compiler to
more thoroughly typecheck its arguments against a format string with
-Wformat and similar flags.
[ bp: Massage commit message. ]
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20201221160009.3752017-1-trix@redhat.com
syscalls which take 64-bit arguments to the compat handlers due to
former having a general compat handler. (Brian Gerst)
- Add a forgotten pmd page destructor call to pud_free_pmd_page() where
a pmd page is freed. (Dan Williams)
- Make IN/OUT insns with an u8 immediate port operand handling for
SEV-ES guests more precise by using only the single port byte and not
the whole s32 value of the insn decoder. (Peter Gonda)
- Correct a straddling end range check before returning the proper MTRR
type, when the end address is the same as top of memory. (Ying-Tsun
Huang)
- Change PQR_ASSOC MSR update scheme when moving a task to a resctrl
resource group to avoid significant performance overhead with some
resctrl workloads. (Fenghua Yu)
- Avoid the actual task move overhead when the task is already in the
resource group. (Fenghua Yu)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/61xYACgkQEsHwGGHe
VUrcvQ//dAWAteCC/BXVHpgcWrBOgPrkwv7aAo70bIO50fUj4pHPYbfhOJU1ey7j
5o4FrqdsOVhGfZjQzvT/juLsr9mQHsfszxKpDTLyK3wVtUtIODYXzgiXRc/qfZDO
ozXCVUsUSKJgrIcKTBQbmugK36iZZk+ER+qzUaqd0aq8mocdtSSO8b14uaRJw3MR
vumqmEmEEcyM9XK0UgTLPcf6Uhu+Mlg3YSNkV5Qhu0yiCTJaqeEySsytUcRsnnF/
z8AkxZP03Q65o3aoRoSGZihHNKTkNucbavYp70LkcqopoHlC+XERvya9ANRibLPi
/+s9GQUm4QPg7XRHLB8dXFZ9RY3YGUeE60BUxVZa4vI3pwciPQD5tbvUF3F/jEN0
PYLy/zVlAkDfI6Z8wTl8DNmd8nd/rE0F4p5zayjpQUWsjjfZDrh+GzBl/YsMuYRp
G8dk3tEUc8KREBEccv/YzuVcE0AhX4t1tkn3l2Le5v+4PbwRWBm2uNOiRfn4OM31
iB4E4yCHBnBhTyBA0TkWuHV1TJX6Tb2+0g+D49ZoMGFVoBd8NL6f+dBr0psjX/U+
RsZucit0FcJG2VhJNXEPD+rwNZ6XPfDmIU9GNTAmXUuoKR/kqT8D/NWYkqmKh/Vw
+F2EIgOZVhQVOvLKWRut+4qmQRStm6B3UBJimEDySUJPT72O+dU=
=2/Eq
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.11_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"As expected, fixes started trickling in after the holidays so here is
the accumulated pile of x86 fixes for 5.11:
- A fix for fanotify_mark() missing the conversion of x86_32 native
syscalls which take 64-bit arguments to the compat handlers due to
former having a general compat handler. (Brian Gerst)
- Add a forgotten pmd page destructor call to pud_free_pmd_page()
where a pmd page is freed. (Dan Williams)
- Make IN/OUT insns with an u8 immediate port operand handling for
SEV-ES guests more precise by using only the single port byte and
not the whole s32 value of the insn decoder. (Peter Gonda)
- Correct a straddling end range check before returning the proper
MTRR type, when the end address is the same as top of memory.
(Ying-Tsun Huang)
- Change PQR_ASSOC MSR update scheme when moving a task to a resctrl
resource group to avoid significant performance overhead with some
resctrl workloads. (Fenghua Yu)
- Avoid the actual task move overhead when the task is already in the
resource group. (Fenghua Yu)"
* tag 'x86_urgent_for_v5.11_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Don't move a task to the same resource group
x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR
x86/mtrr: Correct the range check before performing MTRR type lookups
x86/sev-es: Fix SEV-ES OUT/IN immediate opcode vc handling
x86/mm: Fix leak of pmd ptlock
fanotify: Fix sys_fanotify_mark() on native x86-32
* Fixes for the new scalable MMU
* Fixes for migration of nested hypervisors on AMD
* Fix for clang integrated assembler
* Fix for left shift by 64 (UBSAN)
* Small cleanups
* Straggler SEV-ES patch
ARM:
* VM init cleanups
* PSCI relay cleanups
* Kill CONFIG_KVM_ARM_PMU
* Fixup __init annotations
* Fixup reg_to_encoding()
* Fix spurious PMCR_EL0 access
* selftests cleanups
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/4YOMUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMg0Qf/eGDOoLL18KhA9MpoPXusR5kU+82G
AcbFj9siNW46EF7mL+sw/xAx+gZaqSpIEmn/f6BzgiaUBdFTv9CKX3B54e43e59G
HAKD0NpqwvDIi1b0T6bcgC92mY3Qx/IDCc7/9JYjBs/iORfqyWW6xVtkF/Gfymxt
eK+MnfMqqNZODgR/cZnCH1E48fuwOvRMxLqilLi3OOMSUfs2cQOSLTNfYQYqjeaJ
dsQ4YeyPJO5JHtfHFr6VPIo/jDhowniac9CNvOomWWVIx2zPYVSl9d8ub6ESEPNF
GM7UeBCOMmZG/a3qFEZPAUJ7znW4yYE85Z6pjxlrGhd1I54MJi4dd+RApw==
=5Nfj
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"x86:
- Fixes for the new scalable MMU
- Fixes for migration of nested hypervisors on AMD
- Fix for clang integrated assembler
- Fix for left shift by 64 (UBSAN)
- Small cleanups
- Straggler SEV-ES patch
ARM:
- VM init cleanups
- PSCI relay cleanups
- Kill CONFIG_KVM_ARM_PMU
- Fixup __init annotations
- Fixup reg_to_encoding()
- Fix spurious PMCR_EL0 access
Misc:
- selftests cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (38 commits)
KVM: x86: __kvm_vcpu_halt can be static
KVM: SVM: Add support for booting APs in an SEV-ES guest
KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit
KVM: nSVM: mark vmcb as dirty when forcingly leaving the guest mode
KVM: nSVM: correctly restore nested_run_pending on migration
KVM: x86/mmu: Clarify TDP MMU page list invariants
KVM: x86/mmu: Ensure TDP MMU roots are freed after yield
kvm: check tlbs_dirty directly
KVM: x86: change in pv_eoi_get_pending() to make code more readable
MAINTAINERS: Really update email address for Sean Christopherson
KVM: x86: fix shift out of bounds reported by UBSAN
KVM: selftests: Implement perf_test_util more conventionally
KVM: selftests: Use vm_create_with_vcpus in create_vm
KVM: selftests: Factor out guest mode code
KVM/SVM: Remove leftover __svm_vcpu_run prototype from svm.c
KVM: SVM: Add register operand to vmsave call in sev_es_vcpu_load
KVM: x86/mmu: Optimize not-present/MMIO SPTE check in get_mmio_spte()
KVM: x86/mmu: Use raw level to index into MMIO walks' sptes array
KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE
KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte()
...
_PAGE_PAT_LARGE is already defined next to _PAGE_PAT. Remove the
duplicate.
Fixes: 4efb566491 ("x86/mm: Tabulate the page table encoding definitions")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20201111160946.147341-2-nivedita@alum.mit.edu
The
"Timeout: Not all CPUs entered broadcast exception handler"
message will appear from time to time given enough systems, but this
message does not identify which CPUs failed to enter the broadcast
exception handler. This information would be valuable if available,
for example, in order to correlate with other hardware-oriented error
messages.
Add a cpumask of CPUs which maintains which CPUs have entered this
handler, and print out which ones failed to enter in the event of a
timeout.
[ bp: Massage. ]
Reported-by: Jonathan Lemon <bsd@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210106174102.GA23874@paulmck-ThinkPad-P72
Add a missing __iomem annotation to address a sparse warning. The caller
is expected to pass an __iomem annotated pointer to this function. The
current usages send a 64-bytes command descriptor to an MMIO location
(portal) on a device for consumption.
Also, from the comment in movdir64b(), which also applies to enqcmds(),
@__dst must be supplied as an lvalue because this tells the compiler
what the object is (its size) the instruction accesses. I.e., not the
pointers but what they point to, thus the deref'ing '*'."
The actual sparse warning is:
drivers/dma/idxd/submit.c: note: in included file (through arch/x86/include/asm/processor.h, \
arch/x86/include/asm/timex.h, include/linux/timex.h, include/linux/time32.h, \
include/linux/time.h, include/linux/stat.h, ...):
./arch/x86/include/asm/special_insns.h:289:41: warning: incorrect type in initializer (different address spaces)
./arch/x86/include/asm/special_insns.h:289:41: expected struct <noident> *__dst
./arch/x86/include/asm/special_insns.h:289:41: got void [noderef] __iomem *dst
[ bp: Massage commit message. ]
Fixes: 7f5933f81b ("x86/asm: Add an enqcmds() wrapper for the ENQCMDS instruction")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lkml.kernel.org/r/161003789741.4062451.14362269365703761223.stgit@djiang5-desk3.ch.intel.com
Add a missing __iomem annotation to address a sparse warning. The caller
is expected to pass an __iomem annotated pointer to this function. The
current usages send a 64-bytes command descriptor to an MMIO location
(portal) on a device for consumption. When future usages for the
MOVDIR64B instruction warrant a separate variant of a memory to memory
operation, the argument annotation can be revisited.
Also, from the comment in movdir64b() @__dst must be supplied as an
lvalue because this tells the compiler what the object is (its size) the
instruction accesses. I.e., not the pointers but what they point to,
thus the deref'ing '*'."
The actual sparse warning is:
sparse warnings: (new ones prefixed by >>)
drivers/dma/idxd/submit.c: note: in included file (through include/linux/io.h, include/linux/pci.h):
>> arch/x86/include/asm/io.h:422:27: sparse: sparse: incorrect type in \
argument 1 (different address spaces)
@@ expected void *dst
@@ got void [noderef] __iomem *dst @@
arch/x86/include/asm/io.h:422:27: sparse: expected void *dst
arch/x86/include/asm/io.h:422:27: sparse: got void [noderef] __iomem *dst
[ bp: Massage commit message. ]
Fixes: 0888e1030d ("x86/asm: Carve out a generic movdir64b() helper for general usage")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lkml.kernel.org/r/161003787823.4062451.6564503265464317197.stgit@djiang5-desk3.ch.intel.com
Currently, when moving a task to a resource group the PQR_ASSOC MSR is
updated with the new closid and rmid in an added task callback. If the
task is running, the work is run as soon as possible. If the task is not
running, the work is executed later in the kernel exit path when the
kernel returns to the task again.
Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
is running is the right thing to do. Queueing work for a task that is
not running is unnecessary (the PQR_ASSOC MSR is already updated when
the task is scheduled in) and causing system resource waste with the way
in which it is implemented: Work to update the PQR_ASSOC register is
queued every time the user writes a task id to the "tasks" file, even if
the task already belongs to the resource group.
This could result in multiple pending work items associated with a
single task even if they are all identical and even though only a single
update with most recent values is needed. Specifically, even if a task
is moved between different resource groups while it is sleeping then it
is only the last move that is relevant but yet a work item is queued
during each move.
This unnecessary queueing of work items could result in significant
system resource waste, especially on tasks sleeping for a long time.
For example, as demonstrated by Shakeel Butt in [1] writing the same
task id to the "tasks" file can quickly consume significant memory. The
same problem (wasted system resources) occurs when moving a task between
different resource groups.
As pointed out by Valentin Schneider in [2] there is an additional issue
with the way in which the queueing of work is done in that the task_struct
update is currently done after the work is queued, resulting in a race with
the register update possibly done before the data needed by the update is
available.
To solve these issues, update the PQR_ASSOC MSR in a synchronous way
right after the new closid and rmid are ready during the task movement,
only if the task is running. If a moved task is not running nothing
is done since the PQR_ASSOC MSR will be updated next time the task is
scheduled. This is the same way used to update the register when tasks
are moved as part of resource group removal.
[1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20201123022433.17905-1-valentin.schneider@arm.com
[ bp: Massage commit message and drop the two update_task_closid_rmid()
variants. ]
Fixes: e02737d5b8 ("x86/intel_rdt: Add tasks files")
Reported-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/17aa2fb38fc12ce7bb710106b3e7c7b45acb9e94.1608243147.git.reinette.chatre@intel.com
The AES-NI driver implements XTS via the glue helper, which consumes
a struct with sets of function pointers which are invoked on chunks
of input data of the appropriate size, as annotated in the struct.
Let's get rid of this indirection, so that we can perform direct calls
to the assembler helpers. Instead, let's adopt the arm64 strategy, i.e.,
provide a helper which can consume inputs of any size, provided that the
penultimate, full block is passed via the last call if ciphertext stealing
needs to be applied.
This also allows us to enable the XTS mode for i386.
Tested-by: Eric Biggers <ebiggers@google.com> # x86_64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The XTS asm helper arrangement is a bit odd: the 8-way stride helper
consists of back-to-back calls to the 4-way core transforms, which
are called indirectly, based on a boolean that indicates whether we
are performing encryption or decryption.
Given how costly indirect calls are on x86, let's switch to direct
calls, and given how the 8-way stride doesn't really add anything
substantial, use a 4-way stride instead, and make the asm core
routine deal with any multiple of 4 blocks. Since 512 byte sectors
or 4 KB blocks are the typical quantities XTS operates on, increase
the stride exported to the glue helper to 512 bytes as well.
As a result, the number of indirect calls is reduced from 3 per 64 bytes
of in/output to 1 per 512 bytes of in/output, which produces a 65% speedup
when operating on 1 KB blocks (measured on a Intel(R) Core(TM) i7-8650U CPU)
Fixes: 9697fa39ef ("x86/retpoline/crypto: Convert crypto assembler indirect jumps")
Tested-by: Eric Biggers <ebiggers@google.com> # x86_64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Typically under KVM, an AP is booted using the INIT-SIPI-SIPI sequence,
where the guest vCPU register state is updated and then the vCPU is VMRUN
to begin execution of the AP. For an SEV-ES guest, this won't work because
the guest register state is encrypted.
Following the GHCB specification, the hypervisor must not alter the guest
register state, so KVM must track an AP/vCPU boot. Should the guest want
to park the AP, it must use the AP Reset Hold exit event in place of, for
example, a HLT loop.
First AP boot (first INIT-SIPI-SIPI sequence):
Execute the AP (vCPU) as it was initialized and measured by the SEV-ES
support. It is up to the guest to transfer control of the AP to the
proper location.
Subsequent AP boot:
KVM will expect to receive an AP Reset Hold exit event indicating that
the vCPU is being parked and will require an INIT-SIPI-SIPI sequence to
awaken it. When the AP Reset Hold exit event is received, KVM will place
the vCPU into a simulated HLT mode. Upon receiving the INIT-SIPI-SIPI
sequence, KVM will make the vCPU runnable. It is again up to the guest
to then transfer control of the AP to the proper location.
To differentiate between an actual HLT and an AP Reset Hold, a new MP
state is introduced, KVM_MP_STATE_AP_RESET_HOLD, which the vCPU is
placed in upon receiving the AP Reset Hold exit event. Additionally, to
communicate the AP Reset Hold exit event up to userspace (if needed), a
new exit reason is introduced, KVM_EXIT_AP_RESET_HOLD.
A new x86 ops function is introduced, vcpu_deliver_sipi_vector, in order
to accomplish AP booting. For VMX, vcpu_deliver_sipi_vector is set to the
original SIPI delivery function, kvm_vcpu_deliver_sipi_vector(). SVM adds
a new function that, for non SEV-ES guests, invokes the original SIPI
delivery function, kvm_vcpu_deliver_sipi_vector(), but for SEV-ES guests,
implements the logic above.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e8fbebe8eb161ceaabdad7c01a5859a78b424d5e.1609791600.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is possible to exit the nested guest mode, entered by
svm_set_nested_state prior to first vm entry to it (e.g due to pending event)
if the nested run was not pending during the migration.
In this case we must not switch to the nested msr permission bitmap.
Also add a warning to catch similar cases in the future.
Fixes: a7d5c7ce41 ("KVM: nSVM: delay MSR permission processing to first nested VM run")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210107093854.882483-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We overwrite most of vmcb fields while doing so, so we must
mark it as dirty.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210107093854.882483-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The code to store it on the migration exists, but no code was restoring it.
One of the side effects of fixing this is that L1->L2 injected events
are no longer lost when migration happens with nested run pending.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210107093854.882483-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The tdp_mmu_roots and tdp_mmu_pages in struct kvm_arch should only contain
pages with tdp_mmu_page set to true. tdp_mmu_pages should not contain any
pages with a non-zero root_count and tdp_mmu_roots should only contain
pages with a positive root_count, unless a thread holds the MMU lock and
is in the process of modifying the list. Various functions expect these
invariants to be maintained, but they are not explictily documented. Add
to the comments on both fields to document the above invariants.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210107001935.3732070-2-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Many TDP MMU functions which need to perform some action on all TDP MMU
roots hold a reference on that root so that they can safely drop the MMU
lock in order to yield to other threads. However, when releasing the
reference on the root, there is a bug: the root will not be freed even
if its reference count (root_count) is reduced to 0.
To simplify acquiring and releasing references on TDP MMU root pages, and
to ensure that these roots are properly freed, move the get/put operations
into another TDP MMU root iterator macro.
Moving the get/put operations into an iterator macro also helps
simplify control flow when a root does need to be freed. Note that using
the list_for_each_entry_safe macro would not have been appropriate in
this situation because it could keep a pointer to the next root across
an MMU lock release + reacquire, during which time that root could be
freed.
Reported-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 063afacd87 ("kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU")
Fixes: a6a0b05da9 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Fixes: 1488199856 ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU")
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210107001935.3732070-1-bgardon@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stephen Zhang <stephenzhangzsd@gmail.com>
Message-Id: <1608277897-1932-1-git-send-email-stephenzhangzsd@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since we know that e >= s, we can reassociate the left shift,
changing the shifted number from 1 to 2 in exchange for
decreasing the right hand side by 1.
Reported-by: syzbot+e87846c48bf72bc85311@syzkaller.appspotmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 16809ecdc1 moved __svm_vcpu_run the prototype to svm.h,
but forgot to remove the original from svm.c.
Fixes: 16809ecdc1 ("KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201220200339.65115-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using LLVM's integrated assembler (LLVM_IAS=1) while building
x86_64_defconfig + CONFIG_KVM=y + CONFIG_KVM_AMD=y, the following build
error occurs:
$ make LLVM=1 LLVM_IAS=1 arch/x86/kvm/svm/sev.o
arch/x86/kvm/svm/sev.c:2004:15: error: too few operands for instruction
asm volatile(__ex("vmsave") : : "a" (__sme_page_pa(sd->save_area)) : "memory");
^
arch/x86/kvm/svm/sev.c:28:17: note: expanded from macro '__ex'
#define __ex(x) __kvm_handle_fault_on_reboot(x)
^
./arch/x86/include/asm/kvm_host.h:1646:10: note: expanded from macro '__kvm_handle_fault_on_reboot'
"666: \n\t" \
^
<inline asm>:2:2: note: instantiated into assembly here
vmsave
^
1 error generated.
This happens because LLVM currently does not support calling vmsave
without the fixed register operand (%rax for 64-bit and %eax for
32-bit). This will be fixed in LLVM 12 but the kernel currently supports
LLVM 10.0.1 and newer so this needs to be handled.
Add the proper register using the _ASM_AX macro, which matches the
vmsave call in vmenter.S.
Fixes: 861377730a ("KVM: SVM: Provide support for SEV-ES vCPU loading")
Link: https://reviews.llvm.org/D93524
Link: https://github.com/ClangBuiltLinux/linux/issues/1216
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Message-Id: <20201219063711.3526947-1-natechancellor@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check only the terminal leaf for a "!PRESENT || MMIO" SPTE when looking
for reserved bits on valid, non-MMIO SPTEs. The get_walk() helpers
terminate their walks if a not-present or MMIO SPTE is encountered, i.e.
the non-terminal SPTEs have already been verified to be regular SPTEs.
This eliminates an extra check-and-branch in a relatively hot loop.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201218003139.2167891-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bump the size of the sptes array by one and use the raw level of the
SPTE to index into the sptes array. Using the SPTE level directly
improves readability by eliminating the need to reason out why the level
is being adjusted when indexing the array. The array is on the stack
and is not explicitly initialized; bumping its size is nothing more than
a superficial adjustment to the stack frame.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201218003139.2167891-4-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Get the so called "root" level from the low level shadow page table
walkers instead of manually attempting to calculate it higher up the
stack, e.g. in get_mmio_spte(). When KVM is using PAE shadow paging,
the starting level of the walk, from the callers perspective, is not
the CR3 root but rather the PDPTR "root". Checking for reserved bits
from the CR3 root causes get_mmio_spte() to consume uninitialized stack
data due to indexing into sptes[] for a level that was not filled by
get_walk(). This can result in false positives and/or negatives
depending on what garbage happens to be on the stack.
Opportunistically nuke a few extra newlines.
Fixes: 95fb5b0258 ("kvm: x86/mmu: Support MMIO in the TDP MMU")
Reported-by: Richard Herbert <rherbert@sympatico.ca>
Cc: Ben Gardon <bgardon@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201218003139.2167891-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return -1 from the get_walk() helpers if the shadow walk doesn't fill at
least one spte, which can theoretically happen if the walk hits a
not-present PDPTR. Returning the root level in such a case will cause
get_mmio_spte() to return garbage (uninitialized stack data). In
practice, such a scenario should be impossible as KVM shouldn't get a
reserved-bit page fault with a not-present PDPTR.
Note, using mmu->root_level in get_walk() is wrong for other reasons,
too, but that's now a moot point.
Fixes: 95fb5b0258 ("kvm: x86/mmu: Support MMIO in the TDP MMU")
Cc: Ben Gardon <bgardon@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201218003139.2167891-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The Android Studio Emulator (aka goldfish) does not use
arch/x86/platform/goldfish since 5.4 kernel.
Signed-off-by: Roman Kiryanov <rkir@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201117025522.1874864-1-rkir@google.com
with mips converted to use of fs/config_binfmt_elf.c, there's no
need to keep selects of that thing all over arch/* - we can simply
turn into def_bool y if COMPAT && BINFMT_ELF (in fs/Kconfig.binfmt)
and get rid of all selects.
Several architectures got those selects wrong (e.g. you could
end up with sparc64 sans BINFMT_ELF, with select violating
dependencies, etc.)
Randy Dunlap has spotted some of those; IMO this is simpler than
his fix, but it depends upon the stuff that would need to be
backported, so we might end up using his variant for -stable.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's really trivial - the only wrinkle is making sure that
compiler knows that ia32-related side of COMPAT_ARCH_DLINFO
is dead code on such configs (we don't get there without
having passed compat_elf_check_arch(), and on such configs
that'll fail for ia32 binary).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
To get rid of hardcoded size/offset in those macros we need to have
a definition of i386 variant of struct elf_prstatus. However, we can't
do that in asm/compat.h - the types needed for that are not there and
adding an include of asm/user32.h into asm/compat.h would cause a lot
of mess.
That could be conveniently done in elfcore-compat.h, but currently there
is nowhere to put arch-dependent parts of it - no asm/elfcore-compat.h.
So we introduce a new file (asm/elfcore-compat.h, present on architectures
that have CONFIG_ARCH_HAS_ELFCORE_COMPAT set, currently only on x86),
have it pulled by linux/elfcore-compat.h and move the definitions there.
As a side benefit, we don't need to worry about accidental inclusion of
that file into binfmt_elf.c itself, so we don't need the dance with
COMPAT_PRSTATUS_SIZE, etc. - only fs/compat_binfmt_elf.c will see
that header.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In mtrr_type_lookup(), if the input memory address region is not in the
MTRR, over 4GB, and not over the top of memory, a write-back attribute
is returned. These condition checks are for ensuring the input memory
address region is actually mapped to the physical memory.
However, if the end address is just aligned with the top of memory,
the condition check treats the address is over the top of memory, and
write-back attribute is not returned.
And this hits in a real use case with NVDIMM: the nd_pmem module tries
to map NVDIMMs as cacheable memories when NVDIMMs are connected. If a
NVDIMM is the last of the DIMMs, the performance of this NVDIMM becomes
very low since it is aligned with the top of memory and its memory type
is uncached-minus.
Move the input end address change to inclusive up into
mtrr_type_lookup(), before checking for the top of memory in either
mtrr_type_lookup_{variable,fixed}() helpers.
[ bp: Massage commit message. ]
Fixes: 0cc705f56e ("x86/mm/mtrr: Clean up mtrr_type_lookup()")
Signed-off-by: Ying-Tsun Huang <ying-tsun.huang@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201215070721.4349-1-ying-tsun.huang@amd.com
We've observed crashes due to an empty cpu mask in
hyperv_flush_tlb_others. Obviously the cpu mask in question is changed
between the cpumask_empty call at the beginning of the function and when
it is actually used later.
One theory is that an interrupt comes in between and a code path ends up
changing the mask. Move the check after interrupt has been disabled to
see if it fixes the issue.
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20210105175043.28325-1-wei.liu@kernel.org
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
It is unnecessary to use the local variable 'error' in the mmap syscall
implementation function - just return -EINVAL directly and get rid of
the local variable altogether.
[ bp: Massage commit message. ]
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20201217052648.24656-1-adrianhuang0701@gmail.com
The IN and OUT instructions with port address as an immediate operand
only use an 8-bit immediate (imm8). The current VC handler uses the
entire 32-bit immediate value but these instructions only set the first
bytes.
Cast the operand to an u8 for that.
[ bp: Massage commit message. ]
Fixes: 25189d08e5 ("x86/sev-es: Add support for handling IOIO exceptions")
Signed-off-by: Peter Gonda <pgonda@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Link: https://lkml.kernel.org/r/20210105163311.221490-1-pgonda@google.com
Currently the kexec kernel can panic or hang due to 2 causes:
1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the
old VP Assist Pages when the kexec kernel runs. The same issue is fixed
for hibernation in commit 421f090c81 ("x86/hyperv: Suspend/resume the
VP assist page for hibernation"). Now fix it for kexec.
2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs
are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so
between hv_kexec_handler() and native_machine_shutdown(), the other CPUs
can still try to access the hypercall page and cause panic. The workaround
"hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliabe. Move
hyperv_cleanup() to a better place.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20201222065541.24312-1-decui@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Currently, kprobes decodes the opcode right after single-stepping in
resume_execution(). But the opcode was already decoded while preparing
arch_specific_insn in arch_copy_kprobe().
Decode the opcode in arch_copy_kprobe() instead of in resume_execution()
and set some flags which classify the opcode for the resuming process.
[ bp: Massage commit message. ]
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/160830072561.349576.3014979564448023213.stgit@devnote2
Commit
28ee90fe60 ("x86/mm: implement free pmd/pte page interfaces")
introduced a new location where a pmd was released, but neglected to
run the pmd page destructor. In fact, this happened previously for a
different pmd release path and was fixed by commit:
c283610e44 ("x86, mm: do not leak page->ptl for pmd page tables").
This issue was hidden until recently because the failure mode is silent,
but commit:
b2b29d6d01 ("mm: account PMD tables like PTE tables")
turns the failure mode into this signature:
BUG: Bad page state in process lt-pmem-ns pfn:15943d
page:000000007262ed7b refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0x15943d
flags: 0xaffff800000000()
raw: 00affff800000000 dead000000000100 0000000000000000 0000000000000000
raw: 0000000000000000 ffff913a029bcc08 00000000fffffbff 0000000000000000
page dumped because: nonzero mapcount
[..]
dump_stack+0x8b/0xb0
bad_page.cold+0x63/0x94
free_pcp_prepare+0x224/0x270
free_unref_page+0x18/0xd0
pud_free_pmd_page+0x146/0x160
ioremap_pud_range+0xe3/0x350
ioremap_page_range+0x108/0x160
__ioremap_caller.constprop.0+0x174/0x2b0
? memremap+0x7a/0x110
memremap+0x7a/0x110
devm_memremap+0x53/0xa0
pmem_attach_disk+0x4ed/0x530 [nd_pmem]
? __devm_release_region+0x52/0x80
nvdimm_bus_probe+0x85/0x210 [libnvdimm]
Given this is a repeat occurrence it seemed prudent to look for other
places where this destructor might be missing and whether a better
helper is needed. try_to_free_pmd_page() looks like a candidate, but
testing with setting up and tearing down pmd mappings via the dax unit
tests is thus far not triggering the failure.
As for a better helper pmd_free() is close, but it is a messy fit
due to requiring an @mm arg. Also, ___pmd_free_tlb() wants to call
paravirt_tlb_remove_table() instead of free_page(), so open-coded
pgtable_pmd_page_dtor() seems the best way forward for now.
Debugged together with Matthew Wilcox <willy@infradead.org>.
Fixes: 28ee90fe60 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/160697689204.605323.17629854984697045602.stgit@dwillia2-desk3.amr.corp.intel.com
On 64bit architectures that support 32bit processes there are
two possible layouts for NT_PRSTATUS note in ELF coredumps.
For one thing, several fields are 64bit for native processes
and 32bit for compat ones (pr_sigpend, etc.). For another,
the register dump is obviously different - the size and number
of registers are not going to be the same for 32bit and 64bit
variants of processor.
Usually that's handled by having two structures - elf_prstatus
for native layout and compat_elf_prstatus for 32bit one.
32bit processes are handled by fs/compat_binfmt_elf.c, which
defines a macro called 'elf_prstatus' that expands to compat_elf_prstatus.
Then it includes fs/binfmt_elf.c, which makes all references to
struct elf_prstatus to be textually replaced with struct
compat_elf_prstatus. Ugly and somewhat brittle, but it works.
However, amd64 is worse - there are _three_ possible layouts.
One for native 64bit processes, another for i386 (32bit) processes
and yet another for x32 (32bit address space with full 64bit
registers).
Both i386 and x32 processes are handled by fs/compat_binfmt_elf.c,
with usual compat_binfmt_elf.c trickery. However, the layouts
for i386 and x32 are not identical - they have the common beginning,
but the register dump part (pr_reg) is bigger on x32. Worse, pr_reg
is not the last field - it's followed by int pr_fpvalid, so that
field ends up at different offsets for i386 and x32 layouts.
Fortunately, there's not much code that cares about any of that -
it's all encapsulated in fill_thread_core_info(). Since x32
variant is bigger, we define compat_elf_prstatus to match that
layout. That way i386 processes have enough space to fit
their layout into.
Moreover, since these layouts are identical prior to pr_reg,
we don't need to distinguish x32 and i386 cases when we are
setting the fields prior to pr_reg.
Filling pr_reg itself is done by calling ->get() method of
appropriate regset, and that method knows what layout (and size)
to use.
We do need to distinguish x32 and i386 cases only for two
things: setting ->pr_fpvalid (offset differs for x32 and
i386) and choosing the right size for our note.
The way it's done is Not Nice, for the lack of more accurate
printable description. There are two macros (PRSTATUS_SIZE and
SET_PR_FPVALID), that default essentially to sizeof(struct elf_prstatus)
and (S)->pr_fpvalid = 1. On x86 asm/compat.h provides its own
variants.
Unfortunately, quite a few things go wrong there:
* PRSTATUS_SIZE doesn't use the normal test for process
being an x32 one; it compares the size reported by regset with
the size of pr_reg.
* it hardcodes the sizes of x32 and i386 variants (296 and 144
resp.), so if some change in includes leads to asm/compat.h pulled
in by fs/binfmt_elf.c we are in trouble - it will end up using
the size of x32 variant for 64bit processes.
* it's in the wrong place; asm/compat.h couldn't define
the structure for i386 layout, since it lacks quite a few types
needed for it. Hardcoded sizes are largely due to that.
The proper fix would be to have an explicitly defined i386 variant
of structure and have PRSTATUS_SIZE/SET_PR_FPVALID check for
TIF_X32 to choose the variant that should be used. Unfortunately,
that requires some manipulations of headers; we'll do that later
in the series, but for now let's go with the minimal variant -
rename PRSTATUS_SIZE in asm/compat.h to COMPAT_PRSTATUS_SIZE,
have fs/compat_binfmt_elf.c define PRSTATUS_SIZE to COMPAT_PRSTATUS_SIZE
and use the normal TIF_X32 check in that macro. The size of i386 variant
is kept hardcoded for now. Similar story for SET_PR_FPVALID.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pgt_buf is used to allocate page tables on initial direct page mapping
which bootstraps the kernel into being able to allocate these before the
direct mapping makes further pages available.
INIT_PGD_PAGE_COUNT is set to 6 pages (doubled for KASLR) - 3 (PUD, PMD,
PTE) for the 1 MiB ISA mapping and 3 more for the first direct mapping
assignment in each case providing 2 MiB of address space.
This has not been updated for 5-level page tables which has an
additional P4D page table level above PUD.
In most instances, this will not have a material impact as the first
4 page levels allocated for the ISA mapping will provide sufficient
address space to encompass all further address mappings.
If the first direct mapping is within 512 GiB of the ISA mapping, only
a PMD and PTE needs to be added in the instance the kernel is using 4
KiB page tables (e.g. CONFIG_DEBUG_PAGEALLOC is enabled) and only a PMD
if the kernel can use 2 MiB pages (the first allocation is limited to
PMD_SIZE so a GiB page cannot be used there).
However, if the machine has more than 512 GiB of RAM and the kernel is
allocating 4 KiB page size, 3 further page tables are required.
If the machine has more than 256 TiB of RAM at 4 KiB or 2 MiB page size,
further 3 or 4 page tables are required respectively.
Update INIT_PGD_PAGE_COUNT to reflect this.
[ bp: Sanitize text into passive voice without ambiguous personal pronouns. ]
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/20201215205641.34096-1-lstoakes@gmail.com
Add helper functions for shash implementations of BLAKE2s to
include/crypto/internal/blake2s.h, taking advantage of
__blake2s_update() and __blake2s_final() that were added by the previous
patch to share more code between the library and shash implementations.
crypto_blake2s_setkey() and crypto_blake2s_init() are usable as
shash_alg::setkey and shash_alg::init directly, while
crypto_blake2s_update() and crypto_blake2s_final() take an extra
'blake2s_compress_t' function pointer parameter. This allows the
implementation of the compression function to be overridden, which is
the only part that optimized implementations really care about.
The new functions are inline functions (similar to those in sha1_base.h,
sha256_base.h, and sm3_base.h) because this avoids needing to add a new
module blake2s_helpers.ko, they aren't *too* long, and this avoids
indirect calls which are expensive these days. Note that they can't go
in blake2s_generic.ko, as that would require selecting CRYPTO_BLAKE2S
from CRYPTO_BLAKE2S_X86, which would cause a recursive dependency.
Finally, use these new helper functions in the x86 implementation of
BLAKE2s. (This part should be a separate patch, but unfortunately the
x86 implementation used the exact same function names like
"crypto_blake2s_update()", so it had to be updated at the same time.)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The shash_alg structs for the four variants of BLAKE2s are identical
except for the algorithm name, driver name, and digest size. So, avoid
code duplication by using a macro to define these structs.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Follow the same approach as the arm64 driver for implementing a version
of AES-NI in CBC mode that supports ciphertext stealing. This results in
a ~2x speed increase for relatively short inputs (less than 256 bytes),
which is relevant given that AES-CBC with ciphertext stealing is used
for filename encryption in the fscrypt layer. For larger inputs, the
speedup is still significant (~25% on decryption, ~6% on encryption)
Tested-by: Eric Biggers <ebiggers@google.com> # x86_64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
On this system the M.2 PCIe WiFi card isn't detected after reboot, only
after cold boot. reboot=pci fixes this behavior. In [0] the same issue
is described, although on another system and with another Intel WiFi
card. In case it's relevant, both systems have Celeron CPUs.
Add a PCI reboot quirk on affected systems until a more generic fix is
available.
[0] https://bugzilla.kernel.org/show_bug.cgi?id=202399
[ bp: Massage commit message. ]
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1524eafd-f89c-cfa4-ed70-0bde9e45eec9@gmail.com
Replace a comma between expression statements with a semicolon.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lkml.kernel.org/r/20201216131107.14339-1-zhengyongjun3@huawei.com
Make <asm-generic/local64.h> mandatory in include/asm-generic/Kbuild and
remove all arch/*/include/asm/local64.h arch-specific files since they
only #include <asm-generic/local64.h>.
This fixes build errors on arch/c6x/ and arch/nios2/ for
block/blk-iocost.c.
Build-tested on 21 of 25 arch-es. (tools problems on the others)
Yes, we could even rename <asm-generic/local64.h> to
<linux/local64.h> and change all #includes to use
<linux/local64.h> instead.
Link: https://lkml.kernel.org/r/20201227024446.17018-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the targets which add additional items to the .config which
facilitate running the kernel as a guest, to the 'make help' output so
that they can be found easier and there's no need to grep the tree each
time to remember what they should be called.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201217134608.31811-1-bp@alien8.de
Commit
121b32a58a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")
converted native x86-32 which take 64-bit arguments to use the
compat handlers to allow conversion to passing args via pt_regs.
sys_fanotify_mark() was however missed, as it has a general compat
handler. Add a config option that will use the syscall wrapper that
takes the split args for native 32-bit.
[ bp: Fix typo in Kconfig help text. ]
Fixes: 121b32a58a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")
Reported-by: Paweł Jasiak <pawel@jasiak.xyz>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20201130223059.101286-1-brgerst@gmail.com
accesses, inefficient and disfunctional code. The goal is to remove the
export of irq_to_desc() to prevent these things from creeping up again.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/ifgsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYm6EACAo8sObkuY3oWLagtGj1KHxon53oGZ
VfDw2LYKM+rgJjDWdiyocyxQU5gtm6loWCrIHjH2adRQ4EisB5r8hfI8NZHxNMyq
8khUi822NRBfFN6SCpO8eW9o95euscNQwCzqi7gV9/U/BAKoDoSEYzS4y0YmJlup
mhoikkrFiBuFXplWI0gbP4ihb8S/to2+kTL6o7eBoJY9+fSXIFR3erZ6f3fLjYZG
CQUUysTywdDhLeDkC9vaesXwgdl2XnaPRwcQqmK8Ez0QYNYpawyILUHLD75cIHDu
bHdK2ZoDv/wtad/3BoGTK3+wChz20a/4/IAnBIUVgmnSLsPtW8zNEOPWNNc0aGg+
rtafi5bvJ1lMoSZhkjLWQDOGU6vFaXl9NkC2fpF+dg1skFMT2CyLC8LD/ekmocon
zHAPBva9j3m2A80hI3dUH9azo/IOl1GHG8ccM6SCxY3S/9vWSQChNhQDLe25xBEO
VtKZS7DYFCRiL8mIy9GgwZWof8Vy2iMua2ML+W9a3mC9u3CqSLbCFmLMT/dDoXl1
oHnMdAHk1DRatA8pJAz83C75RxbAS2riGEqtqLEQ6OaNXn6h0oXCanJX9jdKYDBh
z6ijWayPSRMVktN6FDINsVNFe95N4GwYcGPfagIMqyMMhmJDic6apEzEo7iA76lk
cko28MDqTIK4UQ==
=BXv+
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"This is the second attempt after the first one failed miserably and
got zapped to unblock the rest of the interrupt related patches.
A treewide cleanup of interrupt descriptor (ab)use with all sorts of
racy accesses, inefficient and disfunctional code. The goal is to
remove the export of irq_to_desc() to prevent these things from
creeping up again"
* tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
genirq: Restrict export of irq_to_desc()
xen/events: Implement irq distribution
xen/events: Reduce irq_info:: Spurious_cnt storage size
xen/events: Only force affinity mask for percpu interrupts
xen/events: Use immediate affinity setting
xen/events: Remove disfunct affinity spreading
xen/events: Remove unused bind_evtchn_to_irq_lateeoi()
net/mlx5: Use effective interrupt affinity
net/mlx5: Replace irq_to_desc() abuse
net/mlx4: Use effective interrupt affinity
net/mlx4: Replace irq_to_desc() abuse
PCI: mobiveil: Use irq_data_get_irq_chip_data()
PCI: xilinx-nwl: Use irq_data_get_irq_chip_data()
NTB/msi: Use irq_has_action()
mfd: ab8500-debugfs: Remove the racy fiddling with irq_desc
pinctrl: nomadik: Use irq_has_action()
drm/i915/pmu: Replace open coded kstat_irqs() copy
drm/i915/lpe_audio: Remove pointless irq_to_desc() usage
s390/irq: Use irq_desc_kstat_cpu() in show_msi_interrupt()
parisc/irq: Use irq_desc_kstat_cpu() in show_interrupts()
...
- Don't move BSS section around pointlessly in the x86 decompressor
- Refactor helper for discovering the EFI secure boot mode
- Wire up EFI secure boot to IMA for arm64
- Some fixes for the capsule loader
- Expose the RT_PROP table via the EFI test module
- Relax DT and kernel placement restrictions on ARM
+ followup fixes:
- fix the build breakage on IA64 caused by recent capsule loader changes
- suppress a type mismatch build warning in the expansion of
EFI_PHYS_ALIGN on ARM
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/kWCMACgkQEsHwGGHe
VUqVlxAAg3jSS5w5fuaXON2xYZmgKdlRB0fjbklo1ZrWS6sEHrP+gmVmrJSWGZP+
qFleQ6AxaYK57UiBXxS6Xfn7hHRToqdOAGnaSYzIg1aQIofRoLxvm3YHBMKllb+g
x73IBS/Hu9/kiH8EVDrJSkBpVdbPwDnw+FeW4ZWUMF9GVmV8oA6Zx23BVSVsbFda
jat/cEsJQS3GfECJ/Fg5ae+c/2zn5NgbaVtLxVnMnJfAwEpoPz3ogKoANSskdZg3
z6pA1aMFoHr+lnlzcsM5zdboQlwZRKPHvFpsXPexESBy5dPkYhxFnHqgK4hSZglC
c3QoO9Gn+KOJl4KAKJWNzCrd3G9kKY5RXkoei4bH9wGMjW2c68WrbFyXgNsO3vYR
v5CKpq3+jlwGo03GiLJgWQFdgqX0EgTVHHPTcwFpt8qAMi9/JIPSIeTE41p2+AjZ
cW5F0IlikaR+N8vxc2TDvQTuSsroMiLcocvRWR61oV/48pFlEjqiUjV31myDsASg
gGkOxZOOz2iBJfK8lCrKp5p9JwGp0M0/GSHTxlYQFy+p4SrcOiPX4wYYdLsWxioK
AbVhvOClgB3kN7y7TpLvdjND00ciy4nKEC0QZ5p5G59jSLnpSBM/g6av24LsSQwo
S1HJKhQPbzcI1lhaPjo91HQoOOMZHWLes0SqK4FGNIH+0imHliA=
=n7Gc
-----END PGP SIGNATURE-----
Merge tag 'efi_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Borislav Petkov:
"These got delayed due to a last minute ia64 build issue which got
fixed in the meantime.
EFI updates collected by Ard Biesheuvel:
- Don't move BSS section around pointlessly in the x86 decompressor
- Refactor helper for discovering the EFI secure boot mode
- Wire up EFI secure boot to IMA for arm64
- Some fixes for the capsule loader
- Expose the RT_PROP table via the EFI test module
- Relax DT and kernel placement restrictions on ARM
with a few followup fixes:
- fix the build breakage on IA64 caused by recent capsule loader
changes
- suppress a type mismatch build warning in the expansion of
EFI_PHYS_ALIGN on ARM"
* tag 'efi_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: arm: force use of unsigned type for EFI_PHYS_ALIGN
efi: ia64: disable the capsule loader
efi: stub: get rid of efi_get_max_fdt_addr()
efi/efi_test: read RuntimeServicesSupported
efi: arm: reduce minimum alignment of uncompressed kernel
efi: capsule: clean scatter-gather entries from the D-cache
efi: capsule: use atomic kmap for transient sglist mappings
efi: x86/xen: switch to efi_get_secureboot_mode helper
arm64/ima: add ima_arch support
ima: generalize x86/EFI arch glue for other EFI architectures
efi: generalize efi_get_secureboot
efi/libstub: EFI_GENERIC_STUB_INITRD_CMDLINE_LOADER should not default to yes
efi/x86: Only copy the compressed kernel image in efi_relocate_kernel()
efi/libstub/x86: simplify efi_is_native()
Merge KASAN updates from Andrew Morton.
This adds a new hardware tag-based mode to KASAN. The new mode is
similar to the existing software tag-based KASAN, but relies on arm64
Memory Tagging Extension (MTE) to perform memory and pointer tagging
(instead of shadow memory and compiler instrumentation).
By Andrey Konovalov and Vincenzo Frascino.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (60 commits)
kasan: update documentation
kasan, mm: allow cache merging with no metadata
kasan: sanitize objects when metadata doesn't fit
kasan: clarify comment in __kasan_kfree_large
kasan: simplify assign_tag and set_tag calls
kasan: don't round_up too much
kasan, mm: rename kasan_poison_kfree
kasan, mm: check kasan_enabled in annotations
kasan: add and integrate kasan boot parameters
kasan: inline (un)poison_range and check_invalid_free
kasan: open-code kasan_unpoison_slab
kasan: inline random_tag for HW_TAGS
kasan: inline kasan_reset_tag for tag-based modes
kasan: remove __kasan_unpoison_stack
kasan: allow VMAP_STACK for HW_TAGS mode
kasan, arm64: unpoison stack only with CONFIG_KASAN_STACK
kasan: introduce set_alloc_info
kasan: rename get_alloc/free_info
kasan: simplify quarantine_put call site
kselftest/arm64: check GCR_EL1 after context switch
...
When a split lock is detected always make sure to disable interrupts
before returning from the trap handler.
The kernel exit code assumes that all exits run with interrupts
disabled, otherwise the SWAPGS sequence can race against interrupts and
cause recursing page faults and later panics.
The problem will only happen on CPUs with split lock disable
functionality, so Icelake Server, Tiger Lake, Snow Ridge, Jacobsville.
Fixes: ca4c6a9858 ("x86/traps: Make interrupt enable/disable symmetric in C code")
Fixes: bce9b042ec ("x86/traps: Disable interrupts in exc_aligment_check()") # v5.8+
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a config option CONFIG_KASAN_STACK that has to be enabled for
KASAN to use stack instrumentation and perform validity checks for
stack variables.
There's no need to unpoison stack when CONFIG_KASAN_STACK is not enabled.
Only call kasan_unpoison_task_stack[_below]() when CONFIG_KASAN_STACK is
enabled.
Note, that CONFIG_KASAN_STACK is an option that is currently always
defined when CONFIG_KASAN is enabled, and therefore has to be tested
with #if instead of #ifdef.
Link: https://lkml.kernel.org/r/d09dd3f8abb388da397fd11598c5edeaa83fe559.1606162397.git.andreyknvl@google.com
Link: https://linux-review.googlesource.com/id/If8a891e9fe01ea543e00b576852685afec0887e3
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the intoduction of hardware tag-based KASAN some kernel checks of
this kind:
ifdef CONFIG_KASAN
will be updated to:
if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
x86 and s390 use a trick to #undef CONFIG_KASAN for some of the code
that isn't linked with KASAN runtime and shouldn't have any KASAN
annotations.
Also #undef CONFIG_KASAN_GENERIC with CONFIG_KASAN.
Link: https://lkml.kernel.org/r/9d84bfaaf8fabe0fc89f913c9e420a30bd31a260.1606161801.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* PSCI relay at EL2 when "protected KVM" is enabled
* New exception injection code
* Simplification of AArch32 system register handling
* Fix PMU accesses when no PMU is enabled
* Expose CSV3 on non-Meltdown hosts
* Cache hierarchy discovery fixes
* PV steal-time cleanups
* Allow function pointers at EL2
* Various host EL2 entry cleanups
* Simplification of the EL2 vector allocation
s390:
* memcg accouting for s390 specific parts of kvm and gmap
* selftest for diag318
* new kvm_stat for when async_pf falls back to sync
x86:
* Tracepoints for the new pagetable code from 5.10
* Catch VFIO and KVM irqfd events before userspace
* Reporting dirty pages to userspace with a ring buffer
* SEV-ES host support
* Nested VMX support for wait-for-SIPI activity state
* New feature flag (AVX512 FP16)
* New system ioctl to report Hyper-V-compatible paravirtualization features
Generic:
* Selftest improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/bdL4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNgQQgAnTH6rhXa++Zd5F0EM2NwXwz3iEGb
lOq1DZSGjs6Eekjn8AnrWbmVQr+CBCuGU9MrxpSSzNDK/awryo3NwepOWAZw9eqk
BBCVwGBbJQx5YrdgkGC0pDq2sNzcpW/VVB3vFsmOxd9eHblnuKSIxEsCCXTtyqIt
XrLpQ1UhvI4yu102fDNhuFw2EfpzXm+K0Lc0x6idSkdM/p7SyeOxiv8hD4aMr6+G
bGUQuMl4edKZFOWFigzr8NovQAvDHZGrwfihu2cLRYKLhV97QuWVmafv/yYfXcz2
drr+wQCDNzDOXyANnssmviazrhOX0QmTAhbIXGGX/kTxYKcfPi83ZLoI3A==
=ISud
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"Much x86 work was pushed out to 5.12, but ARM more than made up for it.
ARM:
- PSCI relay at EL2 when "protected KVM" is enabled
- New exception injection code
- Simplification of AArch32 system register handling
- Fix PMU accesses when no PMU is enabled
- Expose CSV3 on non-Meltdown hosts
- Cache hierarchy discovery fixes
- PV steal-time cleanups
- Allow function pointers at EL2
- Various host EL2 entry cleanups
- Simplification of the EL2 vector allocation
s390:
- memcg accouting for s390 specific parts of kvm and gmap
- selftest for diag318
- new kvm_stat for when async_pf falls back to sync
x86:
- Tracepoints for the new pagetable code from 5.10
- Catch VFIO and KVM irqfd events before userspace
- Reporting dirty pages to userspace with a ring buffer
- SEV-ES host support
- Nested VMX support for wait-for-SIPI activity state
- New feature flag (AVX512 FP16)
- New system ioctl to report Hyper-V-compatible paravirtualization features
Generic:
- Selftest improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
KVM: SVM: fix 32-bit compilation
KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
KVM: SVM: Provide support to launch and run an SEV-ES guest
KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
KVM: SVM: Provide support for SEV-ES vCPU loading
KVM: SVM: Provide support for SEV-ES vCPU creation/loading
KVM: SVM: Update ASID allocation to support SEV-ES guests
KVM: SVM: Set the encryption mask for the SVM host save area
KVM: SVM: Add NMI support for an SEV-ES guest
KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
KVM: SVM: Do not report support for SMM for an SEV-ES guest
KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
KVM: SVM: Add support for EFER write traps for an SEV-ES guest
KVM: SVM: Support string IO operations for an SEV-ES guest
KVM: SVM: Support MMIO for an SEV-ES guest
KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
KVM: SVM: Create trace events for VMGEXIT processing
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCX92dmQAKCRCAXGG7T9hj
vqfiAQCWM8eVftAiTxuQGzdYHYBl5p57UtOc3n0Ap8cHbl9SOwEAqlLmQEopnbHT
P4WzH/PegsCU5BopSb+NgXkBesMtPgk=
=Yjsy
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.11-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
"Some minor cleanup patches and a small series disentangling some Xen
related Kconfig options"
* tag 'for-linus-5.11-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: Kconfig: remove X86_64 depends from XEN_512GB
xen/manage: Fix fall-through warnings for Clang
xen-blkfront: Fix fall-through warnings for Clang
xen: remove trailing semicolon in macro definition
xen: Kconfig: nest Xen guest options
xen: Remove Xen PVH/PVHVM dependency on PCI
x86/xen: Convert to DEFINE_SHOW_ATTRIBUTE
Split off from prev patch in the series that implements the syscall.
Link: https://lkml.kernel.org/r/20201121144401.3727659-4-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A VCPU of a VM can allocate couple of pages which can be mmap'ed by the
user space application. At the moment this memory is not charged to the
memcg of the VMM. On a large machine running large number of VMs or
small number of VMs having large number of VCPUs, this unaccounted
memory can be very significant. So, charge this memory to the memcg of
the VMM. Please note that lifetime of these allocations corresponds to
the lifetime of the VMM.
Link: https://lkml.kernel.org/r/20201106202923.2087414-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The major update to this release is that there's a new arch config option called:
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS. Currently, only x86_64 enables it.
All the ftrace callbacks now take a struct ftrace_regs instead of a struct
pt_regs. If the architecture has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then
the ftrace_regs will have enough information to read the arguments of the
function being traced, as well as access to the stack pointer. This way, if
a user (like live kernel patching) only cares about the arguments, then it
can avoid using the heavier weight "regs" callback, that puts in enough
information in the struct ftrace_regs to simulate a breakpoint exception
(needed for kprobes).
New config option that audits the timestamps of the ftrace ring buffer at
most every event recorded. The "check_buffer()" calls will conflict with
mainline, because I purposely added the check without including the fix that
it caught, which is in mainline. Running a kernel built from the commit of
the added check will trigger it.
Ftrace recursion protection has been cleaned up to move the protection to
the callback itself (this saves on an extra function call for those
callbacks).
Perf now handles its own RCU protection and does not depend on ftrace to do
it for it (saving on that extra function call).
New debug option to add "recursed_functions" file to tracefs that lists all
the places that triggered the recursion protection of the function tracer.
This will show where things need to be fixed as recursion slows down the
function tracer.
The eval enum mapping updates done at boot up are now offloaded to a work
queue, as it caused a noticeable pause on slow embedded boards.
Various clean ups and last minute fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX9uq8xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtrwAQCHevqWMjKc1Q76bnCgwB0AbFKB6vqy
5b6g/co5+ihv8wD/eJPWlZMAt97zTVW7bdp5qj/GTiCDbAsODMZ597LsxA0=
=rZEz
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The major update to this release is that there's a new arch config
option called CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS.
Currently, only x86_64 enables it. All the ftrace callbacks now take a
struct ftrace_regs instead of a struct pt_regs. If the architecture
has HAVE_DYNAMIC_FTRACE_WITH_ARGS enabled, then the ftrace_regs will
have enough information to read the arguments of the function being
traced, as well as access to the stack pointer.
This way, if a user (like live kernel patching) only cares about the
arguments, then it can avoid using the heavier weight "regs" callback,
that puts in enough information in the struct ftrace_regs to simulate
a breakpoint exception (needed for kprobes).
A new config option that audits the timestamps of the ftrace ring
buffer at most every event recorded.
Ftrace recursion protection has been cleaned up to move the protection
to the callback itself (this saves on an extra function call for those
callbacks).
Perf now handles its own RCU protection and does not depend on ftrace
to do it for it (saving on that extra function call).
New debug option to add "recursed_functions" file to tracefs that
lists all the places that triggered the recursion protection of the
function tracer. This will show where things need to be fixed as
recursion slows down the function tracer.
The eval enum mapping updates done at boot up are now offloaded to a
work queue, as it caused a noticeable pause on slow embedded boards.
Various clean ups and last minute fixes"
* tag 'trace-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
tracing: Offload eval map updates to a work queue
Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"
ring-buffer: Add rb_check_bpage in __rb_allocate_pages
ring-buffer: Fix two typos in comments
tracing: Drop unneeded assignment in ring_buffer_resize()
tracing: Disable ftrace selftests when any tracer is running
seq_buf: Avoid type mismatch for seq_buf_init
ring-buffer: Fix a typo in function description
ring-buffer: Remove obsolete rb_event_is_commit()
ring-buffer: Add test to validate the time stamp deltas
ftrace/documentation: Fix RST C code blocks
tracing: Clean up after filter logic rewriting
tracing: Remove the useless value assignment in test_create_synth_event()
livepatch: Use the default ftrace_ops instead of REGS when ARGS is available
ftrace/x86: Allow for arguments to be passed in to ftrace_regs by default
ftrace: Have the callbacks receive a struct ftrace_regs instead of pt_regs
MAINTAINERS: assign ./fs/tracefs to TRACING
tracing: Fix some typos in comments
ftrace: Remove unused varible 'ret'
ring-buffer: Add recording of ring buffer recursion into recursed_functions
...
There are a couple of subsystems maintained by other people that
merge their drivers through the SoC tree, those changes include:
- The SCMI firmware framework gains support for sensor notifications
and for controlling voltage domains.
- A large update for the Tegra memory controller driver, integrating
it better with the interconnect framework
- The memory controller subsystem gains support for Mediatek MT8192
- The reset controller framework gains support for sharing pulsed
resets
For Soc specific drivers in drivers/soc, the main changes are
- The Allwinner/sunxi MBUS gets a rework for the way it handles
dma_map_ops and offsets between physical and dma address spaces.
- An errata fix plus some cleanups for Freescale Layerscape SoCs
- A cleanup for renesas drivers regarding MMIO accesses.
- New SoC specific drivers for Mediatek MT8192 and MT8183 power domains
- New SoC specific drivers for Aspeed AST2600 LPC bus control
and SoC identification.
- Core Power Domain support for Qualcomm MSM8916, MSM8939, SDM660
and SDX55.
- A rework of the TI AM33xx 'genpd' power domain support to use
information from DT instead of platform data
- Support for TI AM64x SoCs
- Allow building some Amlogic drivers as modules instead of built-in
Finally, there are numerous cleanups and smaller bug fixes for
Mediatek, Tegra, Samsung, Qualcomm, TI OMAP, Amlogic, Rockchips,
Renesas, and Xilinx SoCs.
There is a trivial conflict in the cedrus driver, with two branches
adding the same CEDRUS_CAPABILITY_H265_DEC flag, and another trivial
remove/remove conflict in linux/dma-mapping.h.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAl/alSUACgkQmmx57+YA
GNm7GRAAlNMVi7F0f4Ixf1bEh+J2QUonYIpZfrdxOLFwISGQ+nstGrFW2He/OeQv
KAi027tZLl6Sdzjy809cLDPA4Z2IKwjVWhEbBHybvy1+irPYjnixtLd0x3YvPhjH
iadlcjQ3uaGue8PvubK6CVnBEy82A+Pp29n9i4A4wX/8w+BVIhVsxwQWUBF8pFXE
3La2UZYZMVMvVZMrpTOqwCgdmLDCk+RLMVZ1IiRqBEBq5/DVq03uIXgjGEOrq8tl
PXC89w7K510Is891mbBdBThQf+pZkU1vwORuknDcEJKWs9ngbEha7ebVgp32kbFl
pi8DEK205d106WQgfn0Zxkpbsp8XD058wDILwkhBcteXlBaUEL6btGVLDTUCJZuv
/pkH8tL4lNGpThQFbCEXC8oHZBp2xk55P+SW9RRZOoA5tAp+sz7hlf3y3YKdCSxv
4xybeeVOAgjl01WtbEC7CuIkqcKVSQ7njhLhC8r5ASteNywDThqxLT6nd0VegcQc
YH3Eu9QRXpvFwQ35zMkTMWa27bMG5d60fp90bWT0R5amXZpxJJot87w8trFCxv74
mE5KvCbefCRNsTt8GOBA/WR7hVaG369g07qOvs7g4LjJEM3Nl2h/A4/zVFef9O0t
yq3Nm4YCGfDSAQXzGR2SJ3nxiqbDknzJTAtZPf4BmbaMuPOIJ5k=
=BjJf
-----END PGP SIGNATURE-----
Merge tag 'arm-soc-drivers-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC driver updates from Arnd Bergmann:
"There are a couple of subsystems maintained by other people that merge
their drivers through the SoC tree, those changes include:
- The SCMI firmware framework gains support for sensor notifications
and for controlling voltage domains.
- A large update for the Tegra memory controller driver, integrating
it better with the interconnect framework
- The memory controller subsystem gains support for Mediatek MT8192
- The reset controller framework gains support for sharing pulsed
resets
For Soc specific drivers in drivers/soc, the main changes are
- The Allwinner/sunxi MBUS gets a rework for the way it handles
dma_map_ops and offsets between physical and dma address spaces.
- An errata fix plus some cleanups for Freescale Layerscape SoCs
- A cleanup for renesas drivers regarding MMIO accesses.
- New SoC specific drivers for Mediatek MT8192 and MT8183 power
domains
- New SoC specific drivers for Aspeed AST2600 LPC bus control and SoC
identification.
- Core Power Domain support for Qualcomm MSM8916, MSM8939, SDM660 and
SDX55.
- A rework of the TI AM33xx 'genpd' power domain support to use
information from DT instead of platform data
- Support for TI AM64x SoCs
- Allow building some Amlogic drivers as modules instead of built-in
Finally, there are numerous cleanups and smaller bug fixes for
Mediatek, Tegra, Samsung, Qualcomm, TI OMAP, Amlogic, Rockchips,
Renesas, and Xilinx SoCs"
* tag 'arm-soc-drivers-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (222 commits)
soc: mediatek: mmsys: Specify HAS_IOMEM dependency for MTK_MMSYS
firmware: xilinx: Properly align function parameter
firmware: xilinx: Add a blank line after function declaration
firmware: xilinx: Remove additional newline
firmware: xilinx: Fix kernel-doc warnings
firmware: xlnx-zynqmp: fix compilation warning
soc: xilinx: vcu: add missing register NUM_CORE
soc: xilinx: vcu: use vcu-settings syscon registers
dt-bindings: soc: xlnx: extract xlnx, vcu-settings to separate binding
soc: xilinx: vcu: drop useless success message
clk: samsung: mark PM functions as __maybe_unused
soc: samsung: exynos-chipid: initialize later - with arch_initcall
soc: samsung: exynos-chipid: order list of SoCs by name
memory: jz4780_nemc: Fix potential NULL dereference in jz4780_nemc_probe()
memory: ti-emif-sram: only build for ARMv7
memory: tegra30: Support interconnect framework
memory: tegra20: Support hardware versioning and clean up OPP table initialization
dt-bindings: memory: tegra20-emc: Document opp-supported-hw property
soc: rockchip: io-domain: Fix error return code in rockchip_iodomain_probe()
reset-controller: ti: force the write operation when assert or deassert
...
Pull swiotlb update from Konrad Rzeszutek Wilk:
"A generic (but for right now engaged only with AMD SEV) mechanism to
adjust a larger size SWIOTLB based on the total memory of the SEV
guests which right now require the bounce buffer for interacting with
the outside world.
Normal knobs (swiotlb=XYZ) still work"
* 'stable/for-linus-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
x86,swiotlb: Adjust SWIOTLB bounce buffer size for SEV guests
VCPU_REGS_R8...VCPU_REGS_R15 are not defined on 32-bit x86,
so cull them from the synchronization of the VMSA.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This cleans up two ancient timer features that were never completed in
the past, CONFIG_GENERIC_CLOCKEVENTS and CONFIG_ARCH_USES_GETTIMEOFFSET.
There was only one user left for the ARCH_USES_GETTIMEOFFSET variant
of clocksource implementations, the ARM EBSA110 platform. Rather than
changing to use modern timekeeping, we remove the platform entirely as
Russell no longer uses his machine and nobody else seems to have one
any more.
The conditional code for using arch_gettimeoffset() is removed as
a result.
For CONFIG_GENERIC_CLOCKEVENTS, there are still a couple of platforms
not using clockevent drivers: parisc, ia64, most of m68k, and one
Arm platform. These all do timer ticks slighly differently, and this
gets cleaned up to the point they at least all call the same helper
function. Instead of most platforms using 'select GENERIC_CLOCKEVENTS'
in Kconfig, the polarity is now reversed, with the few remaining ones
selecting LEGACY_TIMER_TICK instead.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAl/Y1v8ACgkQmmx57+YA
GNmCvQ/9EDlgCt92r8SB+LGafDtgB8TUQZeIrs9S2mByzdxwnw0lxObIXFCnhQgh
RpG3dR+ONRDnC5eI149B377JOEFMZWe2+BtYHUHkFARtUEWatslQcz7yAGvVRK/l
TS/qReb6piKltlzuanF1bMZbjy2OhlaDRcm+OlC3y5mALR33M4emb+rJ6cSdfk3K
v1iZhrxtfQT77ztesh/oPkPiyQ6kNcz7SfpyYOb6f5VLlml2BZ7YwBSVyGY7urHk
RL3XqOUP4KKlMEAI8w0E2nvft6Fk+luziBhrMYWK0GvbmI1OESENuX/c6tgT2OQ1
DRaVHvcPG/EAY8adOKxxVyHhEJDSoz5GJV/EtjlOegsJk6RomczR1uuiT3Kvm7Ah
PktMKv4xQht1E15KPSKbOvNIEP18w2s5z6gw+jVDv8pw42pVEQManm1D+BICqrhl
fcpw6T1drf9UxAjwX4+zXtmNs+a+mqiFG8puU4VVgT4GpQ8umHvunXz2WUjZO0jc
3m8ErJHBvtJwW5TOHGyXnjl9SkwPzHOfF6IcXTYWEDU4/gQIK9TwUvCjLc0lE27t
FMCV2ds7/K1CXwRgpa5IrefSkb8yOXSbRZ56NqqF7Ekxw4J5bYRSaY7jb+qD/e+3
5O1y+iPxFrpH+16hSahvzrtcdFNbLQvBBuRtEQOYuHLt2UJrNoU=
=QpNs
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-timers-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic cross-architecture timer cleanup from Arnd Bergmann:
"This cleans up two ancient timer features that were never completed in
the past, CONFIG_GENERIC_CLOCKEVENTS and CONFIG_ARCH_USES_GETTIMEOFFSET.
There was only one user left for the ARCH_USES_GETTIMEOFFSET variant
of clocksource implementations, the ARM EBSA110 platform. Rather than
changing to use modern timekeeping, we remove the platform entirely as
Russell no longer uses his machine and nobody else seems to have one
any more.
The conditional code for using arch_gettimeoffset() is removed as a
result.
For CONFIG_GENERIC_CLOCKEVENTS, there are still a couple of platforms
not using clockevent drivers: parisc, ia64, most of m68k, and one Arm
platform. These all do timer ticks slighly differently, and this gets
cleaned up to the point they at least all call the same helper
function.
Instead of most platforms using 'select GENERIC_CLOCKEVENTS' in
Kconfig, the polarity is now reversed, with the few remaining ones
selecting LEGACY_TIMER_TICK instead"
* tag 'asm-generic-timers-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
timekeeping: default GENERIC_CLOCKEVENTS to enabled
timekeeping: remove xtime_update
m68k: remove timer_interrupt() function
m68k: change remaining timers to legacy_timer_tick
m68k: m68328: use legacy_timer_tick()
m68k: sun3/sun3c: use legacy_timer_tick
m68k: split heartbeat out of timer function
m68k: coldfire: use legacy_timer_tick()
parisc: use legacy_timer_tick
ARM: rpc: use legacy_timer_tick
ia64: convert to legacy_timer_tick
timekeeping: add CONFIG_LEGACY_TIMER_TICK
timekeeping: remove arch_gettimeoffset
net: remove am79c961a driver
ARM: remove ebsa110 platform
This is a cleanup series from Nicholas Piggin, preparing for
later changes. The asm/mmu_context.h header are generalized
and common code moved to asm-gneneric/mmu_context.h.
This saves a bit of code and makes it easier to change in
the future.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAl/Y1LsACgkQmmx57+YA
GNm6kBAAq4/n6nuNnh6b9LhjXaZRG75gEyW7JvHl8KE5wmZHwDHqbwiQgU1b3lUs
JJGbfKqi5ASKxNg6MpfYodmCOqeTUUYG0FUCb6lMhcxxMdfLTLYBvkNd6Y143M+T
boi5b/iz+OUQdNPzlVeSsUEVsD59FIXmP/GhscWZN9VAyf/aLV2MDBIOhrDSJlPo
ObexnP0Iw1E1NRQYDQ6L2dKTHa6XmHyUtw40ABPmd/6MSd1S+D+j3FGg+CYmvnzG
k9g8FbNby8xtUfc0pZV4W/322WN8cDFF9bc04eTDZiAv1bk9lmfvWJ2bWjs3s2qt
RO/suiZEOAta/WUX9vVLgYn2td00ef+AyjNUgffiUfvQfl++fiCDFTGl+MoCLjbh
xQUPcRuRdED7bMKNrC0CcDOSwWEBWVXvkU/szBLDeE1sPjXzGQ80q1Y72k9y961I
mqg7FrHqjZsxT9luXMAzClHNhXAtvehkJZBIdHlFok83EFoTQp48Da4jaDuOOhlq
p/lkPJWOHegIQMWtGwRyGmG1qzil7b/QBNAPLgu9pF4TA+ySRBEB2BOr2jRSkj6N
mNTHQbSYxBoktdt+VhtrSsxR+i8lwlegx+RNRFmKK3VH5da2nfiBaOY7zBQQHxCK
yxQvXvsljSVpfkFKLc/S2nLQL1zTkRfFKV1Xmd3+3owR+EoqM60=
=NpMX
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-mmu-context-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic mmu-context cleanup from Arnd Bergmann:
"This is a cleanup series from Nicholas Piggin, preparing for later
changes. The asm/mmu_context.h header are generalized and common code
moved to asm-gneneric/mmu_context.h.
This saves a bit of code and makes it easier to change in the future"
* tag 'asm-generic-mmu-context-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (25 commits)
h8300: Fix generic mmu_context build
m68k: mmu_context: Fix Sun-3 build
xtensa: use asm-generic/mmu_context.h for no-op implementations
x86: use asm-generic/mmu_context.h for no-op implementations
um: use asm-generic/mmu_context.h for no-op implementations
sparc: use asm-generic/mmu_context.h for no-op implementations
sh: use asm-generic/mmu_context.h for no-op implementations
s390: use asm-generic/mmu_context.h for no-op implementations
riscv: use asm-generic/mmu_context.h for no-op implementations
powerpc: use asm-generic/mmu_context.h for no-op implementations
parisc: use asm-generic/mmu_context.h for no-op implementations
openrisc: use asm-generic/mmu_context.h for no-op implementations
nios2: use asm-generic/mmu_context.h for no-op implementations
nds32: use asm-generic/mmu_context.h for no-op implementations
mips: use asm-generic/mmu_context.h for no-op implementations
microblaze: use asm-generic/mmu_context.h for no-op implementations
m68k: use asm-generic/mmu_context.h for no-op implementations
ia64: use asm-generic/mmu_context.h for no-op implementations
hexagon: use asm-generic/mmu_context.h for no-op implementations
csky: use asm-generic/mmu_context.h for no-op implementations
...
Moving XEN_512GB allows it to nest under XEN_PV. That also allows
XEN_PVH to nest under XEN as a sibling to XEN_PV and XEN_PVHVM giving:
[*] Xen guest support
[*] Xen PV guest support
[*] Limit Xen pv-domain memory to 512GB
[*] Xen PV Dom0 support
[*] Xen PVHVM guest support
[*] Xen PVH guest support
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20201014175342.152712-3-jandryuk@gmail.com
Signed-off-by: Juergen Gross <jgross@suse.com>
A Xen PVH domain doesn't have a PCI bus or devices, so it doesn't need
PCI support built in. Currently, XEN_PVH depends on XEN_PVHVM which
depends on PCI.
Introduce XEN_PVHVM_GUEST as a toplevel item and change XEN_PVHVM to a
hidden variable. This allows XEN_PVH to depend on XEN_PVHVM without PCI
while XEN_PVHVM_GUEST depends on PCI.
In drivers/xen, compile platform-pci depending on XEN_PVHVM_GUEST since
that pulls in the PCI dependency for linking.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20201014175342.152712-2-jandryuk@gmail.com
Signed-off-by: Juergen Gross <jgross@suse.com>
- Use local_clock() instead of jiffies in the cpufreq statistics to
improve accuracy (Viresh Kumar).
- Fix up OPP usage in the cpufreq-dt and qcom-cpufreq-nvmem cpufreq
drivers (Viresh Kumar).
- Clean up the cpufreq core, the intel_pstate driver and the
schedutil cpufreq governor (Rafael Wysocki).
- Fix up error code paths in the sti-cpufreq and mediatek cpufreq
drivers (Yangtao Li, Qinglang Miao).
- Fix cpufreq_online() to return error codes instead of success (0)
in all cases when it fails (Wang ShaoBo).
- Add mt8167 support to the mediatek cpufreq driver and blacklist
mt8516 in the cpufreq-dt-platdev driver (Fabien Parent).
- Modify the tegra194 cpufreq driver to always return values from
the frequency table as the current frequency and clean up that
driver (Sumit Gupta, Jon Hunter).
- Modify the arm_scmi cpufreq driver to allow it to discover the
power scale present in the performance protocol and provide this
information to the Energy Model (Lukasz Luba).
- Add missing MODULE_DEVICE_TABLE to several cpufreq drivers (Pali
Rohár).
- Clean up the CPPC cpufreq driver (Ionela Voinescu).
- Fix NVMEM_IMX_OCOTP dependency in the imx cpufreq driver (Arnd
Bergmann).
- Rework the poling interval selection for the polling state in
cpuidle (Mel Gorman).
- Enable suspend-to-idle for PSCI OSI mode in the PSCI cpuidle
driver (Ulf Hansson).
- Modify the OPP framework to support empty (node-less) OPP tables
in DT for passing dependency information (Nicola Mazzucato).
- Fix potential lockdep issue in the OPP core and clean up the OPP
core (Viresh Kumar).
- Modify dev_pm_opp_put_regulators() to accept a NULL argument and
update its users accordingly (Viresh Kumar).
- Add frequency changes tracepoint to devfreq (Matthias Kaehlcke).
- Add support for governor feature flags to devfreq, make devfreq
sysfs file permissions depend on the governor and clean up the
devfreq core (Chanwoo Choi).
- Clean up the tegra20 devfreq driver and deprecate it to allow
another driver based on EMC_STAT to be used instead of it (Dmitry
Osipenko).
- Add interconnect support to the tegra30 devfreq driver, allow it
to take the interconnect and OPP information from DT and clean it
up ((Dmitry Osipenko).
- Add interconnect support to the exynos-bus devfreq driver along
with interconnect properties documentation (Sylwester Nawrocki).
- Add suport for AMD Fam17h and Fam19h processors to the RAPL power
capping driver (Victor Ding, Kim Phillips).
- Fix handling of overly long constraint names in the powercap
framework (Lukasz Luba).
- Fix the wakeup configuration handling for bridges in the ACPI
device power management core (Rafael Wysocki).
- Add support for using an abstract scale for power units in the
Energy Model (EM) and document it (Lukasz Luba).
- Add em_cpu_energy() micro-optimization to the EM (Pavankumar
Kondeti).
- Modify the generic power domains (genpd) framwework to support
suspend-to-idle (Ulf Hansson).
- Fix creation of debugfs nodes in genpd (Thierry Strudel).
- Clean up genpd (Lina Iyer).
- Clean up the core system-wide suspend code and make it print
driver flags for devices with debug enabled (Alex Shi, Patrice
Chotard, Chen Yu).
- Modify the ACPI system reboot code to make it prepare for system
power off to avoid confusing the platform firmware (Kai-Heng Feng).
- Update the pm-graph (multiple changes, mostly usability-related)
and cpupower (online and offline CPU information support) PM
utilities (Todd Brandt, Brahadambal Srinivasan).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl/Y8mcSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxjY4QAKsNFJeEtjGCxq7MxQIML3QLAsdJM9of
9kkY9skMEw4v1TRmyy7sW9jZW2pLSRcLJwWRKWu4143qUS3YUp2DQ0lqX4WyXoWu
BhnkhkMUl6iCeBO8CWnt8zsTuqSa20A13sL9LyqN1+7OZKHD8StbT4hKjBncdNNN
4aDj+1uAPyOgj2iCUZuHQ8DtpBvOLjgTh367vbhbufjeJ//8/9+R7s4Xzrj7wtmv
JlE0LDgvge9QeGTpjhxQJzn0q2/H5fg9jbmjPXUfbHJNuyKhrqnmjGyrN5m256JI
8DqGqQtJpmFp7Ihrur3uKTk3gWO05YwJ1FdeEooAKEjEMObm5xuYhKVRoDhmlJAu
G6ui+OAUvNR0FffJtbzvWe/pLovLGOEOHdvTrZxUF8Abo6br3untTm8rKTi1fhaF
wWndSMw0apGsPzCx5T+bE7AbJz2QHFpLhaVAutenuCzNI8xoMlxNKEzsaVz/+FqL
Pq/PdFaM4vNlMbv7hkb/fujkCs/v3EcX2ihzvt7I2o8dBS0D1X8A4mnuWJmiGslw
1ftbJ6M9XacwkPBTHPgeXxJh2C1yxxe5VQ9Z5fWWi7sPOUeJnUwxKaluv+coFndQ
sO6JxsPQ4hQihg8yOxLEkL6Wn68sZlmp+u2Oj+TPFAsAGANIA8rJlBPo1ppJWvdQ
j1OCIc/qzwpH
=BVdX
-----END PGP SIGNATURE-----
Merge tag 'pm-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These update cpufreq (core and drivers), cpuidle (polling state
implementation and the PSCI driver), the OPP (operating performance
points) framework, devfreq (core and drivers), the power capping RAPL
(Running Average Power Limit) driver, the Energy Model support, the
generic power domains (genpd) framework, the ACPI device power
management, the core system-wide suspend code and power management
utilities.
Specifics:
- Use local_clock() instead of jiffies in the cpufreq statistics to
improve accuracy (Viresh Kumar).
- Fix up OPP usage in the cpufreq-dt and qcom-cpufreq-nvmem cpufreq
drivers (Viresh Kumar).
- Clean up the cpufreq core, the intel_pstate driver and the
schedutil cpufreq governor (Rafael Wysocki).
- Fix up error code paths in the sti-cpufreq and mediatek cpufreq
drivers (Yangtao Li, Qinglang Miao).
- Fix cpufreq_online() to return error codes instead of success (0)
in all cases when it fails (Wang ShaoBo).
- Add mt8167 support to the mediatek cpufreq driver and blacklist
mt8516 in the cpufreq-dt-platdev driver (Fabien Parent).
- Modify the tegra194 cpufreq driver to always return values from the
frequency table as the current frequency and clean up that driver
(Sumit Gupta, Jon Hunter).
- Modify the arm_scmi cpufreq driver to allow it to discover the
power scale present in the performance protocol and provide this
information to the Energy Model (Lukasz Luba).
- Add missing MODULE_DEVICE_TABLE to several cpufreq drivers (Pali
Rohár).
- Clean up the CPPC cpufreq driver (Ionela Voinescu).
- Fix NVMEM_IMX_OCOTP dependency in the imx cpufreq driver (Arnd
Bergmann).
- Rework the poling interval selection for the polling state in
cpuidle (Mel Gorman).
- Enable suspend-to-idle for PSCI OSI mode in the PSCI cpuidle driver
(Ulf Hansson).
- Modify the OPP framework to support empty (node-less) OPP tables in
DT for passing dependency information (Nicola Mazzucato).
- Fix potential lockdep issue in the OPP core and clean up the OPP
core (Viresh Kumar).
- Modify dev_pm_opp_put_regulators() to accept a NULL argument and
update its users accordingly (Viresh Kumar).
- Add frequency changes tracepoint to devfreq (Matthias Kaehlcke).
- Add support for governor feature flags to devfreq, make devfreq
sysfs file permissions depend on the governor and clean up the
devfreq core (Chanwoo Choi).
- Clean up the tegra20 devfreq driver and deprecate it to allow
another driver based on EMC_STAT to be used instead of it (Dmitry
Osipenko).
- Add interconnect support to the tegra30 devfreq driver, allow it to
take the interconnect and OPP information from DT and clean it up
(Dmitry Osipenko).
- Add interconnect support to the exynos-bus devfreq driver along
with interconnect properties documentation (Sylwester Nawrocki).
- Add suport for AMD Fam17h and Fam19h processors to the RAPL power
capping driver (Victor Ding, Kim Phillips).
- Fix handling of overly long constraint names in the powercap
framework (Lukasz Luba).
- Fix the wakeup configuration handling for bridges in the ACPI
device power management core (Rafael Wysocki).
- Add support for using an abstract scale for power units in the
Energy Model (EM) and document it (Lukasz Luba).
- Add em_cpu_energy() micro-optimization to the EM (Pavankumar
Kondeti).
- Modify the generic power domains (genpd) framwework to support
suspend-to-idle (Ulf Hansson).
- Fix creation of debugfs nodes in genpd (Thierry Strudel).
- Clean up genpd (Lina Iyer).
- Clean up the core system-wide suspend code and make it print driver
flags for devices with debug enabled (Alex Shi, Patrice Chotard,
Chen Yu).
- Modify the ACPI system reboot code to make it prepare for system
power off to avoid confusing the platform firmware (Kai-Heng Feng).
- Update the pm-graph (multiple changes, mostly usability-related)
and cpupower (online and offline CPU information support) PM
utilities (Todd Brandt, Brahadambal Srinivasan)"
* tag 'pm-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (86 commits)
cpufreq: Fix cpufreq_online() return value on errors
cpufreq: Fix up several kerneldoc comments
cpufreq: stats: Use local_clock() instead of jiffies
cpufreq: schedutil: Simplify sugov_update_next_freq()
cpufreq: intel_pstate: Simplify intel_cpufreq_update_pstate()
PM: domains: create debugfs nodes when adding power domains
opp: of: Allow empty opp-table with opp-shared
dt-bindings: opp: Allow empty OPP tables
media: venus: dev_pm_opp_put_*() accepts NULL argument
drm/panfrost: dev_pm_opp_put_*() accepts NULL argument
drm/lima: dev_pm_opp_put_*() accepts NULL argument
PM / devfreq: exynos: dev_pm_opp_put_*() accepts NULL argument
cpufreq: qcom-cpufreq-nvmem: dev_pm_opp_put_*() accepts NULL argument
cpufreq: dt: dev_pm_opp_put_regulators() accepts NULL argument
opp: Allow dev_pm_opp_put_*() APIs to accept NULL opp_table
opp: Don't create an OPP table from dev_pm_opp_get_opp_table()
cpufreq: dt: Don't (ab)use dev_pm_opp_get_opp_table() to create OPP table
opp: Reduce the size of critical section in _opp_kref_release()
PM / EM: Micro optimization in em_cpu_energy
cpufreq: arm_scmi: Discover the power scale in performance protocol
...
Merge misc updates from Andrew Morton:
- a few random little subsystems
- almost all of the MM patches which are staged ahead of linux-next
material. I'll trickle to post-linux-next work in as the dependents
get merged up.
Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
mm: cleanup kstrto*() usage
mm: fix fall-through warnings for Clang
mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
mm:backing-dev: use sysfs_emit in macro defining functions
mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
mm: use sysfs_emit for struct kobject * uses
mm: fix kernel-doc markups
zram: break the strict dependency from lzo
zram: add stat to gather incompressible pages since zram set up
zram: support page writeback
mm/process_vm_access: remove redundant initialization of iov_r
mm/zsmalloc.c: rework the list_add code in insert_zspage()
mm/zswap: move to use crypto_acomp API for hardware acceleration
mm/zswap: fix passing zero to 'PTR_ERR' warning
mm/zswap: make struct kernel_param_ops definitions const
userfaultfd/selftests: hint the test runner on required privilege
userfaultfd/selftests: fix retval check for userfaultfd_open()
userfaultfd/selftests: always dump something in modes
userfaultfd: selftests: make __{s,u}64 format specifiers portable
...
For architectures that enable ARCH_HAS_SET_MEMORY having the ability to
verify that a page is mapped in the kernel direct map can be useful
regardless of hibernation.
Add RISC-V implementation of kernel_page_present(), update its forward
declarations and stubs to be a part of set_memory API and remove ugly
ifdefery in inlcude/linux/mm.h around current declarations of
kernel_page_present().
Link: https://lkml.kernel.org/r/20201109192128.960-5-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The design of DEBUG_PAGEALLOC presumes that __kernel_map_pages() must
never fail. With this assumption is wouldn't be safe to allow general
usage of this function.
Moreover, some architectures that implement __kernel_map_pages() have this
function guarded by #ifdef DEBUG_PAGEALLOC and some refuse to map/unmap
pages when page allocation debugging is disabled at runtime.
As all the users of __kernel_map_pages() were converted to use
debug_pagealloc_map_pages() it is safe to make it available only when
DEBUG_PAGEALLOC is set.
Link: https://lkml.kernel.org/r/20201109192128.960-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't allow splitting of vm_special_mapping's. It affects vdso/vvar
areas. Uprobes have only one page in xol_area so they aren't affected.
Those restrictions were enforced by checks in .mremap() callbacks.
Restrict resizing with generic .split() callback.
Link: https://lkml.kernel.org/r/20201013013416.390574-7-dima@arista.com
Signed-off-by: Dmitry Safonov <dima@arista.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As kernel expect to see only one of such mappings, any further operations
on the VMA-copy may be unexpected by the kernel. Maybe it's being on the
safe side, but there doesn't seem to be any expected use-case for this, so
restrict it now.
Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
Fixes: commit e346b38130 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
HAVE_MOVE_PUD enables remapping pages at the PUD level if both the
source and destination addresses are PUD-aligned.
With HAVE_MOVE_PUD enabled it can be inferred that there is
approximately a 13x improvement in performance on x86. (See data
below).
------- Test Results ---------
The following results were obtained using a 5.4 kernel, by remapping
a PUD-aligned, 1GB sized region to a PUD-aligned destination.
The results from 10 iterations of the test are given below:
Total mremap times for 1GB data on x86. All times are in nanoseconds.
Control HAVE_MOVE_PUD
180394 15089
235728 14056
238931 25741
187330 13838
241742 14187
177925 14778
182758 14728
160872 14418
205813 15107
245722 13998
205721.5 15594 <-- Mean time in nanoseconds
A 1GB mremap completion time drops from ~205 microseconds
to ~15 microseconds on x86. (~13x speed up).
Link: https://lkml.kernel.org/r/20201014005320.2233162-6-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Hassan Naveed <hnaveed@wavecomp.com>
Cc: Jia He <justin.he@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 70e806e4e6 ("mm: Do early cow for pinned pages during
fork() for ptes") pages under a FOLL_PIN will not be write protected
during COW for fork. This means that pages returned from
pin_user_pages(FOLL_WRITE) should not become write protected while the pin
is active.
However, there is a small race where get_user_pages_fast(FOLL_PIN) can
establish a FOLL_PIN at the same time copy_present_page() is write
protecting it:
CPU 0 CPU 1
get_user_pages_fast()
internal_get_user_pages_fast()
copy_page_range()
pte_alloc_map_lock()
copy_present_page()
atomic_read(has_pinned) == 0
page_maybe_dma_pinned() == false
atomic_set(has_pinned, 1);
gup_pgd_range()
gup_pte_range()
pte_t pte = gup_get_pte(ptep)
pte_access_permitted(pte)
try_grab_compound_head()
pte = pte_wrprotect(pte)
set_pte_at();
pte_unmap_unlock()
// GUP now returns with a write protected page
The first attempt to resolve this by using the write protect caused
problems (and was missing a barrrier), see commit f3c64eda3e ("mm: avoid
early COW write protect games during fork()")
Instead wrap copy_p4d_range() with the write side of a seqcount and check
the read side around gup_pgd_range(). If there is a collision then
get_user_pages_fast() fails and falls back to slow GUP.
Slow GUP is safe against this race because copy_page_range() is only
called while holding the exclusive side of the mmap_lock on the src
mm_struct.
[akpm@linux-foundation.org: coding style fixes]
Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
Fixes: f3c64eda3e ("mm: avoid early COW write protect games during fork()")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de> [seqcount_t parts]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- PSCI relay at EL2 when "protected KVM" is enabled
- New exception injection code
- Simplification of AArch32 system register handling
- Fix PMU accesses when no PMU is enabled
- Expose CSV3 on non-Meltdown hosts
- Cache hierarchy discovery fixes
- PV steal-time cleanups
- Allow function pointers at EL2
- Various host EL2 entry cleanups
- Simplification of the EL2 vector allocation
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl/XoggPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDsRYP/3ZtGWsyBc1sKdaTBIwQdnrPQHL+7o1Mmjnl
b+YqRMWcJW4g3O81GW6IA+vM0A1UMJxVOjzkZd8KulGv3RCZiqQmWJClWFlYbwLj
e+HHx+Zo/qsmDrwcVoFI8/n+iC/a5fIaCbSWMSPaKHrOMxBiHQk0qlaq4AZ8gb7a
/eHYqI/hISJQb1ZVFHmwlp8FoMnB2M6/FDpCf8oeGKjpF2hjghIPugJ0oRlPLZjB
o3Q6ELEScJV1wBy7d1+5rkm52t9j8gpGhXxja0QwypADNzk5KHEzghXq+rTWUh1S
et9OfqkflMtKMsh0qNwe5ZFbqtsH69qtYMAj4ok7rZOwQcbJ97VSrP5ka7VVzSdC
AgcQU9c9LoyQ7rk0dbs3t0cd8hMgVu50guZ/iHfW88CcdykN9M0nnSPRAYpNbW85
xndBQ5k/a4FoufwoY4e0hS28HIiRfLoEA68mps+yoMiiKh27HO2v4GFRIJoCNxzp
YQ01zOBp9FKYTsxj0h7mMf+5EEyo9E4X/kJOfZpOVVbVKy82wPAGLJpDEnbnoJUe
j1jBmiV/trkn+nTnWmDoXcw2ljuIF9dBm2M8r8yGKdNEHptnN8tMVRlCRImVVWW0
BbZGAzoK0tpKXPIlUh4aXS3mtV9qlohs9rzjVyKfGnaRRbRGANM8qrH5aKuDFinM
RugpMWyk
=hf4L
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 5.11
- PSCI relay at EL2 when "protected KVM" is enabled
- New exception injection code
- Simplification of AArch32 system register handling
- Fix PMU accesses when no PMU is enabled
- Expose CSV3 on non-Meltdown hosts
- Cache hierarchy discovery fixes
- PV steal-time cleanups
- Allow function pointers at EL2
- Various host EL2 entry cleanups
- Simplification of the EL2 vector allocation
The GHCB specification requires the hypervisor to save the address of an
AP Jump Table so that, for example, vCPUs that have been parked by UEFI
can be started by the OS. Provide support for the AP Jump Table set/get
exit code.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This function uses irq_to_desc() and is going to be used by modules to
replace the open coded irq_to_desc() (ab)usage. The final goal is to remove
the export of irq_to_desc() so driver cannot fiddle with it anymore.
Move it into the core code and fixup the usage sites to include the proper
header.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201210194042.548936472@linutronix.de
* pm-sleep:
PM: sleep: Add dev_wakeup_path() helper
PM / suspend: fix kernel-doc markup
PM: sleep: Print driver flags for all devices during suspend/resume
* pm-acpi:
PM: ACPI: Refresh wakeup device power configuration every time
PM: ACPI: PCI: Drop acpi_pm_set_bridge_wakeup()
PM: ACPI: reboot: Use S5 for reboot
* pm-domains:
PM: domains: create debugfs nodes when adding power domains
PM: domains: replace -ENOTSUPP with -EOPNOTSUPP
* powercap:
powercap: Adjust printing the constraint name with new line
powercap: RAPL: Add AMD Fam19h RAPL support
powercap: Add AMD Fam17h RAPL support
powercap/intel_rapl_msr: Convert rapl_msr_priv into pointer
x86/msr-index: sort AMD RAPL MSRs by address
An SEV-ES guest is started by invoking a new SEV initialization ioctl,
KVM_SEV_ES_INIT. This identifies the guest as an SEV-ES guest, which is
used to drive the appropriate ASID allocation, VMSA encryption, etc.
Before being able to run an SEV-ES vCPU, the vCPU VMSA must be encrypted
and measured. This is done using the LAUNCH_UPDATE_VMSA command after all
calls to LAUNCH_UPDATE_DATA have been performed, but before LAUNCH_MEASURE
has been performed. In order to establish the encrypted VMSA, the current
(traditional) VMSA and the GPRs are synced to the page that will hold the
encrypted VMSA and then LAUNCH_UPDATE_VMSA is invoked. The vCPU is then
marked as having protected guest state.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e9643245adb809caf3a87c09997926d2f3d6ff41.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The run sequence is different for an SEV-ES guest compared to a legacy or
even an SEV guest. The guest vCPU register state of an SEV-ES guest will
be restored on VMRUN and saved on VMEXIT. There is no need to restore the
guest registers directly and through VMLOAD before VMRUN and no need to
save the guest registers directly and through VMSAVE on VMEXIT.
Update the svm_vcpu_run() function to skip register state saving and
restoring and provide an alternative function for running an SEV-ES guest
in vmenter.S
Additionally, certain host state is restored across an SEV-ES VMRUN. As
a result certain register states are not required to be restored upon
VMEXIT (e.g. FS, GS, etc.), so only do that if the guest is not an SEV-ES
guest.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <fb1c66d32f2194e171b95fc1a8affd6d326e10c1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
An SEV-ES vCPU requires additional VMCB vCPU load/put requirements. SEV-ES
hardware will restore certain registers on VMEXIT, but not save them on
VMRUN (see Table B-3 and Table B-4 of the AMD64 APM Volume 2), so make the
following changes:
General vCPU load changes:
- During vCPU loading, perform a VMSAVE to the per-CPU SVM save area and
save the current values of XCR0, XSS and PKRU to the per-CPU SVM save
area as these registers will be restored on VMEXIT.
General vCPU put changes:
- Do not attempt to restore registers that SEV-ES hardware has already
restored on VMEXIT.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <019390e9cb5e93cd73014fa5a040c17d42588733.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
An SEV-ES vCPU requires additional VMCB initialization requirements for
vCPU creation and vCPU load/put requirements. This includes:
General VMCB initialization changes:
- Set a VMCB control bit to enable SEV-ES support on the vCPU.
- Set the VMCB encrypted VM save area address.
- CRx registers are part of the encrypted register state and cannot be
updated. Remove the CRx register read and write intercepts and replace
them with CRx register write traps to track the CRx register values.
- Certain MSR values are part of the encrypted register state and cannot
be updated. Remove certain MSR intercepts (EFER, CR_PAT, etc.).
- Remove the #GP intercept (no support for "enable_vmware_backdoor").
- Remove the XSETBV intercept since the hypervisor cannot modify XCR0.
General vCPU creation changes:
- Set the initial GHCB gpa value as per the GHCB specification.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <3a8aef366416eddd5556dfa3fdc212aafa1ad0a2.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SEV and SEV-ES guests each have dedicated ASID ranges. Update the ASID
allocation routine to return an ASID in the respective range.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <d7aed505e31e3954268b2015bb60a1486269c780.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The SVM host save area is used to restore some host state on VMEXIT of an
SEV-ES guest. After allocating the save area, clear it and add the
encryption mask to the SVM host save area physical address that is
programmed into the VM_HSAVE_PA MSR.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b77aa28af6d7f1a0cb545959e08d6dc75e0c3cba.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The GHCB specification defines how NMIs are to be handled for an SEV-ES
guest. To detect the completion of an NMI the hypervisor must not
intercept the IRET instruction (because a #VC while running the NMI will
issue an IRET) and, instead, must receive an NMI Complete exit event from
the guest.
Update the KVM support for detecting the completion of NMIs in the guest
to follow the GHCB specification. When an SEV-ES guest is active, the
IRET instruction will no longer be intercepted. Now, when the NMI Complete
exit event is received, the iret_interception() function will be called
to simulate the completion of the NMI.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <5ea3dd69b8d4396cefdc9048ebc1ab7caa70a847.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The guest FPU state is automatically restored on VMRUN and saved on VMEXIT
by the hardware, so there is no reason to do this in KVM. Eliminate the
allocation of the guest_fpu save area and key off that to skip operations
related to the guest FPU state.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <173e429b4d0d962c6a443c4553ffdaf31b7665a4.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SEV-ES guests do not currently support SMM. Update the has_emulated_msr()
kvm_x86_ops function to take a struct kvm parameter so that the capability
can be reported at a VM level.
Since this op is also called during KVM initialization and before a struct
kvm instance is available, comments will be added to each implementation
of has_emulated_msr() to indicate the kvm parameter can be null.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <75de5138e33b945d2fb17f81ae507bda381808e3.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since many of the registers used by the SEV-ES are encrypted and cannot
be read or written, adjust the __get_sregs() / __set_sregs() to take into
account whether the VMSA/guest state is encrypted.
For __get_sregs(), return the actual value that is in use by the guest
for all registers being tracked using the write trap support.
For __set_sregs(), skip setting of all guest registers values.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <23051868db76400a9b07a2020525483a1e62dbcf.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.
SEV-ES guests introduce new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.
Add support to track the value of the guest CR8 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <5a01033f4c8b3106ca9374b7cadf8e33da852df1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.
SEV-ES guests introduce new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.
Add support to track the value of the guest CR4 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c3880bf2db8693aa26f648528fbc6e967ab46e25.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.
SEV-ES support introduces new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.
Add support to track the value of the guest CR0 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <182c9baf99df7e40ad9617ff90b84542705ef0d7.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of EFER write access is not
recommended. EFER interception occurs prior to EFER being modified and
the hypervisor is unable to modify EFER itself because the register is
located in the encrypted register state.
SEV-ES support introduces a new EFER write trap. This trap provides
intercept support of an EFER write after it has been modified. The new
EFER value is provided in the VMCB EXITINFO1 field, allowing the
hypervisor to track the setting of the guest EFER.
Add support to track the value of the guest EFER value using the EFER
write trap so that the hypervisor understands the guest operating mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <8993149352a3a87cd0625b3b61bfd31ab28977e1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For an SEV-ES guest, string-based port IO is performed to a shared
(un-encrypted) page so that both the hypervisor and guest can read or
write to it and each see the contents.
For string-based port IO operations, invoke SEV-ES specific routines that
can complete the operation using common KVM port IO support.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <9d61daf0ffda496703717218f415cdc8fd487100.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For an SEV-ES guest, MMIO is performed to a shared (un-encrypted) page
so that both the hypervisor and guest can read or write to it and each
see the contents.
The GHCB specification provides software-defined VMGEXIT exit codes to
indicate a request for an MMIO read or an MMIO write. Add support to
recognize the MMIO requests and invoke SEV-ES specific routines that
can complete the MMIO operation. These routines use common KVM support
to complete the MMIO operation.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <af8de55127d5bcc3253d9b6084a0144c12307d4d.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add trace events for entry to and exit from VMGEXIT MSR protocol
processing. The vCPU will be common for the trace events. The MSR
protocol processing is guided by the GHCB GPA in the VMCB, so the GHCB
GPA will represent the input and output values for the entry and exit
events, respectively. Additionally, the exit event will contain the
return code for the event.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c5b3b440c3e0db43ff2fc02813faa94fa54896b0.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add trace events for entry to and exit from VMGEXIT processing. The vCPU
id and the exit reason will be common for the trace events. The exit info
fields will represent the input and output values for the entry and exit
events, respectively.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <25357dca49a38372e8f483753fb0c1c2a70a6898.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The GHCB specification defines a GHCB MSR protocol using the lower
12-bits of the GHCB MSR (in the hypervisor this corresponds to the
GHCB GPA field in the VMCB).
Function 0x100 is a request for termination of the guest. The guest has
encountered some situation for which it has requested to be terminated.
The GHCB MSR value contains the reason for the request.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <f3a1f7850c75b6ea4101e15bbb4a3af1a203f1dc.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The GHCB specification defines a GHCB MSR protocol using the lower
12-bits of the GHCB MSR (in the hypervisor this corresponds to the
GHCB GPA field in the VMCB).
Function 0x004 is a request for CPUID information. Only a single CPUID
result register can be sent per invocation, so the protocol defines the
register that is requested. The GHCB MSR value is set to the CPUID
register value as per the specification via the VMCB GHCB GPA field.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <fd7ee347d3936e484c06e9001e340bf6387092cd.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The GHCB specification defines a GHCB MSR protocol using the lower
12-bits of the GHCB MSR (in the hypervisor this corresponds to the
GHCB GPA field in the VMCB).
Function 0x002 is a request to set the GHCB MSR value to the SEV INFO as
per the specification via the VMCB GHCB GPA field.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c23c163a505290a0d1b9efc4659b838c8c902cbc.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SEV-ES adds a new VMEXIT reason code, VMGEXIT. Initial support for a
VMGEXIT includes mapping the GHCB based on the guest GPA, which is
obtained from a new VMCB field, and then validating the required inputs
for the VMGEXIT exit reason.
Since many of the VMGEXIT exit reasons correspond to existing VMEXIT
reasons, the information from the GHCB is copied into the VMCB control
exit code areas and KVM register areas. The standard exit handlers are
invoked, similar to standard VMEXIT processing. Before restarting the
vCPU, the GHCB is updated with any registers that have been updated by
the hypervisor.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c6a4ed4294a369bd75c44d03bd7ce0f0c3840e50.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is a pre-patch to consolidate some exit handling code into callable
functions. Follow-on patches for SEV-ES exit handling will then be able
to use them from the sev.c file.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <5b8b0ffca8137f3e1e257f83df9f5c881c8a96a3.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a SHUTDOWN VMEXIT is encountered, normally the VMCB is re-initialized
so that the guest can be re-launched. But when a guest is running as an
SEV-ES guest, the VMSA cannot be re-initialized because it has been
encrypted. For now, just return -EINVAL to prevent a possible attempt at
a guest reset.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <aa6506000f6f3a574de8dbcdab0707df844cb00c.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a guest is running as an SEV-ES guest, it is not possible to emulate
instructions. Add support to prevent instruction emulation.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <f6355ea3024fda0a3eb5eb99c6b62dca10d792bd.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since the guest register state of an SEV-ES guest is encrypted, debugging
is not supported. Update the code to prevent guest debugging when the
guest has protected state.
Additionally, an SEV-ES guest must only and always intercept DR7 reads and
writes. Update set_dr_intercepts() and clr_dr_intercepts() to account for
this.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <8db966fa2f9803d6454ce773863025d0e2e7f3cc.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a guest is running under SEV-ES, the hypervisor cannot access the
guest register state. There are numerous places in the KVM code where
certain registers are accessed that are not allowed to be accessed (e.g.
RIP, CR0, etc). Add checks to prevent register accesses and add intercept
update support at various points within the KVM code.
Also, when handling a VMGEXIT, exceptions are passed back through the
GHCB. Since the RDMSR/WRMSR intercepts (may) inject a #GP on error,
update the SVM intercepts to handle this for SEV-ES guests.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[Redo MSR part using the .complete_emulated_msr callback. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This will be used by SEV-ES to inject MSR failure via the GHCB.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Simplify the four functions that handle {kernel,user} {rd,wr}msr, there
is still some repetition between the two instances of rdmsr but the
whole business of calling kvm_inject_gp and kvm_skip_emulated_instruction
can be unified nicely.
Because complete_emulated_wrmsr now becomes essentially a call to
kvm_complete_insn_gp, remove complete_emulated_msr.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no need to inject a #GP from kvm_mtrr_set_msr, kvm_emulate_wrmsr will
handle it.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Simplification and distangling of the MSI related functionality
- Let IO/APIC construct the RTE entries from an MSI message instead of
having IO/APIC specific code in the interrupt remapping drivers
- Make the retrieval of the parent interrupt domain (vector or remap
unit) less hardcoded and use the relevant irqdomain callbacks for
selection.
- Allow the handling of more than 255 CPUs without a virtualized IOMMU
when the hypervisor supports it. This has made been possible by the
above modifications and also simplifies the existing workaround in the
HyperV specific virtual IOMMU.
- Cleanup of the historical timer_works() irq flags related
inconsistencies.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xxd8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYpOD/9C5TppNlPMUyx2SflH6bxt37pJEpln
+hYTKsk+jSThntr5mfj+GifGvgmHOVBTGnlDUnUnrpN7TQmLFBzwTOtnBLW53AO2
16/u0+Xci4LNCtEkaymf0Rq4MfsfriXHPJr0A/CnZ0tpHSf5QKHAiitSiGujdMlb
gbq43+zXd+jNkH7vkOLPX/7dZVI1hNASQEevJu2tRR4xYTuXFdBxvLgYkHtYKKrK
R1sbs6nI6yIzye2u4m4xGu29SxgUft+zdUf+UehJKM3yFmf51d9qpkX+kLaTWuaL
VPsMItbn0kdvxwXQWO6DYnIAAnVKCklyHQJTZCoNq9Fe91OoByak1CEVspSOa1av
JmycNSch4IYWasR4vVCB1gbb+V9SejcKu5SV3CDrEDqwkOIpfiqpriUXSCJTLlFd
QOEDOLuuk/79Qs//J/tb/nJ4IuKv8WPudDfIlMro8wUsAr67DjD4mnXprZ+svwWx
Ct/0/Memk+BSa0cw6pvg24BUZGN6zrufkBu2HKT9GOXRUdNkdLkiPhT8mK4T/O0l
f90QCLjPSOJ/K/pLEWdUHEPmgC5Q9RsXOmwVGqX+RbjfP7mYTJXlmWnBb+cFNch0
xFIH3SxVGylxxT06NX3SkvinrHj10CoAlmneefBlLtx6dF+2P84DAMZSF0OFToVI
c2KMg5zoesI4bg==
=8Gfs
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Thomas Gleixner:
"Yet another large set of x86 interrupt management updates:
- Simplification and distangling of the MSI related functionality
- Let IO/APIC construct the RTE entries from an MSI message instead
of having IO/APIC specific code in the interrupt remapping drivers
- Make the retrieval of the parent interrupt domain (vector or remap
unit) less hardcoded and use the relevant irqdomain callbacks for
selection.
- Allow the handling of more than 255 CPUs without a virtualized
IOMMU when the hypervisor supports it. This has made been possible
by the above modifications and also simplifies the existing
workaround in the HyperV specific virtual IOMMU.
- Cleanup of the historical timer_works() irq flags related
inconsistencies"
* tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
x86/ioapic: Cleanup the timer_works() irqflags mess
iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select()
iommu/amd: Fix IOMMU interrupt generation in X2APIC mode
iommu/amd: Don't register interrupt remapping irqdomain when IR is disabled
iommu/amd: Fix union of bitfields in intcapxt support
x86/ioapic: Correct the PCI/ISA trigger type selection
x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index
x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detected
iommu/hyper-v: Disable IRQ pseudo-remapping if 15 bit APIC IDs are available
x86/apic: Support 15 bits of APIC ID in MSI where available
x86/ioapic: Handle Extended Destination ID field in RTE
iommu/vt-d: Simplify intel_irq_remapping_select()
x86: Kill all traces of irq_remapping_get_irq_domain()
x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain
x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain
iommu/hyper-v: Implement select() method on remapping irqdomain
iommu/vt-d: Implement select() method on remapping irqdomain
iommu/amd: Implement select() method on remapping irqdomain
x86/apic: Add select() method on vector irqdomain
...
- Simplify the FPU protection for !RT kernels
- Add the RT variant of FPU protections
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xxk8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYLWEADC6OnEn/bZO6RMUQvTr0OJGo/G4ZGg
pibxHqBF1Q1vso8nGtYhPXHV7BbYKc1t1g3cvTBrWARSI+0GkRvZuqRUekcm2anR
ERs0zxIVexj0DqVa/08qkH//njG5QLPig2y6EtcOcNWnJiWrf3BayfY+Jelr+pQ0
ZK8yK6X38AwgflIs1RG2gIVocAB16cr4SX6cEqn4+FTY8TW4WFkPeZPYR55eh99s
aTQU+YK/bHul/WmWT+OPbJcfywLy//F+KbHSdjUhOHJO3YW9aAS9xti6wVGlwB0O
j6x82N4pJj1llhOsVZgLVVwX7zUtJlFz4D7w8Bl2/bUDt/EVKS2397uAWNIV5+0g
/8e+xnk9KDsGkD0GqboYUk4SXJiPUJ6GVc3FB1TIFemxWhgKYJ0yfRcGgez41O3P
IOYLoN32UZZvTPhwzX/WZlTSjQ+pTiNWE7Jhnm+HS3ss9qNEeiCYU2zhjIY2/wnB
lhB8W/kAw+UtjJbguwem6NwJB0e6egPn6c+6UL5n1JmMXvmlYe0tEBoZBgqt+30Y
Kz+rhLt5m4VQYTXoGAd1qVQ13eEk0SsbcB0Xl1tU+ix6JuFc4KiFzRR+uj7pm11V
xhCpt7I7Je0jVnnCRc1Sc5gQlEWfVb/FVy8IBlQz6qyUUqHf7tUwnwVmji9IyK4Y
Pc7Peqaoy7Cc3Q==
=bgdQ
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Thomas Gleixner:
- Simplify the FPU protection for !RT kernels
- Add the RT variant of FPU protections
* tag 'x86-fpu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Make kernel FPU protection RT friendly
x86/fpu: Simplify fpregs_[un]lock()
- Consolidate all kmap_atomic() internals into a generic implementation
which builds the base for the kmap_local() API and make the
kmap_atomic() interface wrappers which handle the disabling/enabling of
preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a mapping
is established. It has to disable migration instead to guarantee that
the virtual address of the mapped slot is the same accross preemption.
- Provide better debug facilities: guard pages and enforced utilization
of the mapping mechanics on 64bit systems when the architecture allows
it.
- Provide the new kmap_local() API which can now be used to cleanup the
kmap_atomic() usage sites all over the place. Most of the usage sites
do not require the implicit disabling of preemption and pagefaults so
the penalty on 64bit and 32bit non-highmem systems is removed and quite
some of the code can be simplified. A wholesale conversion is not
possible because some usage depends on the implicit side effects and
some need to be cleaned up because they work around these side effects.
The migrate disable side effect is only effective on highmem systems
and when enforced debugging is enabled. On 64bit and 32bit non-highmem
systems the overhead is completely avoided.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XyQwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUolD/9+R+BX96fGir+I8rG9dc3cbLw5meSi
0I/Nq3PToZMs2Iqv50DsoaPYHHz/M6fcAO9LRIgsE9jRbnY93GnsBM0wU9Y8yQaT
4wUzOG5WHaLDfqIkx/CN9coUl458oEiwOEbn79A2FmPXFzr7IpkufnV3ybGDwzwP
p73bjMJMPPFrsa9ig87YiYfV/5IAZHi82PN8Cq1v4yNzgXRP3Tg6QoAuCO84ZnWF
RYlrfKjcJ2xPdn+RuYyXolPtxr1hJQ0bOUpe4xu/UfeZjxZ7i1wtwLN9kWZe8CKH
+x4Lz8HZZ5QMTQ9sCHOLtKzu2MceMcpISzoQH4/aFQCNMgLn1zLbS790XkYiQCuR
ne9Cua+IqgYfGMG8cq8+bkU9HCNKaXqIBgPEKE/iHYVmqzCOqhW5Cogu4KFekf6V
Wi7pyyUdX2en8BAWpk5NHc8de9cGcc+HXMq2NIcgXjVWvPaqRP6DeITERTZLJOmz
XPxq5oPLGl7wdm7z+ICIaNApy8zuxpzb6sPLNcn7l5OeorViORlUu08AN8587wAj
FiVjp6ZYomg+gyMkiNkDqFOGDH5TMENpOFoB0hNNEyJwwS0xh6CgWuwZcv+N8aPO
HuS/P+tNANbD8ggT4UparXYce7YCtgOf3IG4GA3JJYvYmJ6pU+AZOWRoDScWq4o+
+jlfoJhMbtx5Gg==
=n71I
-----END PGP SIGNATURE-----
Merge tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kmap updates from Thomas Gleixner:
"The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic
implementation which builds the base for the kmap_local() API and
make the kmap_atomic() interface wrappers which handle the
disabling/enabling of preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a
mapping is established. It has to disable migration instead to
guarantee that the virtual address of the mapped slot is the same
across preemption.
- Provide better debug facilities: guard pages and enforced
utilization of the mapping mechanics on 64bit systems when the
architecture allows it.
- Provide the new kmap_local() API which can now be used to cleanup
the kmap_atomic() usage sites all over the place. Most of the usage
sites do not require the implicit disabling of preemption and
pagefaults so the penalty on 64bit and 32bit non-highmem systems is
removed and quite some of the code can be simplified. A wholesale
conversion is not possible because some usage depends on the
implicit side effects and some need to be cleaned up because they
work around these side effects.
The migrate disable side effect is only effective on highmem
systems and when enforced debugging is enabled. On 64bit and 32bit
non-highmem systems the overhead is completely avoided"
* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
ARM: highmem: Fix cache_is_vivt() reference
x86/crashdump/32: Simplify copy_oldmem_page()
io-mapping: Provide iomap_local variant
mm/highmem: Provide kmap_local*
sched: highmem: Store local kmaps in task struct
x86: Support kmap_local() forced debugging
mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
microblaze/mm/highmem: Add dropped #ifdef back
xtensa/mm/highmem: Make generic kmap_atomic() work correctly
mm/highmem: Take kmap_high_get() properly into account
highmem: High implementation details and document API
Documentation/io-mapping: Remove outdated blurb
io-mapping: Cleanup atomic iomap
mm/highmem: Remove the old kmap_atomic cruft
highmem: Get rid of kmap_types.h
xtensa/mm/highmem: Switch to generic kmap atomic
sparc/mm/highmem: Switch to generic kmap atomic
powerpc/mm/highmem: Switch to generic kmap atomic
nds32/mm/highmem: Switch to generic kmap atomic
...
- migrate_disable/enable() support which originates from the RT tree and
is now a prerequisite for the new preemptible kmap_local() API which aims
to replace kmap_atomic().
- A fair amount of topology and NUMA related improvements
- Improvements for the frequency invariant calculations
- Enhanced robustness for the global CPU priority tracking and decision
making
- The usual small fixes and enhancements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2
Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI
/QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq
3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx
sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg
dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa
u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR
Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED
CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj
MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI
e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ
Gsq0rJW5iiu/OQ==
=Oz1V
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- migrate_disable/enable() support which originates from the RT tree
and is now a prerequisite for the new preemptible kmap_local() API
which aims to replace kmap_atomic().
- A fair amount of topology and NUMA related improvements
- Improvements for the frequency invariant calculations
- Enhanced robustness for the global CPU priority tracking and decision
making
- The usual small fixes and enhancements all over the place
* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
sched/fair: Trivial correction of the newidle_balance() comment
sched/fair: Clear SMT siblings after determining the core is not idle
sched: Fix kernel-doc markup
x86: Print ratio freq_max/freq_base used in frequency invariance calculations
x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
x86, sched: Calculate frequency invariance for AMD systems
irq_work: Optimize irq_work_single()
smp: Cleanup smp_call_function*()
irq_work: Cleanup
sched: Limit the amount of NUMA imbalance that can exist at fork time
sched/numa: Allow a floating imbalance between NUMA nodes
sched: Avoid unnecessary calculation of load imbalance at clone time
sched/numa: Rename nr_running and break out the magic number
sched: Make migrate_disable/enable() independent of RT
sched/topology: Condition EAS enablement on FIE support
arm64: Rebuild sched domains on invariance status changes
sched/topology,schedutil: Wrap sched domains rebuild
sched/uclamp: Allow to reset a task uclamp constraint value
sched/core: Fix typos in comments
Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
...
Core:
- Better handling of page table leaves on archictectures which have
architectures have non-pagetable aligned huge/large pages. For such
architectures a leaf can actually be part of a larger entry.
- Prevent a deadlock vs. exec_update_mutex
Architectures:
- The related updates for page size calculation of leaf entries
- The usual churn to support new CPUs
- Small fixes and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XvgATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUrdEACatdr93wv75vnm5tCZM4EsFvB2PzVJ
ck4K4+hHiMVV4802qf+kW5plF+rckAU4TAai/L7wkTntKHvjD/0/o1epoIStb+dS
SCpVkQMCLT/8xT242iHPOfgsQpVpJnIiBwVRjn8HXu82nXdgMJhKnBjTe634UfxW
o2OCFiyJzpRi5l86gVp67ueqgvl34NPI2JaSLc0g80QfZ8akzdePPpED35CzYjZh
41k+7ssvt6qch3vMUySHAhkX4gQl0nc80YAaF/XZbCfvdyY7D03PtfBjfvphTSK0
l54z9aWh0ciK9P1aPfvkHDXBJUR2VtUAx2GiURK+XU3jNk3KMrz9CcBl1D/exIAg
07IsiYVoB38YAUOZoR9K8p+p+5EuwYRRUMAgfQfBALCuaLQV477Cne82b2KmNCus
1izUQvcDDf0s74OyYTHWFXRGla95COJvNLzkrZ1oU3mX4HgdKdOAUbf/2XTLWeKO
3HOIS+jsg5cp82tRe4X5r51h73pONYlo9lLo/CjQXz25vMcXKtE/MZGq2gkRff4p
N4k88eQ5LOsRqUaU46GcHozXRCfcpW7SPI9AaN5I/fKGIZvHP7uMdMb+g5DV8yHI
dNZ8u5uLPHwdg80C3fJ3Pnp7VsVNHliPXMwv0vib7BCp7aUVZWeFnOntw3PdYFRk
XKEbfl36IuAadg==
=rZ99
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
"Core:
- Better handling of page table leaves on archictectures which have
architectures have non-pagetable aligned huge/large pages. For such
architectures a leaf can actually be part of a larger entry.
- Prevent a deadlock vs exec_update_mutex
Architectures:
- The related updates for page size calculation of leaf entries
- The usual churn to support new CPUs
- Small fixes and improvements all over the place"
* tag 'perf-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
perf/x86/intel: Add Tremont Topdown support
uprobes/x86: Fix fall-through warnings for Clang
perf/x86: Fix fall-through warnings for Clang
kprobes/x86: Fix fall-through warnings for Clang
perf/x86/intel/lbr: Fix the return type of get_lbr_cycles()
perf/x86/intel: Fix rtm_abort_event encoding on Ice Lake
x86/kprobes: Restore BTF if the single-stepping is cancelled
perf: Break deadlock involving exec_update_mutex
sparc64/mm: Implement pXX_leaf_size() support
powerpc/8xx: Implement pXX_leaf_size() support
arm64/mm: Implement pXX_leaf_size() support
perf/core: Fix arch_perf_get_page_size()
mm: Introduce pXX_leaf_size()
mm/gup: Provide gup_get_pte() more generic
perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY
perf/x86/intel/uncore: Add Rocket Lake support
perf/x86/msr: Add Rocket Lake CPU support
perf/x86/cstate: Add Rocket Lake CPU support
perf/x86/intel: Add Rocket Lake CPU support
perf,mm: Handle non-page-table-aligned hugetlbfs
...
RCU:
- Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs.
- Lockdep-RCU updates reducing the need for __maybe_unused.
- Tasks-RCU updates.
- Miscellaneous fixes.
- Documentation updates.
- Torture-test updates.
KCSAN:
- updates for selftests, avoiding setting watchpoints on NULL pointers
- fix to watchpoint encoding
LKMM:
- updates for documentation along with some updates to example-code
litmus tests
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xon4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobXUD/92LJTI/TMgK6Z6EEQBiJZO/2mNKjK8
FEKc6AqTNMlZNsWCfQ5UgqtHpn+MkBZsX1x4u22gehE1qaCB8gnQ5wXgbXon8tQm
exxVk6vvQZjseeqCMqrsUYQlD7dNgHnf1qAmWXJvji4sA/1Opo6n2M74tqfE2ueV
S5hpQwSuK/6Zu2Hrr62HD8+Fx0in6ZuKRZxHGp1392l++DGbniJM3dzntRXB+JbZ
w3PDHFCQuGzTytyeKuQV48ot9IK+2YzmjIp/+4tHL6mvU38xeSu6gcYtqKPcfYWw
D6HXvDa965h5IrFdSA2JWSzjJ+VYgZVElk2HyXDNIae0fM/8GidgoIDQipT1WAur
sxW/Ke4U6Jm5MMqXqV8iMNduktkGD1/h6G/iB1Yis29xFdthorNpbHVAP+8cKXgf
1cR6RorOuBYv6XpyzygHtE7qfLY5ST352pJ4+UqNzboujOcuEnGaygttt0F/F8sA
ZH8NT5dyUfbGeqepdZWkbj116Hjeg3fyV3CZeyBhDeqpjf1Nn3nbJ1xRksPLfa3i
IKvN7HSzEg+vKnsJNnQeFlAmQ/W3n2bedzRqfaCg77pNhKI6jPuavY5f2YGFUj0y
yx0UzOYoI1Cln0keBMmynbyUKgJ7zstLkrt/JenjhtD3B+0df5BmYjkL+nqkP6ax
+XTCu7Xg+B061g==
=N/iO
-----END PGP SIGNATURE-----
Merge tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Thomas Gleixner:
"RCU, LKMM and KCSAN updates collected by Paul McKenney.
RCU:
- Avoid cpuinfo-induced IPI pileups and idle-CPU IPIs
- Lockdep-RCU updates reducing the need for __maybe_unused
- Tasks-RCU updates
- Miscellaneous fixes
- Documentation updates
- Torture-test updates
KCSAN:
- updates for selftests, avoiding setting watchpoints on NULL pointers
- fix to watchpoint encoding
LKMM:
- updates for documentation along with some updates to example-code
litmus tests"
* tag 'core-rcu-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
srcu: Take early exit on memory-allocation failure
rcu/tree: Defer kvfree_rcu() allocation to a clean context
rcu: Do not report strict GPs for outgoing CPUs
rcu: Fix a typo in rcu_blocking_is_gp() header comment
rcu: Prevent lockdep-RCU splats on lock acquisition/release
rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs
rcu,ftrace: Fix ftrace recursion
rcu/tree: Make struct kernel_param_ops definitions const
rcu/tree: Add a warning if CPU being onlined did not report QS already
rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config
rcu: Fix single-CPU check in rcu_blocking_is_gp()
rcu: Implement rcu_segcblist_is_offloaded() config dependent
list.h: Update comment to explicitly note circular lists
rcu: Panic after fixed number of stalls
x86/smpboot: Move rcu_cpu_starting() earlier
rcu: Allow rcu_irq_enter_check_tick() from NMI
tools/memory-model: Label MP tests' producers and consumers
tools/memory-model: Use "buf" and "flag" for message-passing tests
tools/memory-model: Add types to litmus tests
tools/memory-model: Add a glossary of LKMM terms
...
- More generalization of entry/exit functionality
- The consolidation work to reclaim TIF flags on x86 and also for non-x86
specific TIF flags which are solely relevant for syscall related work
and have been moved into their own storage space. The x86 specific part
had to be merged in to avoid a major conflict.
- The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
delivery mode of task work and results in an impressive performance
improvement for io_uring. The non-x86 consolidation of this is going to
come seperate via Jens.
- The selective syscall redirection facility which provides a clean and
efficient way to support the non-Linux syscalls of WINE by catching them
at syscall entry and redirecting them to the user space emulation. This
can be utilized for other purposes as well and has been designed
carefully to avoid overhead for the regular fastpath. This includes the
core changes and the x86 support code.
- Simplification of the context tracking entry/exit handling for the users
of the generic entry code which guarantee the proper ordering and
protection.
- Preparatory changes to make the generic entry code accomodate S390
specific requirements which are mostly related to their syscall restart
mechanism.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XoPoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoe0tD/4jSKHIogVM9kVpiYfwjDGS1NluaBXn
71ZoASbX9GZebyGandMyF2QP1iJ24ZO0RztBwHEVH6fyomKB2iFNedssCpO9yfWV
3eFRpOvMpbszY2W2bd0QG3GrqaTttjVfB4ahkGLzqeSbchdob6hZpNDYtBZnujA6
GSnrrurfJkCGoQny+yJQYdQJXQU+BIX90B2a2Q+jW123Luy/iHXC1f/krZSA1m14
fC9xYLSUjPphTzh2ZOW+C3DgdjOL5PfAm/6F+DArt4GtLgrEGD7R74aLSFhvetky
dn5QtG+yAsz1i0cc5Wu/JBcT9tOkY92rPYSyLI9bYQUSQ/bMyuprz6oYKj3dubsu
ZSsKPdkNFPIniL4fLdCMWZcIXX5xgnrxKjdgXZXW3gtrcxSns8w8uED3Sh7dgE08
pgIeq67E5g/OB8kJXH1VxdewmeQb9cOmnzzHwNO7TrrGbBKjDTYHNdYOKf1dUTTK
ZX1UjLfGwxTkMYAbQD1k0JGZ2OLRshzSaH5BW/ZKa3bvJW6yYOq+/YT8B8hbJ8U3
vThlO75/55IJxS5r5Y3vZd/IHdsYbPuETD+TA8tNYtPqNZasW8nnk4TYctWqzDuO
/Ka1wvWYid3c6ySznQn4zSyRjr968AfHeZ9YTUMhWufy5waXVmdBMG41u3IKfsVt
osyzNc4EK19/Mg==
=hsjV
-----END PGP SIGNATURE-----
Merge tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry/exit updates from Thomas Gleixner:
"A set of updates for entry/exit handling:
- More generalization of entry/exit functionality
- The consolidation work to reclaim TIF flags on x86 and also for
non-x86 specific TIF flags which are solely relevant for syscall
related work and have been moved into their own storage space. The
x86 specific part had to be merged in to avoid a major conflict.
- The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
delivery mode of task work and results in an impressive performance
improvement for io_uring. The non-x86 consolidation of this is
going to come seperate via Jens.
- The selective syscall redirection facility which provides a clean
and efficient way to support the non-Linux syscalls of WINE by
catching them at syscall entry and redirecting them to the user
space emulation. This can be utilized for other purposes as well
and has been designed carefully to avoid overhead for the regular
fastpath. This includes the core changes and the x86 support code.
- Simplification of the context tracking entry/exit handling for the
users of the generic entry code which guarantee the proper ordering
and protection.
- Preparatory changes to make the generic entry code accomodate S390
specific requirements which are mostly related to their syscall
restart mechanism"
* tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
entry: Add syscall_exit_to_user_mode_work()
entry: Add exit_to_user_mode() wrapper
entry_Add_enter_from_user_mode_wrapper
entry: Rename exit_to_user_mode()
entry: Rename enter_from_user_mode()
docs: Document Syscall User Dispatch
selftests: Add benchmark for syscall user dispatch
selftests: Add kselftest for syscall user dispatch
entry: Support Syscall User Dispatch on common syscall entry
kernel: Implement selective syscall userspace redirection
signal: Expose SYS_USER_DISPATCH si_code type
x86: vdso: Expose sigreturn address on vdso to the kernel
MAINTAINERS: Add entry for common entry code
entry: Fix boot for !CONFIG_GENERIC_ENTRY
x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK
context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
sched: Detect call to schedule from critical entry code
context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK
context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK
x86: Reclaim unused x86 TI flags
...
- Expose tag address bits in siginfo. The original arm64 ABI did not
expose any of the bits 63:56 of a tagged address in siginfo. In the
presence of user ASAN or MTE, this information may be useful. The
implementation is generic to other architectures supporting tags (like
SPARC ADI, subject to wiring up the arch code). The user will have to
opt in via sigaction(SA_EXPOSE_TAGBITS) so that the extra bits, if
available, become visible in si_addr.
- Default to 32-bit wide ZONE_DMA. Previously, ZONE_DMA was set to the
lowest 1GB to cope with the Raspberry Pi 4 limitations, to the
detriment of other platforms. With these changes, the kernel scans the
Device Tree dma-ranges and the ACPI IORT information before deciding
on a smaller ZONE_DMA.
- Strengthen READ_ONCE() to acquire when CONFIG_LTO=y. When building
with LTO, there is an increased risk of the compiler converting an
address dependency headed by a READ_ONCE() invocation into a control
dependency and consequently allowing for harmful reordering by the
CPU.
- Add CPPC FFH support using arm64 AMU counters.
- set_fs() removal on arm64. This renders the User Access Override (UAO)
ARMv8 feature unnecessary.
- Perf updates: PMU driver for the ARM DMC-620 memory controller, sysfs
identifier file for SMMUv3, stop event counters support for i.MX8MP,
enable the perf events-based hard lockup detector.
- Reorganise the kernel VA space slightly so that 52-bit VA
configurations can use more virtual address space.
- Improve the robustness of the arm64 memory offline event notifier.
- Pad the Image header to 64K following the EFI header definition
updated recently to increase the section alignment to 64K.
- Support CONFIG_CMDLINE_EXTEND on arm64.
- Do not use tagged PC in the kernel (TCR_EL1.TBID1==1), freeing up 8
bits for PtrAuth.
- Switch to vmapped shadow call stacks.
- Miscellaneous clean-ups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl/XcSgACgkQa9axLQDI
XvGkwg//SLknimELD/cphf2UzZm5RFuCU0x1UnIXs9XYo5BrOpgVLLA//+XkCrKN
0GLAdtBDfw1axWJudzgMBiHrv6wSGh4p3YWjLIW06u/PJu3m3U8oiiolvvF8d7Yq
UKDseKGQnQkrl97J0SyA+Da/u8D11GEzp52SWL5iRxzt6vInEC27iTOp9n1yoaoP
f3y7qdp9kv831ryUM3rXFYpc8YuMWXk+JpBSNaxqmjlvjMzipA5PhzBLmNzfc657
XcrRX5qsgjEeJW8UUnWUVNB42j7tVzN77yraoUpoVVCzZZeWOQxqq5EscKPfIhRt
AjtSIQNOs95ZVE0SFCTjXnUUb823coUs4dMCdftqlE62JNRwdR+3bkfa+QjPTg1F
O9ohW1AzX0/JB19QBxMaOgbheB8GFXh3DVJ6pizTgxJgyPvQQtFuEhT1kq8Cst0U
Pe+pEWsg9t41bUXNz+/l9tUWKWpeCfFNMTrBXLmXrNlTLeOvDh/0UiF0+2lYJYgf
YAboibQ5eOv2wGCcSDEbNMJ6B2/6GtubDJxH4du680F6Emb6pCSw0ntPwB7mSGLG
5dXz+9FJxDLjmxw7BXxQgc5MoYIrt5JQtaOQ6UxU8dPy53/+py4Ck6tXNkz0+Ap7
gPPaGGy1GqobQFu3qlHtOK1VleQi/sWcrpmPHrpiiFUf6N7EmcY=
=zXFk
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- Expose tag address bits in siginfo. The original arm64 ABI did not
expose any of the bits 63:56 of a tagged address in siginfo. In the
presence of user ASAN or MTE, this information may be useful. The
implementation is generic to other architectures supporting tags
(like SPARC ADI, subject to wiring up the arch code). The user will
have to opt in via sigaction(SA_EXPOSE_TAGBITS) so that the extra
bits, if available, become visible in si_addr.
- Default to 32-bit wide ZONE_DMA. Previously, ZONE_DMA was set to the
lowest 1GB to cope with the Raspberry Pi 4 limitations, to the
detriment of other platforms. With these changes, the kernel scans
the Device Tree dma-ranges and the ACPI IORT information before
deciding on a smaller ZONE_DMA.
- Strengthen READ_ONCE() to acquire when CONFIG_LTO=y. When building
with LTO, there is an increased risk of the compiler converting an
address dependency headed by a READ_ONCE() invocation into a control
dependency and consequently allowing for harmful reordering by the
CPU.
- Add CPPC FFH support using arm64 AMU counters.
- set_fs() removal on arm64. This renders the User Access Override
(UAO) ARMv8 feature unnecessary.
- Perf updates: PMU driver for the ARM DMC-620 memory controller, sysfs
identifier file for SMMUv3, stop event counters support for i.MX8MP,
enable the perf events-based hard lockup detector.
- Reorganise the kernel VA space slightly so that 52-bit VA
configurations can use more virtual address space.
- Improve the robustness of the arm64 memory offline event notifier.
- Pad the Image header to 64K following the EFI header definition
updated recently to increase the section alignment to 64K.
- Support CONFIG_CMDLINE_EXTEND on arm64.
- Do not use tagged PC in the kernel (TCR_EL1.TBID1==1), freeing up 8
bits for PtrAuth.
- Switch to vmapped shadow call stacks.
- Miscellaneous clean-ups.
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (78 commits)
perf/imx_ddr: Add system PMU identifier for userspace
bindings: perf: imx-ddr: add compatible string
arm64: Fix build failure when HARDLOCKUP_DETECTOR_PERF is enabled
arm64: mte: fix prctl(PR_GET_TAGGED_ADDR_CTRL) if TCF0=NONE
arm64: mark __system_matches_cap as __maybe_unused
arm64: uaccess: remove vestigal UAO support
arm64: uaccess: remove redundant PAN toggling
arm64: uaccess: remove addr_limit_user_check()
arm64: uaccess: remove set_fs()
arm64: uaccess cleanup macro naming
arm64: uaccess: split user/kernel routines
arm64: uaccess: refactor __{get,put}_user
arm64: uaccess: simplify __copy_user_flushcache()
arm64: uaccess: rename privileged uaccess routines
arm64: sdei: explicitly simulate PAN/UAO entry
arm64: sdei: move uaccess logic to arch/arm64/
arm64: head.S: always initialize PSTATE
arm64: head.S: cleanup SCTLR_ELx initialization
arm64: head.S: rename el2_setup -> init_kernel_el
arm64: add C wrappers for SET_PSTATE_*()
...
(Arvind Sankar)
- Remove -m16 workaround now that the GCC versions that need it are unsupported
(Nick Desaulniers)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XhwYACgkQEsHwGGHe
VUqYPQ/9ECDO6Ms8rkvZf4f7+LfKO8XsRYf0h411h5zuVssgPybIbZCvuiQ5Jj9r
IzTe0pO0AoVoX5/JE+mxbvroBb5hRokNehJayPmdVjZSrzsYWZR3RVW6iSRo/TQ0
bjzY1cHC4EPmtyYA4FDr+b0JAEEbyPuPOAcqSzBBadsx5FURvwH7s1hUILCZ1nmy
i8DADZDo7Puk1W8VQkGkjUeo3QVY8KG2VzZegU0iUR/M7Hoek3WBPObYFg0bsKN4
6hvRWPMt2lFXOSNZOFhG+LjtWXTJ22g7ENU4SHXhHBdxmM8H/OZvh/3UZTes7rdo
u/LZJ580GfOa3TYHOToGQq9iBxoi9Se3vLXUtt2I9Z7sbBVUGTKx2Llezvt9/QRh
kJkpvMc2r2b9n7Cu52gHKQ89OrqFzs2scNyjQ2KRU61bPwLjWoGiobugnl8T9q8Y
Li8LKS2tCD6yo7L1eYt0NDriWVF3/Tv1rebWMNIf3sGSHFkxgHxFviTx3L3SWEBS
8JMuKLPh20WC9ombXP/tDN2rtTmmL9hEoqZ9uSTG1tRNLOLLNv790ucUh+lp6RFp
7y8dzl+L3hXsFdcCYooOIGO1phmdThwSLMqZHmbpRWpKNr+8P0wCqZP16twnoXNC
kO7OiNd6XHVpJg4WnN3C3X70n7TPXSZ7dkpDNFBJaH2QilNEHa8=
=Qk/S
-----END PGP SIGNATURE-----
Merge tag 'x86_build_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Borislav Petkov:
"Two x86 build fixes:
- Fix the vmlinux size check on 64-bit along with adding useful
clarifications on the topic (Arvind Sankar)
- Remove -m16 workaround now that the GCC versions that need it are
unsupported (Nick Desaulniers)"
* tag 'x86_build_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Remove -m16 workaround for unsupported versions of GCC
x86/build: Fix vmlinux size check on 64-bit
(Fenghua Yu)
- Cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XhTUACgkQEsHwGGHe
VUpTkA/9FNOaJohBa9XO2bv3RjsqE4/2PqZS434HJL6yrbbRDFyscSp2sNCDlfh/
6zj9R72HpRf6xLW8CTeszrfUQ0z3CFnfwz8EAolhv5DOiJnM3wSS5inmLhtyTMgw
mJ8qVXXELdAFm1R8rAQLVmA3FE9aV6u19POstKeXMUykCyaxKDhNLJgn3CiXpegU
AvG3QWmrR/1x4DjQguMNwXMQiYVDkRdnfJa5SOzCTwyUj0D0an+kPmSHEMvhaOL6
tUYfjLkZYdB5THG8eUM833EJmgRe7X1VtTQIydcVyt/6tbL50FmVYUpWk+SXuSnJ
/uBdzx0IXmfvYKgGSFw+FGOyGH8u02nR4621rECNLNf1h8YknzpN7Ri1hB83GjQe
PMiW/gCwxM6gfRsBpJ17xnAbmR5Az2dOz71uj5k13L1GNL8VZD2DZz//a4TDarzE
4R8i3PQ/oUMgmL+ARpVWwycc/dhB8o+1glmAjOwWHO/kM1k/hx9Ou5bbLlhffbL5
zkk6ORrNfKzAMvE1flkjim5XgNHY8n/L4CwBKXJFQ3IYbXBGg0CKxBHWvTpIz1KO
O4h52OnYUpM15sY22NSvpdRiHcsgEqNOB4lb6K2dby9iE5b5RHpNgfMNO6+eOtag
dgCWopamLK2+MFeNe48+1v/3bDveskhvJon91GUZYmfkFzhZTN8=
=9Lsd
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache resource control updates from Borislav Petkov:
- add logic to correct MBM total and local values fixing errata SKX99
and BDF102 (Fenghua Yu)
- cleanups
* tag 'x86_cache_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Clean up unused function parameter in rmdir path
x86/resctrl: Constify kernfs_ops
x86/resctrl: Correct MBM total and local values
Documentation/x86: Rename resctrl_ui.rst and add two errata to the file
(Gabriel Krisman Bertazi)
- All kinds of minor cleanups all over the tree.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XgtoACgkQEsHwGGHe
VUqGuA/9GqN2zNQdhgRvAQ+FLZiOYK9MfXcoayfMq8T61VRPDBWaQRfVYKmfmEjS
0l5OnYgZQ9n6vzqFy6pmgc/ix8Jr553dZp5NCamcOqjCTcuO/LwRRh+ZBeFSBTPi
r2qFYKKRYvM7nbyUMm4WqvAakxJ18xsjNbIslr9Aqe8WtHBKKX3MOu8SOpFtGyXz
aEc4rhsS45iZa5gTXhvOn73tr3yHGWU1rzyyAAAmDGTgAxRwsTna8v16C4+v+Bua
Zg18Wiutj8ZjtFpzKJtGWGZoSBap3Jw2Ys64g42MBQUE56KY/99tQVo/SvbYvvlf
PHWLH0f3rPNJ6J2qeKwhtNzPlEAH/6e416A1/6TVwsK+8pdfGmkfaQh2iDHLhJ5i
CSwF61H44ZaE3pc1tHHbC5ALvydPlup7D4MKgztfq0mZ3OoV2Vg7dtyyr+Ybz72b
G+Kl/tmyacQTXo0FiYbZKETo3/VfTdBXGyVax1rHkx3pt8zvhFg3kxb1TT/l/CoM
eSTx53PtTdVtbGOq1CjnUm0FKlbh4+kLoNuo9DYKeXUQBs8PWOCZmL3wXmm4cqlZ
mDZVWvll7CjToY8izzcE/AG279cWkgcL5Tcg7W7CR66+egfDdpuqOZ4tv4TyzoWq
0J7WeNj+TAo98b7RA0Ux8LOlszRxS2ykuI6uB2MgwCaRMbbaQao=
=lLiH
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
"Another branch with a nicely negative diffstat, just the way I
like 'em:
- Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in
the end (Gabriel Krisman Bertazi)
- All kinds of minor cleanups all over the tree"
* tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/ia32_signal: Propagate __user annotation properly
x86/alternative: Update text_poke_bp() kernel-doc comment
x86/PCI: Make a kernel-doc comment a normal one
x86/asm: Drop unused RDPID macro
x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg
x86/head64: Remove duplicate include
x86/mm: Declare 'start' variable where it is used
x86/head/64: Remove unused GET_CR2_INTO() macro
x86/boot: Remove unused finalize_identity_maps()
x86/uaccess: Document copy_from_user_nmi()
x86/dumpstack: Make show_trace_log_lvl() static
x86/mtrr: Fix a kernel-doc markup
x86/setup: Remove unused MCA variables
x86, libnvdimm/test: Remove COPY_MC_TEST
x86: Reclaim TIF_IA32 and TIF_X32
x86/mm: Convert mmu context ia32_compat into a proper flags field
x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages()
elf: Expose ELF header on arch_setup_additional_pages()
x86/elf: Use e_machine to select start_thread for x32
elf: Expose ELF header in compat_start_thread()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XUC8ACgkQEsHwGGHe
VUoqig//YbNptsXveS0Kir4FUuWpLINbMGGfCekxlUItBzYFH3khqOcE4aiOfRB8
TOx2QIFySiqJuTF4tnmwSs0QidHVchjVdD7+7f5jU2G8b/qzvU94+FdVQJJkTQa9
pfOCxRr63ZjUdobPzSuEvZx9Wqr5kA1UpOsITxxx+rGR1M8HEA5dUYhSYgeoqFVI
h7GgLSAfLkyeguTt2uHF8WOw0jgzkabw5CdiLAc/HwUh8Sde3ZdNCJizH5U636BF
N+sNkQNjuo1t0lFAh9bjwTYq/0wy9ojqBYcXDDd2s1s62TuAOZ10u8dAW8+CeZuJ
iC+7/j3QMulbrUeeOv48SkUkv+cMn+gcjcxFoHSEehVue9IciaoHasDACrB+0zBy
pEZrFlyRGZyAjaXoiPQpsQ63/10W1zs+nFfP+IjM98O/I+ZawFfrjOI6x6qyWY3r
DhskIaZ/Ha5nqzka1f502Xml54vBaeETQ7D8tQwco9wo3KyUPKXRLgHLiSiaUk/x
FW/1hVC13mimn1n4eCB2sqwOKEFi8CfpLAbATrxmU8pRjIpcYK2B3mRWZBYUutEM
uW1PdnnFI2680vKFrkWfWnqxSb4yLv/v5K4th5+8HzUyODWEq4hgSOdwZgxwEH9Y
dQcfkiDbZRKEIoTiXDz4FFAlHeGMkldAT8YGYCBwQxX1DoCUvCg=
=1LFY
-----END PGP SIGNATURE-----
Merge tag 'x86_mm_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm update from Borislav Petkov:
"A single improvement to check ident_pud_init()'s return value (Arvind
Sankar)"
* tag 'x86_mm_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/ident_map: Check for errors from ident_pud_init()
an attempt to have userspace tools not poke at naked MSRs. This round
deals with MSR_IA32_ENERGY_PERF_BIAS and removes direct poking into it
by our in-tree tools in favor of the proper "energy_perf_bias" sysfs
interface which we already have.
In addition, the msr.ko write filtering's error message points to a new
summary page which contains the info we collected from helpful reporters
about which userspace tools write MSRs:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about
along with the current status of their conversion.
Rest is the usual small fixes and improvements.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XVKYACgkQEsHwGGHe
VUondg//fv3aQM3KtWE7sxv6BjpiUNozPBELRuKo+EskHSxHudRhBxzdSMM7WgKq
2uojb2CQtzRzYhHuiXjXKfbB7Ci/Jo4EDCJW2otpiqit7/UgXu15Q5ypCUMIteiV
u9A2w3oN3GPR5TuofLWCffaotVMpFok3u7jX7RxEQPWmZqJItTwZpqYLeyniHaKM
c6taAxZVyV13iejRhxim2zkl/hMXpjA8I+8CqWIL25J7GYlYeWLWxWYmHIQTs0NM
zSIyr47RD8RRXVeRdeJMxnQblKE1zrObIV1fUXXu1dSW47DkrrcOQwEMorNjPtPA
FR5Xhi+TX8JrBasMpwCnV/CTj6Ua8UsMfwQcPOFnXALPj87HfFSypa5BpnBH5xTW
PaiatRmiNJm3g79ncaTvXCksMbb4WANqOYK+gsGYvtKbfLR+caWT6vytjZA6sC6x
laynstV9PFUyewdwjjAjilhArzV+y+5RsRudBK8xSjcawbyV4ZEorNKYS9qrhm+y
7CAM9A8fCQiO6POr6W7HcfmkUOHC9PLhtyjdJH89tAmaf+sfvaczzx3awwSuKx7P
0rJlDiJP1v7yEpOMWHbpGIqjMBaWK4y3mb4g3UwFpHpo8cTl+WXZQppOPIBn9GA9
ASLYT/ze7zk1Ua2V88qoXiC5AEvqBnSq4fp2pmf06ROZgBnYT6o=
=ISyk
-----END PGP SIGNATURE-----
Merge tag 'x86_misc_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
"The main part of this branch is the ongoing fight against windmills in
an attempt to have userspace tools not poke at naked MSRs.
This round deals with MSR_IA32_ENERGY_PERF_BIAS and removes direct
poking into it by our in-tree tools in favor of the proper
"energy_perf_bias" sysfs interface which we already have.
In addition, the msr.ko write filtering's error message points to a
new summary page which contains the info we collected from helpful
reporters about which userspace tools write MSRs:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about
along with the current status of their conversion.
The rest is the usual small fixes and improvements"
* tag 'x86_misc_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/msr: Add a pointer to an URL which contains further details
x86/pci: Fix the function type for check_reserved_t
selftests/x86: Add missing .note.GNU-stack sections
selftests/x86/fsgsbase: Fix GS == 1, 2, and 3 tests
x86/msr: Downgrade unrecognized MSR message
x86/msr: Do not allow writes to MSR_IA32_ENERGY_PERF_BIAS
tools/power/x86_energy_perf_policy: Read energy_perf_bias from sysfs
tools/power/turbostat: Read energy_perf_bias from sysfs
tools/power/cpupower: Read energy_perf_bias from sysfs
MAINTAINERS: Cleanup SGI-related entries
(Justin Ernst and Mike Travis)
- The usual set of small fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XWLkACgkQEsHwGGHe
VUpgoQ//WnhLoyByrxyKScOLW2aCCX4sIQ2R0Mjqt4z6wHHL6OB1GIswtIWsFlpw
UtFOhTmKV0A5q/yXIGYY1EOZPhA4wrrTesYHsOoVcJ3xdwrAarc3fzw02uKT/uQp
sW0JxTtJ5iK0XZp+K81344sQy26iuTiLh0Lwg9lJ5mpUotW+1iSOp7LAgcN462r9
3l8To6qWyWnnMWgJlV7jdcsswk0V2TkT/iipQITNqEGtwqRm/uFGRgK/669IbqP/
P/qROhU1XseHxVFfPMXfFVeFjNTpkcgD11C8YeqWiHXrmBesCHIC9cd2pYfJepLJ
F4sIh6zqt6v6z5UOD+yo8IjdmL0v9nPLJiCeiu9ES59pVIx7DsR+p005fBLTAXp8
QpbiAhJn4ajpu78QAkkhieAvldXP2+1h42wK8HUTWKt2wPsnyGozvY71kp8UkMSd
1Kh3mahrAN57TfvlewrMr/ynhGMMuWQHIr3ScYei0L6vF2c6f98HSiNzzZ1PiiY7
6OwR+qISucV8ZeXMv6va5POI/T67UZjjuT5Zw0ls1LpqJ6CkGmcrCeq6qYbxK7Fy
Vd8T/v51FknrTF0J42GMLb9ICIFmC0yAvzflTytFvWqgGEZpEEpRzmKfeC+a0G7B
m3XEm/xE3+3TKnID+2tPSFOEU4mdwKC0+lguD7DV8aTYpdjsxXs=
=VZBo
-----END PGP SIGNATURE-----
Merge tag 'x86_platform_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Borislav Petkov:
- add a new uv_sysfs driver and expose read-only information from UV
BIOS (Justin Ernst and Mike Travis)
- the usual set of small fixes
* tag 'x86_platform_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Update sysfs documentation
x86/platform/uv: Add deprecated messages to /proc info leaves
x86/platform/uv: Add sysfs hubless leaves
x86/platform/uv: Add sysfs leaves to replace those in procfs
x86/platform/uv: Add kernel interfaces for obtaining system info
x86/platform/uv: Make uv_pcibus_kset and uv_hubs_kset static
x86/platform/uv: Fix an error code in uv_hubs_init()
x86/platform/uv: Update MAINTAINERS for uv_sysfs driver
x86/platform/uv: Update ABI documentation of /sys/firmware/sgi_uv/
x86/platform/uv: Add new uv_sysfs platform driver
x86/platform/uv: Add and export uv_bios_* functions
x86/platform/uv: Remove existing /sys/firmware/sgi_uv/ interface
code to use it (Yazen Ghannam)
- Remove a dead and unused TSEG region remapping workaround on AMD (Arvind Sankar)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XVlYACgkQEsHwGGHe
VUpxTA/9F0KsgSyTh66uX+aX5qkQ3WTBVgxbXGFrn5qPvwcALXabU8qObDWTSdwS
1YbiWDjKNBJX+dggWe/fcQgUZxu5DFkM4IKEW1V7MLJEcdfylcqCyc1YNpEI4ySn
ebw2Sy4/5iXGAvhz802/WoUU/o3A2uZwe0RFyodHGxof5027HkZhRHeYB27Htw+l
z0IsmiYOoPl/4mNuVgr/qieIFSw1SUE9kwjU8RvM6xVWmXWXpM68JHa9s+/51pFt
6BaOz485OyzWUCtSx3/++GEkU2d53bWYOuQ1zTLEiuaBfYC5n5T/kAcT4WJNK6Tf
tX7yrzmWm9ecykIxfkgMrhG57G38y2GMJcEg+dFQHeXC062fdHDg+oY6Ql2EkAm5
t5RIQ/cyOmQCLns31rHI/kwQ3RMKc/lfnL/z8lrlfWsC5o755yFJKttbfLJugbTo
3BO1fbs4xgQcgi0KoqXOUETrQtsOLtr9FJwvcArB94XXqcIPClE8Ir7n8T7FCuLr
9litSXIdn46EHwD6hD5QIk7y+Rxwk/jxZFys3eh90jcWDDZTaG2lz3if33RbZ1go
XBrS5X3HsMODGZlaMeUjrbFIz3e0Zyoo+RO/TX48w8nzivC6xSNxSNFgIZ1XTF5E
SLMGa6lEQ9mLiqRfgFjynNwSYOSlGv3euMkZaVPS3hnNmn+vZbI=
=RsCs
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
"Only AMD-specific changes this time:
- Save the AMD physical die ID into cpuinfo_x86.cpu_die_id and
convert all code to use it (Yazen Ghannam)
- Remove a dead and unused TSEG region remapping workaround on AMD
(Arvind Sankar)"
* tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Remove dead code for TSEG region remapping
x86/topology: Set cpu_die_id only if DIE_TYPE found
EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId
x86/CPU/AMD: Remove amd_get_nb_id()
x86/CPU/AMD: Save AMD NodeId as cpu_die_id
applications to populate protected regions of user code and data called
enclaves. Once activated, the new hardware protects enclave code and
data from outside access and modification.
Enclaves provide a place to store secrets and process data with those
secrets. SGX has been used, for example, to decrypt video without
exposing the decryption keys to nosy debuggers that might be used to
subvert DRM. Software has generally been rewritten specifically to
run in enclaves, but there are also projects that try to run limited
unmodified software in enclaves."
Most of the functionality is concentrated into arch/x86/kernel/cpu/sgx/
except the addition of a new mprotect() hook to control enclave page
permissions and support for vDSO exceptions fixup which will is used by
SGX enclaves.
All this work by Sean Christopherson, Jarkko Sakkinen and many others.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XTtMACgkQEsHwGGHe
VUqxFw/+NZGf2b3CWPcrvwXCpkvSpIrqh1jQwyvkZyJ1gen7Vy8dkvf99h8+zQPI
4wSArEyjhYJKAAmBNefLKi/Cs/bdkGzLlZyDGqtM641XRjf0xXIpQkOBb6UBa+Pv
to8veQmVH2bBTM49qnd+H1wM6FzYvhTYCD8xr4HlLXtIfpP2CK2GvCb8s/4LifgD
fTucZX9TFwLgVkWOHWHN0n8XMR2Fjb2YCrwjFMKyr/M2W+pPoOCTIt4PWDuXiOeG
rFP7R4DT9jDg8ht5j2dHQT/Bo8TvTCB4Oj98MrX1TTgkSjLJySSMfyQg5EwNfSIa
HC0lg/6qwAxnhWX7cCCBETNZ4aYDmz/dxcCSsLbomGP9nMaUgUy7qn5nNuNbJilb
oCBsr8LDMzu1LJzmkduM8Uw6OINh+J8ICoVXaR5pS7gSZz/+vqIP/rK691AiqhJL
QeMkI9gQ83jEXpr/AV7ABCjGCAeqELOkgravUyTDev24eEc0LyU0qENpgxqWSTca
OvwSWSwNuhCKd2IyKZBnOmjXGwvncwX0gp1KxL9WuLkR6O8XldLAYmVCwVAOrIh7
snRot8+3qNjELa65Nh5DapwLJrU24TRoKLHLgfWK8dlqrMejNtXKucQ574Np0feR
p2hrNisOrtCwxAt7OAgWygw8agN6cJiY18onIsr4wSBm5H7Syb0=
=k7tj
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGC support from Borislav Petkov:
"Intel Software Guard eXtensions enablement. This has been long in the
making, we were one revision number short of 42. :)
Intel SGX is new hardware functionality that can be used by
applications to populate protected regions of user code and data
called enclaves. Once activated, the new hardware protects enclave
code and data from outside access and modification.
Enclaves provide a place to store secrets and process data with those
secrets. SGX has been used, for example, to decrypt video without
exposing the decryption keys to nosy debuggers that might be used to
subvert DRM. Software has generally been rewritten specifically to run
in enclaves, but there are also projects that try to run limited
unmodified software in enclaves.
Most of the functionality is concentrated into arch/x86/kernel/cpu/sgx/
except the addition of a new mprotect() hook to control enclave page
permissions and support for vDSO exceptions fixup which will is used
by SGX enclaves.
All this work by Sean Christopherson, Jarkko Sakkinen and many others"
* tag 'x86_sgx_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86/sgx: Return -EINVAL on a zero length buffer in sgx_ioc_enclave_add_pages()
x86/sgx: Fix a typo in kernel-doc markup
x86/sgx: Fix sgx_ioc_enclave_provision() kernel-doc comment
x86/sgx: Return -ERESTARTSYS in sgx_ioc_enclave_add_pages()
selftests/sgx: Use a statically generated 3072-bit RSA key
x86/sgx: Clarify 'laundry_list' locking
x86/sgx: Update MAINTAINERS
Documentation/x86: Document SGX kernel architecture
x86/sgx: Add ptrace() support for the SGX driver
x86/sgx: Add a page reclaimer
selftests/x86: Add a selftest for SGX
x86/vdso: Implement a vDSO for Intel SGX enclave call
x86/traps: Attempt to fixup exceptions in vDSO before signaling
x86/fault: Add a helper function to sanitize error code
x86/vdso: Add support for exception fixup in vDSO functions
x86/sgx: Add SGX_IOC_ENCLAVE_PROVISION
x86/sgx: Add SGX_IOC_ENCLAVE_INIT
x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES
x86/sgx: Add SGX_IOC_ENCLAVE_CREATE
x86/sgx: Add an SGX misc driver interface
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XSkwACgkQEsHwGGHe
VUoYZBAAwGeZCV6KKQERAhMsTKAI+UvGS8rAOBOOTheDHyS/X4Rf63OuSLO/wC3c
NJ9Z4M8GeMmW8K/oeEeUyEI6980FC2I0tQE20J29SPjK37jTowIHjJ26ofKGUP1+
TiqSL2Eo90KqUSVg4GGdDKEuO6by5gS8tRoNE9Of0eGz6ahVOn9PFHt6QVk9cFug
KOByzLjwmBd+XNE/gm56Tl4khOUBYy8+2JGZfewfJOsktln9t+s0WnuB1P7PU+2B
9uI2AAs9JmaO5RzIVHi3VYFsPEzrL3GXBpBCR2XJucIk4h6Km94yNDZW/th5XP6T
vKDNrhg3ErmeI0MgfIKHq+QXxGLobsra9/ZQazg8ZuPvIg47jzo/TaCiLtJGPips
Kr6MLMK6lfQOnOsaDjnGUJPMpOvAmMxOU9o7MVn1XxpDiFr8Fzt1njBMJ/kZikXt
YNOUj+Pws/WfvdFW/QkBBao6iOidEFxgTLJnADeocL0T0y5feVqvrOrWKFXGiL+H
i8ehg4040DZA0PxqxxUYe+EVQgMs41LAHr0zvUxI+hxQBKPFFg6xkpUCOaHd/rSZ
XTx0AsYuaE2RPY/ldDWkWrzGja9FfzTSKwzkDkhECrHXGYaIyiNSCpFZQ4Mqua3c
Z0mT4pLx6blLFqJ+O+Q+juuMSixSVKWe5NwEoG45sWZfGAHIl0M=
=IsGj
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_update_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loader update from Borislav Petkov:
"This one wins the award for most boring pull request ever. But that's
a good thing - this is how I like 'em and the microcode loader
*should* be boring. :-)
A single cleanup removing "break" after a return statement (Tom Rix)"
* tag 'x86_microcode_update_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/amd: Remove unneeded break
- Pass error records logged by firmware through the MCE decoding chain
to provide human-readable error descriptions instead of raw values
(Smita Koralahalli)
- Some #MC handler fixes (Gabriele Paoloni)
- The usual small fixes and cleanups all over.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XSJcACgkQEsHwGGHe
VUob6hAAgSJq1IcftZR4DSk/Mlrt0x4orDNCmoGhxAlOT7ryiidYhXuKV2tvWloA
3v7X5E0r9CroS3PMghQtVOD7qJMjNAWKun5C6zPhLkIeV+CvcLAHhfShlcyWhJ76
PQaHHSxRQGEh5M2Xcp26kdwqrARbhcl66ukBvFNpiNUkLH+robDmIazraI5a2bV/
9azNoVUZBSXQoJYpPz/tBTxu2EToj/xrMVIZg1OPHR5cxtOwUSZCr8V69KGK4onm
avYQY8TSCDxsG1VcywYzBNi2W6lKs2EFlhCVZBLCz0NIkZCYTXErI4OsuExMufLu
t17sAfHjQg+SxEcL5pc+iQkr9i0LLnujKz+Cl0ShtRk6SER0U+9pc/yf0wQSGDhB
AZz87z+a6+r4pxdTSclOkpQCAfRR+pWjNwA5dyi6/72Qbqi6lmwKWDPJnnyq0YS4
UZI01zjs7ir93nS1zwJcekFOJCSTsb6XmhEgMVlpw+YoZHaOki1KJMCU0kIgZt8O
YlEniP/DdXBS0mflOJQnoes7XrcIWVqWEubeRZdoWYnC07hNmdg4XJ0c3Skx8ZW+
gL8kt4pDWlnKHlTlhtgocG3H5BkMazrYEmbograc/Oe8lkr9ESqIS7yS1l8lM7Z6
i0HXATcdvDHV0AqW/uoNczXpck4x8xrahIzyPqAve2G15XEIgq4=
=D9fV
-----END PGP SIGNATURE-----
Merge tag 'ras_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Enable additional logging mode on older Xeons (Tony Luck)
- Pass error records logged by firmware through the MCE decoding chain
to provide human-readable error descriptions instead of raw values
(Smita Koralahalli)
- Some #MC handler fixes (Gabriele Paoloni)
- The usual small fixes and cleanups all over.
* tag 'ras_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Rename kill_it to kill_current_task
x86/mce: Remove redundant call to irq_work_queue()
x86/mce: Panic for LMCE only if mca_cfg.tolerant < 3
x86/mce: Move the mce_panic() call and 'kill_it' assignments to the right places
x86/mce, cper: Pass x86 CPER through the MCA handling chain
x86/mce: Use "safe" MSR functions when enabling additional error logging
x86/mce: Correct the detection of invalid notifier priorities
x86/mce: Assign boolean values to a bool variable
x86/mce: Enable additional error logging on certain Intel CPUs
x86/mce: Remove unneeded break
Pull crypto updates from Herbert Xu:
"API:
- Add speed testing on 1420-byte blocks for networking
Algorithms:
- Improve performance of chacha on ARM for network packets
- Improve performance of aegis128 on ARM for network packets
Drivers:
- Add support for Keem Bay OCS AES/SM4
- Add support for QAT 4xxx devices
- Enable crypto-engine retry mechanism in caam
- Enable support for crypto engine on sdm845 in qce
- Add HiSilicon PRNG driver support"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (161 commits)
crypto: qat - add capability detection logic in qat_4xxx
crypto: qat - add AES-XTS support for QAT GEN4 devices
crypto: qat - add AES-CTR support for QAT GEN4 devices
crypto: atmel-i2c - select CONFIG_BITREVERSE
crypto: hisilicon/trng - replace atomic_add_return()
crypto: keembay - Add support for Keem Bay OCS AES/SM4
dt-bindings: Add Keem Bay OCS AES bindings
crypto: aegis128 - avoid spurious references crypto_aegis128_update_simd
crypto: seed - remove trailing semicolon in macro definition
crypto: x86/poly1305 - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: x86/sha512 - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: cpt - Fix sparse warnings in cptpf
hwrng: ks-sa - Add dependency on IOMEM and OF
crypto: lib/blake2s - Move selftest prototype into header file
crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata
crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()
crypto: ccree - rework cache parameters handling
crypto: cavium - Use dma_set_mask_and_coherent to simplify code
crypto: marvell/octeontx - Use dma_set_mask_and_coherent to simplify code
...
When performing VMGEXIT processing for an SEV-ES guest, register values
will be synced between KVM and the GHCB. Prepare for detecting when a GPR
has been updated (marked dirty) in order to determine whether to sync the
register to the GHCB.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <7ca2a1cdb61456f2fe9c64193e34d601e395c133.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allocate a page during vCPU creation to be used as the encrypted VM save
area (VMSA) for the SEV-ES guest. Provide a flag in the kvm_vcpu_arch
structure that indicates whether the guest state is protected.
When freeing a VMSA page that has been encrypted, the cache contents must
be flushed using the MSR_AMD64_VM_PAGE_FLUSH before freeing the page.
[ i386 build warnings ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <fde272b17eec804f3b9db18c131262fe074015c5.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Update the GHCB accessor functions to add functions for retrieve GHCB
fields by name. Update existing code to use the new accessor functions.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <664172c53a5fb4959914e1a45d88e805649af0ad.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add support to KVM for determining if a system is capable of supporting
SEV-ES as well as determining if a guest is an SEV-ES guest.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e66792323982c822350e40c7a1cf67ea2978a70b.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When both KVM support and the CCP driver are built into the kernel instead
of as modules, KVM initialization can happen before CCP initialization. As
a result, sev_platform_status() will return a failure when it is called
from sev_hardware_setup(), when this isn't really an error condition.
Since sev_platform_status() doesn't need to be called at this time anyway,
remove the invocation from sev_hardware_setup().
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <618380488358b56af558f2682203786f09a49483.1607620209.git.thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On systems that do not have hardware enforced cache coherency between
encrypted and unencrypted mappings of the same physical page, the
hypervisor can use the VM page flush MSR (0xc001011e) to flush the cache
contents of an SEV guest page. When a small number of pages are being
flushed, this can be used in place of issuing a WBINVD across all CPUs.
CPUID 0x8000001f_eax[2] is used to determine if the VM page flush MSR is
available. Add a CPUID feature to indicate it is supported and define the
MSR.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <f1966379e31f9b208db5257509c4a089a87d33d0.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move kvm_machine_check to x86.h to avoid two exact copies
of the same function in kvm.c and svm.c.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201029135600.122392-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Correct a few problems in the x86 and the generic membarrier
implementation. Small corrections for assumptions about visibility
which have turned out not to be true.
- Make the PAT bits for memory encryption correct vs. 4K and 2M/1G page
table entries as they are at a different location.
- Fix a concurrency issue in the the local bandwidth readout of resource
control leading to incorrect values
- Fix the ordering of allocating a vector for an interrupt. The order
missed to respect the provided cpumask when the first attempt of
allocating node local in the mask fails. It then tries the node instead
of trying the full provided mask first. This leads to erroneous error
messages and breaking the (user) supplied affinity request. Reorder it.
- Make the INT3 padding detection in optprobe work correctly.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/WOw0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodUhEACW02Al+mlpZjOhzIGwNUALGX7YGEOi
wJoiIDTV2vOtKhcJ/AOVn05CznoWXa6PN1+7pm0nR2M1XaiXZ3B6HMjy1+0rpSqK
5dCCEqr+QowwMRsKd4hnuJowR/7Og66FiYIcknDJg/1egg+RXy7yKds577W6/KW0
2VV/x35xDywERiQ28qIftQ4L7NREyiioomN/GFpeoQf8tQF06Rb4t12T5pdA6D8S
/C1wj0IKVRMJo/7HbB/X6skxPOK0PWAEBpaT0+Q66VKSNnFVN5ap+rGTKdUoTmuM
NdxHOukpHyFCQtbRPOjRUeeSZJLKSX5oXsBO1GUvyyxcT2XNlTOdNamYLWyO+RfV
uP97qhzYDKL+cgDkwypJ0WOzOb9EIXOh4P9BTnJFBGhc4EQwen3cpb3CyWWftjnv
/obXiRDnAOE6P6H2AGiwrK3Pny+SvgrFYKMe+iy+ntToz1yrDh7ZrZ0DQKVoFWEr
n3qUnlPZmVvRzHIRVYoK69nS/UgmNN0LssavzRBzab3BcK93f23QkW86P42kNCLa
9kL+ZgGwlpqUZZR/p3pHq9Mv2ZXGEdUYY99h2vy1fgMwOw/RQZnPDAj/131UXlsU
4DL0mAlTK1/0/81H6/V1GDBkMkO+hN4x6Y3asi7bPHEKEzlLe+P1tUt3YELd3CEC
dc8ebICG1InXsw==
=t31y
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of x86 and membarrier fixes:
- Correct a few problems in the x86 and the generic membarrier
implementation. Small corrections for assumptions about visibility
which have turned out not to be true.
- Make the PAT bits for memory encryption correct vs 4K and 2M/1G
page table entries as they are at a different location.
- Fix a concurrency issue in the the local bandwidth readout of
resource control leading to incorrect values
- Fix the ordering of allocating a vector for an interrupt. The order
missed to respect the provided cpumask when the first attempt of
allocating node local in the mask fails. It then tries the node
instead of trying the full provided mask first. This leads to
erroneous error messages and breaking the (user) supplied affinity
request. Reorder it.
- Make the INT3 padding detection in optprobe work correctly"
* tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kprobes: Fix optprobe to detect INT3 padding correctly
x86/apic/vector: Fix ordering in vector assignment
x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
membarrier: Execute SYNC_CORE on the calling thread
membarrier: Explicitly sync remote cores when SYNC_CORE is requested
membarrier: Add an actual barrier before rseq_preempt()
x86/membarrier: Get rid of a dubious optimization
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/UDHQUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMGeQf9EtGft5U5EihqAbNr2O61Bh4ptCIT
+qNWWfuGQkKLsP6PCHMUJnNI3WJy2/Gb5+nUHjFXSEZBP2l3KGRuDniAdm4+DyEi
2khVmJiXYn2q2yfodmpHA/dqav3OHSrsq2IfH+J+WAFlIHnjkdz3Wk1zNFk7Y/xv
PVv2czvXhsnrvHvNp5e1+YsVGkMZc9fwXLRbac7ptmaKUKCBAgpZO8Gkc2GGgOdE
zUDp3qA8/7Ys+vzzYfPrRMUhev9dgE4x2TBmtOuzqOcfj2FOKRbKbwjur37fJ61j
Px4F2ZI0GEL0RrHvZK1vZ5KO41BcD+gQPumKAg1Lgz312loKj85RG8nBEQ==
=BJ9g
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Bugfixes for ARM, x86 and tools"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
tools/kvm_stat: Exempt time-based counters
KVM: mmu: Fix SPTE encoding of MMIO generation upper half
kvm: x86/mmu: Use cpuid to determine max gfn
kvm: svm: de-allocate svm_cpu_data for all cpus in svm_cpu_uninit()
selftests: kvm/set_memory_region_test: Fix race in move region test
KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort()
KVM: arm64: Fix handling of merging tables into a block entry
KVM: arm64: Fix memory leak on stage2 update of a valid PTE
Commit
7705dc8557 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")
changed the padding bytes between functions from NOP to INT3. However,
when optprobe decodes a target function it finds INT3 and gives up the
jump optimization.
Instead of giving up any INT3 detection, check whether the rest of the
bytes to the end of the function are INT3. If all of them are INT3,
those come from the linker. In that case, continue the optprobe jump
optimization.
[ bp: Massage commit message. ]
Fixes: 7705dc8557 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")
Reported-by: Adam Zabrocki <pi3@pi3.com.pl>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/160767025681.3880685.16021570341428835411.stgit@devnote2
Commit cae7ed3c2c ("KVM: x86: Refactor the MMIO SPTE generation handling")
cleaned up the computation of MMIO generation SPTE masks, however it
introduced a bug how the upper part was encoded:
SPTE bits 52-61 were supposed to contain bits 10-19 of the current
generation number, however a missing shift encoded bits 1-10 there instead
(mostly duplicating the lower part of the encoded generation number that
then consisted of bits 1-9).
In the meantime, the upper part was shrunk by one bit and moved by
subsequent commits to become an upper half of the encoded generation number
(bits 9-17 of bits 0-17 encoded in a SPTE).
In addition to the above, commit 56871d444b ("KVM: x86: fix overlap between SPTE_MMIO_MASK and generation")
has changed the SPTE bit range assigned to encode the generation number and
the total number of bits encoded but did not update them in the comment
attached to their defines, nor in the KVM MMU doc.
Let's do it here, too, since it is too trivial thing to warrant a separate
commit.
Fixes: cae7ed3c2c ("KVM: x86: Refactor the MMIO SPTE generation handling")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <156700708db2a5296c5ed7a8b9ac71f1e9765c85.1607129096.git.maciej.szmigiero@oracle.com>
Cc: stable@vger.kernel.org
[Reorganize macros so that everything is computed from the bit ranges. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Until commit e7c587da12 ("x86/speculation: Use synthetic bits for
IBRS/IBPB/STIBP"), KVM was testing both Intel and AMD CPUID bits before
allowing the guest to write MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD.
Testing only Intel bits on VMX processors, or only AMD bits on SVM
processors, fails if the guests are created with the "opposite" vendor
as the host.
While at it, also tweak the host CPU check to use the vendor-agnostic
feature bit X86_FEATURE_IBPB, since we only care about the availability
of the MSR on the host here and not about specific CPUID bits.
Fixes: e7c587da12 ("x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP")
Cc: stable@vger.kernel.org
Reported-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AVX512_FP16 is supported by Intel processors, like Sapphire Rapids.
It could gain better performance for it's faster compared to FP32
if the precision or magnitude requirements are met. It's availability
is indicated by CPUID.(EAX=7,ECX=0):EDX[bit 23].
Expose it in KVM supported CPUID, then guest could make use of it; no
new registers are used, only new instructions.
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20201208033441.28207-3-kyung.min.park@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enumerate AVX512 Half-precision floating point (FP16) CPUID feature
flag. Compared with using FP32, using FP16 cut the number of bits
required for storage in half, reducing the exponent from 8 bits to 5,
and the mantissa from 23 bits to 10. Using FP16 also enables developers
to train and run inference on deep learning models fast when all
precision or magnitude (FP32) is not needed.
A processor supports AVX512 FP16 if CPUID.(EAX=7,ECX=0):EDX[bit 23]
is present. The AVX512 FP16 requires AVX512BW feature be implemented
since the instructions for manipulating 32bit masks are associated with
AVX512BW.
The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx512_fp16" in /proc/cpuinfo.
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20201208033441.28207-2-kyung.min.park@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Saves one byte in __vmx_vcpu_run for the same functionality.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201029140457.126965-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV, all DMA to and from guest has to use shared (un-encrypted) pages.
SEV uses SWIOTLB to make this happen without requiring changes to device
drivers. However, depending on the workload being run, the default 64MB
of it might not be enough and it may run out of buffers to use for DMA,
resulting in I/O errors and/or performance degradation for high
I/O workloads.
Adjust the default size of SWIOTLB for SEV guests using a
percentage of the total memory available to guest for the SWIOTLB buffers.
Adds a new sev_setup_arch() function which is invoked from setup_arch()
and it calls into a new swiotlb generic code function swiotlb_adjust_size()
to do the SWIOTLB buffer adjustment.
v5 fixed build errors and warnings as
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Commit
57d563c829 ("x86: ia32_setup_rt_frame(): consolidate uaccess areas")
dropped a __user annotation in a cast when refactoring __put_user() to
unsafe_put_user().
Hence, since then, sparse warns in arch/x86/ia32/ia32_signal.c:350:9:
warning: cast removes address space '__user' of expression
warning: incorrect type in argument 1 (different address spaces)
expected void const volatile [noderef] __user *ptr
got unsigned long long [usertype] *
Add the __user annotation to restore the propagation of address spaces.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201207124141.21859-1-lukas.bulwahn@gmail.com
The value freq_max/freq_base is a fundamental component of frequency
invariance calculations. It may come from a variety of sources such as MSRs
or ACPI data, tracking it down when troubleshooting a system could be
non-trivial. It is worth saving it in the kernel logs.
# dmesg | grep 'Estimated ratio of average max'
[ 14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz
This is the first pass in creating the ability to calculate the
frequency invariance on AMD systems. This approach uses the CPPC
highest performance and nominal performance values that range from
0 - 255 instead of a high and base frquency. This is because we do
not have the ability on AMD to get a highest frequency value.
On AMD systems the highest performance and nominal performance
vaues do correspond to the highest and base frequencies for the system
so using them should produce an appropriate ratio but some tweaking
is likely necessary.
Due to CPPC being initialized later in boot than when the frequency
invariant calculation is currently made, I had to create a callback
from the CPPC init code to do the calculation after we have CPPC
data.
Special thanks to "kernel test robot <lkp@intel.com>" for reporting that
compilation of drivers/acpi/cppc_acpi.c is conditional to
CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI.
[ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ]
Signed-off-by: Nathan Fontenot <nathan.fontenot@amd.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz
Mark tripped over the creative irqflags handling in the IO-APIC timer
delivery check which ends up doing:
local_irq_save(flags);
local_irq_enable();
local_irq_restore(flags);
which triggered a new consistency check he's working on required for
replacing the POPF based restore with a conditional STI.
That code is a historical mess and none of this is needed. Make it
straightforward use local_irq_disable()/enable() as that's all what is
required. It is invoked from interrupt enabled code nowadays.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/87k0tpju47.fsf@nanos.tec.linutronix.de
Prarit reported that depending on the affinity setting the
' irq $N: Affinity broken due to vector space exhaustion.'
message is showing up in dmesg, but the vector space on the CPUs in the
affinity mask is definitely not exhausted.
Shung-Hsi provided traces and analysis which pinpoints the problem:
The ordering of trying to assign an interrupt vector in
assign_irq_vector_any_locked() is simply wrong if the interrupt data has a
valid node assigned. It does:
1) Try the intersection of affinity mask and node mask
2) Try the node mask
3) Try the full affinity mask
4) Try the full online mask
Obviously #2 and #3 are in the wrong order as the requested affinity
mask has to take precedence.
In the observed cases #1 failed because the affinity mask did not contain
CPUs from node 0. That made it allocate a vector from node 0, thereby
breaking affinity and emitting the misleading message.
Revert the order of #2 and #3 so the full affinity mask without the node
intersection is tried before actually affinity is broken.
If no node is assigned then only the full affinity mask and if that fails
the full online mask is tried.
Fixes: d6ffc6ac83 ("x86/vector: Respect affinity mask in irq descriptor")
Reported-by: Prarit Bhargava <prarit@redhat.com>
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87ft4djtyp.fsf@nanos.tec.linutronix.de
The MBA software controller (mba_sc) is a feedback loop which
periodically reads MBM counters and tries to restrict the bandwidth
below a user-specified value. It tags along the MBM counter overflow
handler to do the updates with 1s interval in mbm_update() and
update_mba_bw().
The purpose of mbm_update() is to periodically read the MBM counters to
make sure that the hardware counter doesn't wrap around more than once
between user samplings. mbm_update() calls __mon_event_count() for local
bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count()
instead when mba_sc is enabled. __mon_event_count() will not be called
for local bandwidth updating in MBM counter overflow handler, but it is
still called when reading MBM local bandwidth counter file
'mbm_local_bytes', the call path is as below:
rdtgroup_mondata_show()
mon_event_read()
mon_event_count()
__mon_event_count()
In __mon_event_count(), m->chunks is updated by delta chunks which is
calculated from previous MSR value (m->prev_msr) and current MSR value.
When mba_sc is enabled, m->chunks is also updated in mbm_update() by
mistake by the delta chunks which is calculated from m->prev_bw_msr
instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in
the mba_sc feedback loop.
When reading MBM local bandwidth counter file, m->chunks was changed
unexpectedly by mbm_bw_count(). As a result, the incorrect local
bandwidth counter which calculated from incorrect m->chunks is shown to
the user.
Fix this by removing incorrect m->chunks updating in mbm_bw_count() in
MBM counter overflow handler, and always calling __mon_event_count() in
mbm_update() to make sure that the hardware local bandwidth counter
doesn't wrap around.
Test steps:
# Run workload with aggressive memory bandwidth (e.g., 10 GB/s)
git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat
&& make
./tools/membw/membw -c 0 -b 10000 --read
# Enable MBA software controller
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
# Create control group c1
mkdir /sys/fs/resctrl/c1
# Set MB throttle to 6 GB/s
echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata
# Write PID of the workload to tasks file
echo `pidof membw` > /sys/fs/resctrl/c1/tasks
# Read local bytes counters twice with 1s interval, the calculated
# local bandwidth is not as expected (approaching to 6 GB/s):
local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
sleep 1
local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
echo "local b/w (bytes/s):" `expr $local_2 - $local_1`
Before fix:
local b/w (bytes/s): 11076796416
After fix:
local b/w (bytes/s): 5465014272
Fixes: ba0f26d852 (x86/intel_rdt/mba_sc: Prepare for feedback loop)
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/1607063279-19437-1-git-send-email-xiaochen.shen@intel.com
The PAT bit is in different locations for 4k and 2M/1G page table
entries.
Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
index for write-protected pages.
Fixes: 6ebcb06071 ("x86/mm: Add support to encrypt the kernel in-place")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201111160946.147341-1-nivedita@alum.mit.edu
In the commit 1c96dcceae
("KVM: x86: fix apic_accept_events vs check_nested_events"),
we accidently started latching SIPIs that are received while the cpu is not
waiting for them.
This causes vCPUs to never enter a halted state.
Fixes: 1c96dcceae ("KVM: x86: fix apic_accept_events vs check_nested_events")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201203143319.159394-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tremont has four L1 Topdown events, TOPDOWN_FE_BOUND.ALL,
TOPDOWN_BAD_SPECULATION.ALL, TOPDOWN_BE_BOUND.ALL and
TOPDOWN_RETIRING.ALL. They are available on GP counters.
Export them to sysfs and facilitate the perf stat tool.
$perf stat --topdown -- sleep 1
Performance counter stats for 'sleep 1':
retiring bad speculation frontend bound
backend bound
24.9% 16.8% 31.7%
26.6%
1.001224610 seconds time elapsed
0.001150000 seconds user
0.000000000 seconds sys
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1607457952-3519-1-git-send-email-kan.liang@linux.intel.com
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a fallthrough pseudo-keyword as a replacement for
a /* fall through */ comment, instead of letting the code fall through
to the next case.
Notice that Clang doesn't recognize /* fall through */ comments as
implicit fall-through markings.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of just letting the code
fall through to the next case.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/KSPP/linux/issues/115
The cycle count of a timed LBR is always 1 in perf record -D.
The cycle count is stored in the first 16 bits of the IA32_LBR_x_INFO
register, but the get_lbr_cycles() return Boolean type.
Use u16 to replace the Boolean type.
Fixes: 47125db27e ("perf/x86/intel/lbr: Support Architectural LBR")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201125213720.15692-2-kan.liang@linux.intel.com
According to the event list from icelake_core_v1.09.json, the encoding
of the RTM_RETIRED.ABORTED event on Ice Lake should be,
"EventCode": "0xc9",
"UMask": "0x04",
"EventName": "RTM_RETIRED.ABORTED",
Correct the wrong encoding.
Fixes: 6017608936 ("perf/x86/intel: Add Icelake support")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201125213720.15692-1-kan.liang@linux.intel.com
Fix to restore BTF if single-stepping causes a page fault and
it is cancelled.
Usually the BTF flag was restored when the single stepping is done
(in resume_execution()). However, if a page fault happens on the
single stepping instruction, the fault handler is invoked and
the single stepping is cancelled. Thus, the BTF flag is not
restored.
Fixes: 1ecc798c67 ("x86: debugctlmsr kprobes")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/160389546985.106936.12727996109376240993.stgit@devnote2
sync_core_before_usermode() had an incorrect optimization. If the kernel
returns from an interrupt, it can get to usermode without IRET. It just has
to schedule to a different task in the same mm and do SYSRET. Fortunately,
there were no callers of sync_core_before_usermode() that could have had
in_irq() or in_nmi() equal to true, because it's only ever called from the
scheduler.
While at it, clarify a related comment.
Fixes: 70216e18e5 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/5afc7632be1422f91eaf7611aaaa1b5b8580a086.1607058304.git.luto@kernel.org
Commit
26bfa5f894 ("x86, amd: Cleanup init_amd")
moved the code that remaps the TSEG region using 4k pages from
init_amd() to bsp_init_amd().
However, bsp_init_amd() is executed well before the direct mapping is
actually created:
setup_arch()
-> early_cpu_init()
-> early_identify_cpu()
-> this_cpu->c_bsp_init()
-> bsp_init_amd()
...
-> init_mem_mapping()
So the change effectively disabled the 4k remapping, because
pfn_range_is_mapped() is always false at this point.
It has been over six years since the commit, and no-one seems to have
noticed this, so just remove the code. The original code was also
incomplete, since it doesn't check how large the TSEG address range
actually is, so it might remap only part of it in any case.
Hygon has copied the incorrect version, so the code has never run on it
since the cpu support was added two years ago. Remove it from there as
well.
Committer notes:
This workaround is incomplete anyway:
1. The code must check MSRC001_0113.TValid (SMM TSeg Mask MSR) first, to
check whether the TSeg address range is enabled.
2. The code must check whether the range is not 2M aligned - if it is,
there's nothing to work around.
3. In all the BIOSes tested, the TSeg range is in a e820 reserved area
and those are not mapped anymore, after
66520ebc2d ("x86, mm: Only direct map addresses that are marked as E820_RAM")
which means, there's nothing to be worked around either.
So let's rip it out.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201127171324.1846019-1-nivedita@alum.mit.edu
After having collected the majority of reports about MSRs being written
by userspace tools and what tools those are, and all newer reports
mostly repeating, add an URL where detailed information is gathered and
kept up-to-date.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201205002825.19107-1-bp@alien8.de
Add "deprecated" message to any access to old /proc/sgi_uv/* leaves.
[ bp: Do not have a trailing function opening brace and the arguments
continuing on the next line and align them on the opening brace. ]
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lkml.kernel.org/r/20201128034227.120869-5-mike.travis@hpe.com
Add kernel interfaces used to obtain info for the uv_sysfs driver
to display.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lkml.kernel.org/r/20201128034227.120869-2-mike.travis@hpe.com
Update kernel-doc parameter name after
c3d6324f84 ("x86/alternatives: Teach text_poke_bp() to emulate instructions")
changed the last parameter from @handler to @emulate.
[ bp: Make commit message more precise. ]
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201203145020.2441-1-hqjagain@gmail.com
- Make the AMD L3 QoS code and data priorization enable/disable mechanism
work correctly. The control bit was only set/cleared on one of the CPUs
in a L3 domain, but it has to be modified on all CPUs in the domain. The
initial documentation was not clear about this, but the updated one from
Oct 2020 spells it out.
- Fix an off by one in the UV platform detection code which causes the UV
hubs to be identified wrongly. The chip revisions start at 1 not at 0.
- Fix a long standing bug in the evaluation of prefixes in the uprobes
code which fails to handle repeated prefixes properly. The aggregate
size of the prefixes can be larger than the bytes array but the code
blindly iterated over the aggregate size beyond the array boundary.
Add a macro to handle this case properly and use it at the affected
places.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/M2GoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUGeD/0TxQyVIZTJjY/ExDYvQZeSWgvFVon9
A/QcwkgtkWzYKI3YThgxBtSNKhHPP8eQ9QK8ErcaQKEHU5TdXVEOwHgTm1A6OGat
0+M9/EWEe7tTu+cLpu7esQ+1VbSvEcFXZbljbhCKrShlBlsIjFG4eCPAprSw9yjI
RSgfXKZu4NmHVS6nfJTwmIIaTeLQ6U3b7b1D5s/66slBFScqnLbRNhABVbHbos4F
pl/lxDCFOddy2YbEojHjjGqMA7oxPav7c0nYFOM/zG+wAqfEjbqOxReT31bGQPi2
XT9K4JEqDqILo0KnhV4GsYoWAhes3BtmsJ9IoZ7IijsMriYl80mD9URAORidJ4PX
28Ckk9V/DlE8uDrAnBDcWDSoKlg78mhVV7V9L6v43teg/gJfSZNROtNDBmqRmwG4
Op2NJfzJITtaxVQuSZRkSs8rzGv+QUfaM1sBUQ+Oz4KYeIjjA7G2MAOECrzIAWKB
GWc5toYRVS6oGT+RbZhSxZYoh8ASoGJ2MrL8K4OV4RqEqHHcXcih0WmmljtsDIFI
td4FHHH6fghIb9S6iYKiApd6k2qKa33mwJwa/xZOoIrv0w5xT0WDJnxT60gu/Mec
YDkqhmA009CNSD2G4oNRNF5MH7gp34UII+25jOGatbVh+5DDPYs+5Jnh/DR7jssR
PryAG9ER7UUb6w==
=rNBq
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of fixes for x86:
- Make the AMD L3 QoS code and data priorization enable/disable
mechanism work correctly.
The control bit was only set/cleared on one of the CPUs in a L3
domain, but it has to be modified on all CPUs in the domain. The
initial documentation was not clear about this, but the updated one
from Oct 2020 spells it out.
- Fix an off by one in the UV platform detection code which causes
the UV hubs to be identified wrongly.
The chip revisions start at 1 not at 0.
- Fix a long standing bug in the evaluation of prefixes in the
uprobes code which fails to handle repeated prefixes properly.
The aggregate size of the prefixes can be larger than the bytes
array but the code blindly iterated over the aggregate size beyond
the array boundary. Add a macro to handle this case properly and
use it at the affected places"
* tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes
x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes
x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
x86/platform/uv: Fix UV4 hub revision adjustment
x86/resctrl: Fix AMD L3 QOS CDP enable/disable
- Add recursion protection to another callchain invoked from
x86_pmu_stop() which can recurse back into x86_pmu_stop(). The first
attempt to fix this missed this extra code path.
- Use the already filtered status variable to check for PEBS counter
overflow bits and not the unfiltered full status read from
IA32_PERF_GLOBAL_STATUS which can have unrelated bits check which
would be evaluated incorrectly.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/M1lwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoeKPEACo/xfvFgp/rlSIjh12OW7XTyKqKMVq
C8TvMFbzn3Pzx89RyJctyyi1d5xYz6ic3uXkV0yGJ1g9fkJdzhgCY/byoTd/nWor
/r6RRtw5Ah3werXlESbZWTb6691eqGO4TpWZMAwdDeV1eRAc7ZSoOLJiY0o4HVWd
AE2aX9/psikHBAeMzMHn5W4pELl0s6s6aX9OJLFFZeroA3UW3bWOHu8o/iylmld9
wnQPvGe+kk35ZBza8/kbBJv3ZasUBPXy1+9Z1p1aE8dmd+HrYnw9vS3El9dgtX1Y
ADKttAD3S3j/PbBOXOnIRoSTM7Vvc5gviLyPfw6erBIMtw/OBFSrRD7IbVJpyZmn
yysBgacu6pzeLXRZ4ZktQ2GPeAgdiZwsqwuvuQGd5t1zfdNkZEEI7aK87CrHCYOk
zMt7vNK/q5xF9wt9C3cYzka/Sp7UA10zqP31/fv2fOZb4dcANTtm9reS/fwqd67X
HYhYiks4z1wxqTOf76wBpXau9ZtQLb20MmC3S6IknDF39pODO1sGcWMJSB0AgYUE
3125ZTxrmlgv+U7S5qBHKB5ot5wvNUexiDD0g0o2N9x4dCQqErmvwUDWWUoqq35A
g4tqODudwM4aU05leWO/piZEBr4xxAiRke9r8WJZNHAo0C7yHjjmXRtZYDkSf2pV
U0gyy4dWTsXKKw==
=YbCo
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"Two fixes for performance monitoring on X86:
- Add recursion protection to another callchain invoked from
x86_pmu_stop() which can recurse back into x86_pmu_stop(). The
first attempt to fix this missed this extra code path.
- Use the already filtered status variable to check for PEBS counter
overflow bits and not the unfiltered full status read from
IA32_PERF_GLOBAL_STATUS which can have unrelated bits check which
would be evaluated incorrectly"
* tag 'perf-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Check PEBS status correctly
perf/x86/intel: Fix a warning on x86_pmu_stop() with large PEBS
- Move -Wcast-align to W=3, which tends to be false-positive and there
is no tree-wide solution.
- Pass -fmacro-prefix-map to KBUILD_CPPFLAGS because it is a preprocessor
option and makes sense for .S files as well.
- Disable -gdwarf-2 for Clang's integrated assembler to avoid warnings.
- Disable --orphan-handling=warn for LLD 10.0.1 to avoid warnings.
- Fix undesirable line breaks in *.mod files.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl/MzyMVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGKJ8P/2kLq296XAPjqC90/LWMja8dsXO/
Wgaq8zC819x0JFuGdBKlwlFe3AvFYRtts9V5+mzjxvsOjH/6+xzyrXjRPCwZYqlj
XKC3ZwuS2SGDPFCriI1edwTUp5tyDnG/VBjqbf3ybQnz0LAShidXBD9IlM/XX9Rz
BlWqd7Uib50Pq8AfM2JVokrSmkkvhqxocIsmjTa0wvRjRAw7+aVkGNCWXqnTho7y
YuHmTWbmUQIROF3Bzs1fkGp+qaQofPRfA1tTwaTVvgmt8rEqyzXi11y6kj56INfg
/pq4O1KrplKtJFdrcjj4/eptqHG3I+Jq56qCHVescF6+bH6cc6BUL8qDdAzFZQai
e/pWCzREqFDKchEmT2d0Uzik8Zfxi5Cw68Otpzb4LqTUUxXSoRx1R9Of/Ei5QZum
6b6s9Q41UwH983UQCOOSGjXGZYP6fZG1a0XejbduYo7TL4KEECAO/FlLBWGttYH3
0i3aKz3aDKb/fo7hDbbqg+o6F0mShEraqxMmWgIvgGt+k76j0O0wS2KryqpTd7Vv
xg72suGM7f9QBA50lZ0r32fm86XnlqwQAm9ZMaSXR1Ii7j4F9UNRmR/FUYq7dPwa
COkuHr+9LqzV/tkluWi2rjLIGPaCuEVeSCcQ/wIDdp2iOyb54CbozwK0Yi2dxxus
jVFKwSaMUDHrkSj6
=/ysh
-----END PGP SIGNATURE-----
Merge tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Move -Wcast-align to W=3, which tends to be false-positive and there
is no tree-wide solution.
- Pass -fmacro-prefix-map to KBUILD_CPPFLAGS because it is a
preprocessor option and makes sense for .S files as well.
- Disable -gdwarf-2 for Clang's integrated assembler to avoid warnings.
- Disable --orphan-handling=warn for LLD 10.0.1 to avoid warnings.
- Fix undesirable line breaks in *.mod files.
* tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: avoid split lines in .mod files
kbuild: Disable CONFIG_LD_ORPHAN_WARN for ld.lld 10.0.1
kbuild: Hoist '--orphan-handling' into Kconfig
Kbuild: do not emit debug info for assembly with LLVM_IAS=1
kbuild: use -fmacro-prefix-map for .S sources
Makefile.extrawarn: move -Wcast-align to W=3
Since insn.prefixes.nbytes can be bigger than the size of
insn.prefixes.bytes[] when a prefix is repeated, the proper
check must be:
insn.prefixes.bytes[i] != 0 and i < 4
instead of using insn.prefixes.nbytes. Use the new
for_each_insn_prefix() macro which does it correctly.
Debugged by Kees Cook <keescook@chromium.org>.
[ bp: Massage commit message. ]
Fixes: 25189d08e5 ("x86/sev-es: Add support for handling IOIO exceptions")
Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/160697106089.3146288.2052422845039649176.stgit@devnote2
Since insn.prefixes.nbytes can be bigger than the size of
insn.prefixes.bytes[] when a prefix is repeated, the proper check must
be
insn.prefixes.bytes[i] != 0 and i < 4
instead of using insn.prefixes.nbytes. Use the new
for_each_insn_prefix() macro which does it correctly.
Debugged by Kees Cook <keescook@chromium.org>.
[ bp: Massage commit message. ]
Fixes: 32d0b95300 ("x86/insn-eval: Add utility functions to get segment selector")
Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/160697104969.3146288.16329307586428270032.stgit@devnote2
Since insn.prefixes.nbytes can be bigger than the size of
insn.prefixes.bytes[] when a prefix is repeated, the proper check must
be
insn.prefixes.bytes[i] != 0 and i < 4
instead of using insn.prefixes.nbytes.
Introduce a for_each_insn_prefix() macro for this purpose. Debugged by
Kees Cook <keescook@chromium.org>.
[ bp: Massage commit message, sync with the respective header in tools/
and drop "we". ]
Fixes: 2b14449835 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints")
Reported-by: syzbot+9b64b619f10f19d19a7c@syzkaller.appspotmail.com
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/160697103739.3146288.7437620795200799020.stgit@devnote2
In the TDP MMU, use shadow_phys_bits to dermine the maximum possible GFN
mapped in the guest for zapping operations. boot_cpu_data.x86_phys_bits
may be reduced in the case of HW features that steal HPA bits for other
purposes. However, this doesn't necessarily reduce GPA space that can be
accessed via TDP. So zap based on a maximum gfn calculated with MAXPHYADDR
retrieved from CPUID. This is already stored in shadow_phys_bits, so use
it instead of x86_phys_bits.
Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-Id: <20201203231120.27307-1-rick.p.edgecombe@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The cpu arg for svm_cpu_uninit() was previously ignored resulting in the
per cpu structure svm_cpu_data not being de-allocated for all cpus.
Signed-off-by: Jacob Xu <jacobhxu@google.com>
Message-Id: <20201203205939.1783969-1-jacobhxu@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CMP $0,%reg can't set overflow flag, so we can use shorter TEST %reg,%reg
instruction when only zero and sign flags are checked (E,L,LE,G,GE conditions).
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
CMP $0,%reg can't set overflow flag, so we can use shorter TEST %reg,%reg
instruction when only zero and sign flags are checked (E,L,LE,G,GE conditions).
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
CMP $0,%reg can't set overflow flag, so we can use shorter TEST %reg,%reg
instruction when only zero and sign flags are checked (E,L,LE,G,GE conditions).
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The sgx_enclave_add_pages.length field is documented as
* @length: length of the data (multiple of the page size)
Fail with -EINVAL, when the caller gives a zero length buffer of data
to be added as pages to an enclave. Right now 'ret' is returned as
uninitialized in that case.
[ bp: Flesh out commit message. ]
Fixes: c6d26d3707 ("x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-sgx/X8ehQssnslm194ld@mwanda/
Link: https://lkml.kernel.org/r/20201203183527.139317-1-jarkko@kernel.org
Since the ASID is now stored in svm->asid, pre_sev_run should also place
it there and not directly in the VMCB control area.
Reported-by: Ashish Kalra <Ashish.Kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kernel cannot disambiguate when 2+ PEBS counters overflow at the
same time. This is what the comment for this code suggests. However,
I see the comparison is done with the unfiltered p->status which is a
copy of IA32_PERF_GLOBAL_STATUS at the time of the sample. This
register contains more than the PEBS counter overflow bits. It also
includes many other bits which could also be set.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201126110922.317681-2-namhyung@kernel.org
The commit 3966c3feca ("x86/perf/amd: Remove need to check "running"
bit in NMI handler") introduced this. It seems x86_pmu_stop can be
called recursively (like when it losts some samples) like below:
x86_pmu_stop
intel_pmu_disable_event (x86_pmu_disable)
intel_pmu_pebs_disable
intel_pmu_drain_pebs_nhm (x86_pmu_drain_pebs_buffer)
x86_pmu_stop
While commit 35d1ce6bec ("perf/x86/intel/ds: Fix x86_pmu_stop
warning for large PEBS") fixed it for the normal cases, there's
another path to call x86_pmu_stop() recursively when a PEBS error was
detected (like two or more counters overflowed at the same time).
Like in the Kan's previous fix, we can skip the interrupt accounting
for large PEBS, so check the iregs which is set for PMI only.
Fixes: 3966c3feca ("x86/perf/amd: Remove need to check "running" bit in NMI handler")
Reported-by: John Sperbeck <jsperbeck@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201126110922.317681-1-namhyung@kernel.org