Commit Graph

38412 Commits

Author SHA1 Message Date
Sean Christopherson 44f1b5586d KVM: SVM: Enhance and clean up the vmcb tracking comment in pre_svm_run()
Explicitly document why a vmcb must be marked dirty and assigned a new
asid when it will be run on a different cpu.  The "what" is relatively
obvious, whereas the "why" requires reading the APM and/or KVM code.

Opportunistically remove a spurious period and several unnecessary
newlines in the comment.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406171811.4043363-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:50 -04:00
Sean Christopherson 554cf31474 KVM: SVM: Add a comment to clarify what vcpu_svm.vmcb points at
Add a comment above the declaration of vcpu_svm.vmcb to call out that it
is simply a shorthand for current_vmcb->ptr.  The myriad accesses to
svm->vmcb are quite confusing without this crucial detail.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406171811.4043363-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:49 -04:00
Sean Christopherson d1788191fd KVM: SVM: Drop vcpu_svm.vmcb_pa
Remove vmcb_pa from vcpu_svm and simply read current_vmcb->pa directly in
the one path where it is consumed.  Unlike svm->vmcb, use of the current
vmcb's address is very limited, as evidenced by the fact that its use
can be trimmed to a single dereference.

Opportunistically add a comment about using vmcb01 for VMLOAD/VMSAVE, at
first glance using vmcb01 instead of vmcb_pa looks wrong.

No functional change intended.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406171811.4043363-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:49 -04:00
Sean Christopherson 17e5e964ee KVM: SVM: Don't set current_vmcb->cpu when switching vmcb
Do not update the new vmcb's last-run cpu when switching to a different
vmcb.  If the vCPU is migrated between its last run and a vmcb switch,
e.g. for nested VM-Exit, then setting the cpu without marking the vmcb
dirty will lead to KVM running the vCPU on a different physical cpu with
stale clean bit settings.

                          vcpu->cpu    current_vmcb->cpu    hardware
  pre_svm_run()           cpu0         cpu0                 cpu0,clean
  kvm_arch_vcpu_load()    cpu1         cpu0                 cpu0,clean
  svm_switch_vmcb()       cpu1         cpu1                 cpu0,clean
  pre_svm_run()           cpu1         cpu1                 kaboom

Simply delete the offending code; unlike VMX, which needs to update the
cpu at switch time due to the need to do VMPTRLD, SVM only cares about
which cpu last ran the vCPU.

Fixes: af18fa775d ("KVM: nSVM: Track the physical cpu of the vmcb vmrun through the vmcb")
Cc: Cathy Avery <cavery@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210406171811.4043363-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:49 -04:00
Ingo Molnar 27743f01e3 x86/platform/uv: Remove dead !CONFIG_KEXEC_CORE code
The !CONFIG_KEXEC_CORE code in arch/x86/platform/uv/uv_nmi.c was unused, untested
and didn't even build for 7 years. Since we fixed this by requiring X86_UV to
depend on CONFIG_KEXEC_CORE, remove the (now) dead code.

Also move the uv_nmi_kexec_failed definition back up to where the other file-scope
global variables are defined.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Mike Travis <travis@sgi.com>
Cc: linux-kernel@vger.kernel.org
2021-04-20 10:08:34 +02:00
Ingo Molnar c2209ea556 x86/platform/uv: Fix !KEXEC build failure
When KEXEC is disabled, the UV build fails:

  arch/x86/platform/uv/uv_nmi.c:875:14: error: ‘uv_nmi_kexec_failed’ undeclared (first use in this function)

Since uv_nmi_kexec_failed is only defined in the KEXEC_CORE #ifdef branch,
this code cannot ever have been build tested:

	if (main)
		pr_err("UV: NMI kdump: KEXEC not supported in this kernel\n");
	atomic_set(&uv_nmi_kexec_failed, 1);

Nor is this use possible in uv_handle_nmi():

                atomic_set(&uv_nmi_kexec_failed, 0);

These bugs were introduced in this commit:

    d0a9964e9873: ("x86/platform/uv: Implement simple dump failover if kdump fails")

Which added the uv_nmi_kexec_failed assignments to !KEXEC code, while making the
definition KEXEC-only - apparently without testing the !KEXEC case.

Instead of complicating the #ifdef maze, simplify the code by requiring X86_UV
to depend on KEXEC_CORE. This pattern is present in other architectures as well.

( We'll remove the untested, 7 years old !KEXEC complications from the file in a
  separate commit. )

Fixes: d0a9964e9873: ("x86/platform/uv: Implement simple dump failover if kdump fails")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Mike Travis <travis@sgi.com>
Cc: linux-kernel@vger.kernel.org
2021-04-20 10:08:23 +02:00
Tom Lendacky a3ba26ecfb KVM: SVM: Make sure GHCB is mapped before updating
Access to the GHCB is mainly in the VMGEXIT path and it is known that the
GHCB will be mapped. But there are two paths where it is possible the GHCB
might not be mapped.

The sev_vcpu_deliver_sipi_vector() routine will update the GHCB to inform
the caller of the AP Reset Hold NAE event that a SIPI has been delivered.
However, if a SIPI is performed without a corresponding AP Reset Hold,
then the GHCB might not be mapped (depending on the previous VMEXIT),
which will result in a NULL pointer dereference.

The svm_complete_emulated_msr() routine will update the GHCB to inform
the caller of a RDMSR/WRMSR operation about any errors. While it is likely
that the GHCB will be mapped in this situation, add a safe guard
in this path to be certain a NULL pointer dereference is not encountered.

Fixes: f1c6366e30 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Fixes: 647daca25d ("KVM: SVM: Add support for booting APs in an SEV-ES guest")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Message-Id: <a5d3ebb600a91170fc88599d5a575452b3e31036.1617979121.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 18:04:47 -04:00
Wanpeng Li a1fa4cbd53 KVM: X86: Do not yield to self
If the target is self we do not need to yield, we can avoid malicious
guest to play this.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-3-git-send-email-wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 18:04:46 -04:00
Wanpeng Li 4a7132efff KVM: X86: Count attempted/successful directed yield
To analyze some performance issues with lock contention and scheduling,
it is nice to know when directed yield are successful or failing.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 18:04:46 -04:00
Wanpeng Li 2b519b5797 x86/kvm: Don't bother __pv_cpu_mask when !CONFIG_SMP
Enable PV TLB shootdown when !CONFIG_SMP doesn't make sense. Let's
move it inside CONFIG_SMP. In addition, we can avoid define and
alloc __pv_cpu_mask when !CONFIG_SMP and get rid of 'alloc' variable
in kvm_alloc_cpumask.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 18:04:45 -04:00
Ben Gardon 4c6654bd16 KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns
To avoid saddling a vCPU thread with the work of tearing down an entire
paging structure, take a reference on each root before they become
obsolete, so that the thread initiating the fast invalidation can tear
down the paging structure and (most likely) release the last reference.
As a bonus, this teardown can happen under the MMU lock in read mode so
as not to block the progress of vCPU threads.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-14-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 18:04:45 -04:00
Ben Gardon b7cccd397f KVM: x86/mmu: Fast invalidation for TDP MMU
Provide a real mechanism for fast invalidation by marking roots as
invalid so that their reference count will quickly fall to zero
and they will be torn down.

One negative side affect of this approach is that a vCPU thread will
likely drop the last reference to a root and be saddled with the work of
tearing down an entire paging structure. This issue will be resolved in
a later commit.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-13-bgardon@google.com>
[Move the loop to tdp_mmu.c, otherwise compilation fails on 32-bit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 18:04:35 -04:00
Zhang Rui 6a5f438679 perf/x86/rapl: Add support for Intel Alder Lake
Alder Lake RAPL support is the same as previous Sky Lake.
Add Alder Lake model for RAPL.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-26-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:30 +02:00
Kan Liang d0ca946bcf perf/x86/cstate: Add Alder Lake CPU support
Compared with the Rocket Lake, the CORE C1 Residency Counter is added
for Alder Lake, but the CORE C3 Residency Counter is removed. Other
counters are the same.

Create a new adl_cstates for Alder Lake. Update the comments
accordingly.

The External Design Specification (EDS) is not published yet. It comes
from an authoritative internal source.

The patch has been tested on real hardware.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-25-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:29 +02:00
Kan Liang 19d3a81fd9 perf/x86/msr: Add Alder Lake CPU support
PPERF and SMI_COUNT MSRs are also supported on Alder Lake.

The External Design Specification (EDS) is not published yet. It comes
from an authoritative internal source.

The patch has been tested on real hardware.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-24-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:29 +02:00
Kan Liang 772ed05f3c perf/x86/intel/uncore: Add Alder Lake support
The uncore subsystem for Alder Lake is similar to the previous Tiger
Lake.

The difference includes:
- New MSR addresses for global control, fixed counters, CBOX and ARB.
  Add a new adl_uncore_msr_ops for uncore operations.
- Add a new threshold field for CBOX.
- New PCIIDs for IMC devices.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-23-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:29 +02:00
Kan Liang 55bcf6ef31 perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
Current Hardware events and Hardware cache events have special perf
types, PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE. The two types don't
pass the PMU type in the user interface. For a hybrid system, the perf
subsystem doesn't know which PMU the events belong to. The first capable
PMU will always be assigned to the events. The events never get a chance
to run on the other capable PMUs.

Extend the two types to become PMU aware types. The PMU type ID is
stored at attr.config[63:32].

Add a new PMU capability, PERF_PMU_CAP_EXTENDED_HW_TYPE, to indicate a
PMU which supports the extended PERF_TYPE_HARDWARE and
PERF_TYPE_HW_CACHE.

The PMU type is only required when searching a specific PMU. The PMU
specific codes will only be interested in the 'real' config value, which
is stored in the low 32 bit of the event->attr.config. Update the
event->attr.config in the generic code, so the PMU specific codes don't
need to calculate it separately.

If a user specifies a PMU type, but the PMU doesn't support the extended
type, error out.

If an event cannot be initialized in a PMU specified by a user, error
out immediately. Perf should not try to open it on other PMUs.

The new PMU capability is only set for the X86 hybrid PMUs for now.
Other architectures, e.g., ARM, may need it as well. The support on ARM
may be implemented later separately.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618237865-33448-22-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:29 +02:00
Kan Liang f83d2f91d2 perf/x86/intel: Add Alder Lake Hybrid support
Alder Lake Hybrid system has two different types of core, Golden Cove
core and Gracemont core. The Golden Cove core is registered to
"cpu_core" PMU. The Gracemont core is registered to "cpu_atom" PMU.

The difference between the two PMUs include:
- Number of GP and fixed counters
- Events
- The "cpu_core" PMU supports Topdown metrics.
  The "cpu_atom" PMU supports PEBS-via-PT.

The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without
PMEM.
The "cpu_atom" PMU is similar to Tremont, but with different events,
event_constraints, extra_regs and number of counters.

The mem-loads AUX event workaround only applies to the Golden Cove core.

Users may disable all CPUs of the same CPU type on the command line or
in the BIOS. For this case, perf still register a PMU for the CPU type
but the CPU mask is 0.

Current caps/pmu_name is usually the microarch codename. Assign the
"alderlake_hybrid" to the caps/pmu_name of both PMUs to indicate the
hybrid Alder Lake microarchitecture.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-21-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:28 +02:00
Kan Liang 3e9a8b219e perf/x86: Support filter_match callback
Implement filter_match callback for X86, which check whether an event is
schedulable on the current CPU.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-20-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:28 +02:00
Kan Liang 58ae30c29a perf/x86/intel: Add attr_update for Hybrid PMUs
The attribute_group for Hybrid PMUs should be different from the
previous
cpu PMU. For example, cpumask is required for a Hybrid PMU. The PMU type
should be included in the event and format attribute.

Add hybrid_attr_update for the Hybrid PMU.
Check the PMU type in is_visible() function. Only display the event or
format for the matched Hybrid PMU.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-19-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:28 +02:00
Kan Liang a9c81ccdf5 perf/x86: Add structures for the attributes of Hybrid PMUs
Hybrid PMUs have different events and formats. In theory, Hybrid PMU
specific attributes should be maintained in the dedicated struct
x86_hybrid_pmu, but it wastes space because the events and formats are
similar among Hybrid PMUs.

To reduce duplication, all hybrid PMUs will share a group of attributes
in the following patch. To distinguish an attribute from different
Hybrid PMUs, a PMU aware attribute structure is introduced. A PMU type
is required for the attribute structure. The type is internal usage. It
is not visible in the sysfs API.

Hybrid PMUs may support the same event name, but with different event
encoding, e.g., the mem-loads event on an Atom PMU has different event
encoding from a Core PMU. It brings issue if two attributes are
created for them. Current sysfs_update_group finds an attribute by
searching the attr name (aka event name). If two attributes have the
same event name, the first attribute will be replaced.
To address the issue, only one attribute is created for the event. The
event_str is extended and stores event encodings from all Hybrid PMUs.
Each event encoding is divided by ";". The order of the event encodings
must follow the order of the hybrid PMU index. The event_str is internal
usage as well. When a user wants to show the attribute of a Hybrid PMU,
only the corresponding part of the string is displayed.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-18-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:28 +02:00
Kan Liang d9977c43bf perf/x86: Register hybrid PMUs
Different hybrid PMUs have different PMU capabilities and events. Perf
should registers a dedicated PMU for each of them.

To check the X86 event, perf has to go through all possible hybrid pmus.

All the hybrid PMUs are registered at boot time. Before the
registration, add intel_pmu_check_hybrid_pmus() to check and update the
counters information, the event constraints, the extra registers and the
unique capabilities for each hybrid PMUs.

Postpone the display of the PMU information and HW check to
CPU_STARTING, because the boot CPU is the only online CPU in the
init_hw_perf_events(). Perf doesn't know the availability of the other
PMUs. Perf should display the PMU information only if the counters of
the PMU are available.

One type of CPUs may be all offline. For this case, users can still
observe the PMU in /sys/devices, but its CPU mask is 0.

All hybrid PMUs have capability PERF_PMU_CAP_HETEROGENEOUS_CPUS.
The PMU name for hybrid PMUs will be "cpu_XXX", which will be assigned
later in a separated patch.

The PMU type id for the core PMU is still PERF_TYPE_RAW. For the other
hybrid PMUs, the PMU type id is not hard code.

The event->cpu must be compatitable with the supported CPUs of the PMU.
Add a check in the x86_pmu_event_init().

The events in a group must be from the same type of hybrid PMU.
The fake cpuc used in the validation must be from the supported CPU of
the event->pmu.

Perf may not retrieve a valid core type from get_this_hybrid_cpu_type().
For example, ADL may have an alternative configuration. With that
configuration, Perf cannot retrieve the core type from the CPUID leaf
0x1a. Add a platform specific get_hybrid_cpu_type(). If the generic way
fails, invoke the platform specific get_hybrid_cpu_type().

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618237865-33448-17-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:27 +02:00
Kan Liang e11c1a7eb3 perf/x86: Factor out x86_pmu_show_pmu_cap
The PMU capabilities are different among hybrid PMUs. Perf should dump
the PMU capabilities information for each hybrid PMU.

Factor out x86_pmu_show_pmu_cap() which shows the PMU capabilities
information. The function will be reused later when registering a
dedicated hybrid PMU.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-16-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:27 +02:00
Kan Liang b98567298b perf/x86: Remove temporary pmu assignment in event_init
The temporary pmu assignment in event_init is unnecessary.

The assignment was introduced by commit 8113070d66 ("perf_events:
Add fast-path to the rescheduling code"). At that time, event->pmu is
not assigned yet when initializing an event. The assignment is required.
However, from commit 7e5b2a01d2 ("perf: provide PMU when initing
events"), the event->pmu is provided before event_init is invoked.
The temporary pmu assignment in event_init should be removed.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-15-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:27 +02:00
Kan Liang 34d5b61f29 perf/x86/intel: Factor out intel_pmu_check_extra_regs
Each Hybrid PMU has to check and update its own extra registers before
registration.

The intel_pmu_check_extra_regs will be reused later to check the extra
registers of each hybrid PMU.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-14-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:26 +02:00
Kan Liang bc14fe1bee perf/x86/intel: Factor out intel_pmu_check_event_constraints
Each Hybrid PMU has to check and update its own event constraints before
registration.

The intel_pmu_check_event_constraints will be reused later to check
the event constraints of each hybrid PMU.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-13-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:26 +02:00
Kan Liang b8c4d1a876 perf/x86/intel: Factor out intel_pmu_check_num_counters
Each Hybrid PMU has to check its own number of counters and mask fixed
counters before registration.

The intel_pmu_check_num_counters will be reused later to check the
number of the counters for each hybrid PMU.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-12-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:26 +02:00
Kan Liang 183af7366b perf/x86: Hybrid PMU support for extra_regs
Different hybrid PMU may have different extra registers, e.g. Core PMU
may have offcore registers, frontend register and ldlat register. Atom
core may only have offcore registers and ldlat register. Each hybrid PMU
should use its own extra_regs.

An Intel Hybrid system should always have extra registers.
Unconditionally allocate shared_regs for Intel Hybrid system.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-11-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:26 +02:00
Kan Liang 24ee38ffe6 perf/x86: Hybrid PMU support for event constraints
The events are different among hybrid PMUs. Each hybrid PMU should use
its own event constraints.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-10-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:25 +02:00
Kan Liang 0d18f2dfea perf/x86: Hybrid PMU support for hardware cache event
The hardware cache events are different among hybrid PMUs. Each hybrid
PMU should have its own hw cache event table.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618237865-33448-9-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:25 +02:00
Kan Liang eaacf07d11 perf/x86: Hybrid PMU support for unconstrained
The unconstrained value depends on the number of GP and fixed counters.
Each hybrid PMU should use its own unconstrained.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618237865-33448-8-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:25 +02:00
Kan Liang d4b294bf84 perf/x86: Hybrid PMU support for counters
The number of GP and fixed counters are different among hybrid PMUs.
Each hybrid PMU should use its own counter related information.

When handling a certain hybrid PMU, apply the number of counters from
the corresponding hybrid PMU.

When reserving the counters in the initialization of a new event,
reserve all possible counters.

The number of counter recored in the global x86_pmu is for the
architecture counters which are available for all hybrid PMUs. KVM
doesn't support the hybrid PMU yet. Return the number of the
architecture counters for now.

For the functions only available for the old platforms, e.g.,
intel_pmu_drain_pebs_nhm(), nothing is changed.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-7-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:25 +02:00
Kan Liang fc4b8fca2d perf/x86: Hybrid PMU support for intel_ctrl
The intel_ctrl is the counter mask of a PMU. The PMU counter information
may be different among hybrid PMUs, each hybrid PMU should use its own
intel_ctrl to check and access the counters.

When handling a certain hybrid PMU, apply the intel_ctrl from the
corresponding hybrid PMU.

When checking the HW existence, apply the PMU and number of counters
from the corresponding hybrid PMU as well. Perf will check the HW
existence for each Hybrid PMU before registration. Expose the
check_hw_exists() for a later patch.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/1618237865-33448-6-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:24 +02:00
Kan Liang d0946a882e perf/x86/intel: Hybrid PMU support for perf capabilities
Some platforms, e.g. Alder Lake, have hybrid architecture. Although most
PMU capabilities are the same, there are still some unique PMU
capabilities for different hybrid PMUs. Perf should register a dedicated
pmu for each hybrid PMU.

Add a new struct x86_hybrid_pmu, which saves the dedicated pmu and
capabilities for each hybrid PMU.

The architecture MSR, MSR_IA32_PERF_CAPABILITIES, only indicates the
architecture features which are available on all hybrid PMUs. The
architecture features are stored in the global x86_pmu.intel_cap.

For Alder Lake, the model-specific features are perf metrics and
PEBS-via-PT. The corresponding bits of the global x86_pmu.intel_cap
should be 0 for these two features. Perf should not use the global
intel_cap to check the features on a hybrid system.
Add a dedicated intel_cap in the x86_hybrid_pmu to store the
model-specific capabilities. Use the dedicated intel_cap to replace
the global intel_cap for thse two features. The dedicated intel_cap
will be set in the following "Add Alder Lake Hybrid support" patch.

Add is_hybrid() to distinguish a hybrid system. ADL may have an
alternative configuration. With that configuration, the
X86_FEATURE_HYBRID_CPU is not set. Perf cannot rely on the feature bit.
Add a new static_key_false, perf_is_hybrid, to indicate a hybrid system.
It will be assigned in the following "Add Alder Lake Hybrid support"
patch as well.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618237865-33448-5-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:24 +02:00
Kan Liang 61e76d53c3 perf/x86: Track pmu in per-CPU cpu_hw_events
Some platforms, e.g. Alder Lake, have hybrid architecture. In the same
package, there may be more than one type of CPU. The PMU capabilities
are different among different types of CPU. Perf will register a
dedicated PMU for each type of CPU.

Add a 'pmu' variable in the struct cpu_hw_events to track the dedicated
PMU of the current CPU.

Current x86_get_pmu() use the global 'pmu', which will be broken on a
hybrid platform. Modify it to apply the 'pmu' of the specific CPU.

Initialize the per-CPU 'pmu' variable with the global 'pmu'. There is
nothing changed for the non-hybrid platforms.

The is_x86_event() will be updated in the later patch ("perf/x86:
Register hybrid PMUs") for hybrid platforms. For the non-hybrid
platforms, nothing is changed here.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618237865-33448-4-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:24 +02:00
Ricardo Neri 250b3c0d79 x86/cpu: Add helper function to get the type of the current hybrid CPU
On processors with Intel Hybrid Technology (i.e., one having more than
one type of CPU in the same package), all CPUs support the same
instruction set and enumerate the same features on CPUID. Thus, all
software can run on any CPU without restrictions. However, there may be
model-specific differences among types of CPUs. For instance, each type
of CPU may support a different number of performance counters. Also,
machine check error banks may be wired differently. Even though most
software will not care about these differences, kernel subsystems
dealing with these differences must know.

Add and expose a new helper function get_this_hybrid_cpu_type() to query
the type of the current hybrid CPU. The function will be used later in
the perf subsystem.

The Intel Software Developer's Manual defines the CPU type as 8-bit
identifier.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1618237865-33448-3-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:23 +02:00
Ricardo Neri a161545ab5 x86/cpufeatures: Enumerate Intel Hybrid Technology feature bit
Add feature enumeration to identify a processor with Intel Hybrid
Technology: one in which CPUs of more than one type are the same package.
On a hybrid processor, all CPUs support the same homogeneous (i.e.,
symmetric) instruction set. All CPUs enumerate the same features in CPUID.
Thus, software (user space and kernel) can run and migrate to any CPU in
the system as well as utilize any of the enumerated features without any
change or special provisions. The main difference among CPUs in a hybrid
processor are power and performance properties.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1618237865-33448-2-git-send-email-kan.liang@linux.intel.com
2021-04-19 20:03:23 +02:00
Josh Poimboeuf 7d3d10e0e8 x86/crypto: Enable objtool in crypto code
Now that all the stack alignment prologues have been cleaned up in the
crypto code, enable objtool.  Among other benefits, this will allow ORC
unwinding to work.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/fc2a1918c50e33e46ef0e9a5de02743f2f6e3639.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:37 -05:00
Josh Poimboeuf 27d26793f2 x86/crypto/sha512-ssse3: Standardize stack alignment prologue
Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/6ecaaac9f3828fbb903513bf90c34a08380a8e35.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:37 -05:00
Josh Poimboeuf ec063e090b x86/crypto/sha512-avx2: Standardize stack alignment prologue
Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/b1a7b29fcfc65d60a3b6e77ef75f4762a5b8488d.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:36 -05:00
Josh Poimboeuf d61684b56e x86/crypto/sha512-avx: Standardize stack alignment prologue
Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/d36e9ea1c819d87fa89b3df3fa83e2a1ede18146.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:36 -05:00
Josh Poimboeuf ce58466680 x86/crypto/sha256-avx2: Standardize stack alignment prologue
Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/8048e7444c49a8137f05265262b83dc50f8fb7f3.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:36 -05:00
Josh Poimboeuf 20114c899c x86/crypto/sha1_avx2: Standardize stack alignment prologue
Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/fdaaf8670ed1f52f55ba9a6bbac98c1afddc1af6.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:35 -05:00
Josh Poimboeuf 35a0067d2c x86/crypto/sha_ni: Standardize stack alignment prologue
Use a more standard prologue for saving the stack pointer before
realigning the stack.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/5033e1a79867dff1b18e1b4d0783c38897d3f223.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:35 -05:00
Josh Poimboeuf 2b02ed5548 x86/crypto/crc32c-pcl-intel: Standardize jump table
Simplify the jump table code so that it resembles a compiler-generated
table.

This enables ORC unwinding by allowing objtool to follow all the
potential code paths.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/5357a039def90b8ef6b5874ef12cda008ecf18ba.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:34 -05:00
Josh Poimboeuf dabe5167a3 x86/crypto/camellia-aesni-avx2: Unconditionally allocate stack buffer
A conditional stack allocation violates traditional unwinding
requirements when a single instruction can have differing stack layouts.

There's no benefit in allocating the stack buffer conditionally.  Just
do it unconditionally.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/85ac96613ee5784b6239c18d3f68b1f3c509caa3.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:34 -05:00
Josh Poimboeuf e163be86ff x86/crypto/aesni-intel_avx: Standardize stack alignment prologue
Use RBP instead of R14 for saving the old stack pointer before
realignment.  This resembles what compilers normally do.

This enables ORC unwinding by allowing objtool to understand the stack
realignment.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/02d00a0903a0959f4787e186e2a07d271e1f63d4.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:34 -05:00
Josh Poimboeuf ff5796b6db x86/crypto/aesni-intel_avx: Fix register usage comments
Fix register usage comments to match reality.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/8655d4513a0ed1eddec609165064153973010aa2.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:33 -05:00
Josh Poimboeuf 4f08300916 x86/crypto/aesni-intel_avx: Remove unused macros
These macros are no longer used; remove them.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/53f7136ea93ebdbca399959e6d2991ecb46e733e.1614182415.git.jpoimboe@redhat.com
2021-04-19 12:36:33 -05:00
Ben Gardon 24ae4cfaaa KVM: x86/mmu: Allow enabling/disabling dirty logging under MMU read lock
To reduce lock contention and interference with page fault handlers,
allow the TDP MMU functions which enable and disable dirty logging
to operate under the MMU read lock.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-12-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:06:04 -04:00
Ben Gardon 2db6f772b5 KVM: x86/mmu: Allow zapping collapsible SPTEs to use MMU read lock
To reduce the impact of disabling dirty logging, change the TDP MMU
function which zaps collapsible SPTEs to run under the MMU read lock.
This way, page faults on zapped SPTEs can proceed in parallel with
kvm_mmu_zap_collapsible_sptes.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-11-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:06:04 -04:00
Ben Gardon 6103bc0740 KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock
To reduce lock contention and interference with page fault handlers,
allow the TDP MMU function to zap a GFN range to operate under the MMU
read lock.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-10-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:06:04 -04:00
Ben Gardon c0e64238ac KVM: x86/mmu: Protect the tdp_mmu_roots list with RCU
Protect the contents of the TDP MMU roots list with RCU in preparation
for a future patch which will allow the iterator macro to be used under
the MMU lock in read mode.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:06:01 -04:00
Ben Gardon fb10129335 KVM: x86/mmu: handle cmpxchg failure in kvm_tdp_mmu_get_root
To reduce dependence on the MMU write lock, don't rely on the assumption
that the atomic operation in kvm_tdp_mmu_get_root will always succeed.
By not relying on that assumption, threads do not need to hold the MMU
lock in write mode in order to take a reference on a TDP MMU root.

In the root iterator, this change means that some roots might have to be
skipped if they are found to have a zero refcount. This will still never
happen as of this patch, but a future patch will need that flexibility to
make the root iterator safe under the MMU read lock.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-8-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:05:25 -04:00
Ben Gardon 11cccf5c04 KVM: x86/mmu: Make TDP MMU root refcount atomic
In order to parallelize more operations for the TDP MMU, make the
refcount on TDP MMU roots atomic, so that a future patch can allow
multiple threads to take a reference on the root concurrently, while
holding the MMU lock in read mode.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-7-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:05:25 -04:00
Ben Gardon cfc109979b KVM: x86/mmu: Refactor yield safe root iterator
Refactor the yield safe TDP MMU root iterator to be more amenable to
changes in future commits which will allow it to be used under the MMU
lock in read mode. Currently the iterator requires a complicated dance
between the helper functions and different parts of the for loop which
makes it hard to reason about. Moving all the logic into a single function
simplifies the iterator substantially.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-6-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:05:24 -04:00
Ben Gardon 2bdb3d84ce KVM: x86/mmu: Merge TDP MMU put and free root
kvm_tdp_mmu_put_root and kvm_tdp_mmu_free_root are always called
together, so merge the functions to simplify TDP MMU root refcounting /
freeing.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-5-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:05:24 -04:00
Ben Gardon 4bba36d72b KVM: x86/mmu: use tdp_mmu_free_sp to free roots
Minor cleanup to deduplicate the code used to free a struct kvm_mmu_page
in the TDP MMU.

No functional change intended.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-4-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:05:24 -04:00
Ben Gardon 76eb54e7e7 KVM: x86/mmu: Move kvm_mmu_(get|put)_root to TDP MMU
The TDP MMU is almost the only user of kvm_mmu_get_root and
kvm_mmu_put_root. There is only one use of put_root in mmu.c for the
legacy / shadow MMU. Open code that one use and move the get / put
functions to the TDP MMU so they can be extended in future commits.

No functional change intended.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-3-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:05:24 -04:00
Ben Gardon 8ca6f063b7 KVM: x86/mmu: Re-add const qualifier in kvm_tdp_mmu_zap_collapsible_sptes
kvm_tdp_mmu_zap_collapsible_sptes unnecessarily removes the const
qualifier from its memlsot argument, leading to a compiler warning. Add
the const annotation and pass it to subsequent functions.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-2-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:05:23 -04:00
Sean Christopherson e1eed5847b KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible
Let the TDP MMU yield when unmapping a range in response to a MMU
notification, if yielding is allowed by said notification.  There is no
reason to disallow yielding in this case, and in theory the range being
invalidated could be quite large.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-19 09:05:23 -04:00
Maciej W. Rozycki 0ef3439cd8 x86/build: Disable HIGHMEM64G selection for M486SX
Fix a regression caused by making the 486SX separately selectable in
Kconfig, for which the HIGHMEM64G setting has not been updated and
therefore has become exposed as a user-selectable option for the M486SX
configuration setting unlike with original M486 and all the other
settings that choose non-PAE-enabled processors:

  High Memory Support
  > 1. off (NOHIGHMEM)
    2. 4GB (HIGHMEM4G)
    3. 64GB (HIGHMEM64G)
  choice[1-3?]:

With the fix in place the setting is now correctly removed:

  High Memory Support
  > 1. off (NOHIGHMEM)
    2. 4GB (HIGHMEM4G)
  choice[1-2?]:

 [ bp: Massage commit message. ]

Fixes: 87d6021b81 ("x86/math-emu: Limit MATH_EMULATION to 486SX compatibles")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org # v5.5+
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.2104141221340.44318@angie.orcam.me.uk
2021-04-19 14:02:12 +02:00
Jakub Kicinski 8203c7ce4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
 - keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
 - fix build after move to net_generic

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-17 11:08:07 -07:00
Sean Christopherson b4c5936c47 KVM: Kill off the old hva-based MMU notifier callbacks
Yank out the hva-based MMU notifier APIs now that all architectures that
use the notifiers have moved to the gfn-based APIs.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:08 -04:00
Sean Christopherson 3039bcc744 KVM: Move x86's MMU notifier memslot walkers to generic code
Move the hva->gfn lookup for MMU notifiers into common code.  Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.

In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.

The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.

Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.

Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.

MIPS, PPC, and arm64 will be converted one at a time in future patches.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:06 -04:00
Paolo Bonzini 6c9dd6d262 KVM: constify kvm_arch_flush_remote_tlbs_memslot
memslots are stored in RCU and there should be no need to
change them.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:04 -04:00
Paolo Bonzini dbb6964e4c KVM: MMU: protect TDP MMU pages only down to required level
When using manual protection of dirty pages, it is not necessary
to protect nested page tables down to the 4K level; instead KVM
can protect only hugepages in order to split them lazily, and
delay write protection at 4K-granularity until KVM_CLEAR_DIRTY_LOG.
This was overlooked in the TDP MMU, so do it there as well.

Fixes: a6a0b05da9 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Cc: Ben Gardon <bgardon@google.com>
Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:04 -04:00
Maxim Levitsky 7e582ccbbd KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2
Store the supported bits into KVM_GUESTDBG_VALID_MASK
macro, similar to how arm does this.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401135451.1004564-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:02 -04:00
Maxim Levitsky 4020da3b9f KVM: x86: pending exceptions must not be blocked by an injected event
Injected interrupts/nmi should not block a pending exception,
but rather be either lost if nested hypervisor doesn't
intercept the pending exception (as in stock x86), or be delivered
in exitintinfo/IDT_VECTORING_INFO field, as a part of a VMexit
that corresponds to the pending exception.

The only reason for an exception to be blocked is when nested run
is pending (and that can't really happen currently
but still worth checking for).

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401143817.1030695-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:02 -04:00
Maxim Levitsky 232f75d3b4 KVM: nSVM: call nested_svm_load_cr3 on nested state load
While KVM's MMU should be fully reset by loading of nested CR0/CR3/CR4
by KVM_SET_SREGS, we are not in nested mode yet when we do it and therefore
only root_mmu is reset.

On regular nested entries we call nested_svm_load_cr3 which both updates
the guest's CR3 in the MMU when it is needed, and it also initializes
the mmu again which makes it initialize the walk_mmu as well when nested
paging is enabled in both host and guest.

Since we don't call nested_svm_load_cr3 on nested state load,
the walk_mmu can be left uninitialized, which can lead to a NULL pointer
dereference while accessing it if we happen to get a nested page fault
right after entering the nested guest first time after the migration and
we decide to emulate it, which leads to the emulator trying to access
walk_mmu->gva_to_gpa which is NULL.

Therefore we should call this function on nested state load as well.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401141814.1029036-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:01 -04:00
David Edmondson 8486039a6c KVM: x86: dump_vmcs should include the autoload/autostore MSR lists
When dumping the current VMCS state, include the MSRs that are being
automatically loaded/stored during VM entry/exit.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-6-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:01 -04:00
David Edmondson 0702a3cbbf KVM: x86: dump_vmcs should show the effective EFER
If EFER is not being loaded from the VMCS, show the effective value by
reference to the MSR autoload list or calculation.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-5-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:00 -04:00
David Edmondson 5518da62d4 KVM: x86: dump_vmcs should consider only the load controls of EFER/PAT
When deciding whether to dump the GUEST_IA32_EFER and GUEST_IA32_PAT
fields of the VMCS, examine only the VM entry load controls, as saving
on VM exit has no effect on whether VM entry succeeds or fails.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-4-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:00 -04:00
David Edmondson 699e1b2e55 KVM: x86: dump_vmcs should not conflate EFER and PAT presence in VMCS
Show EFER and PAT based on their individual entry/exit controls.

Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-3-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:59 -04:00
David Edmondson d9e46d344e KVM: x86: dump_vmcs should not assume GUEST_IA32_EFER is valid
If the VM entry/exit controls for loading/saving MSR_EFER are either
not available (an older processor or explicitly disabled) or not
used (host and guest values are the same), reading GUEST_IA32_EFER
from the VMCS returns an inaccurate value.

Because of this, in dump_vmcs() don't use GUEST_IA32_EFER to decide
whether to print the PDPTRs - always do so if the fields exist.

Fixes: 4eb64dce8d ("KVM: x86: dump VMCS on invalid entry")
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-2-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:59 -04:00
Maxim Levitsky adc2a23734 KVM: nSVM: improve SYSENTER emulation on AMD
Currently to support Intel->AMD migration, if CPU vendor is GenuineIntel,
we emulate the full 64 value for MSR_IA32_SYSENTER_{EIP|ESP}
msrs, and we also emulate the sysenter/sysexit instruction in long mode.

(Emulator does still refuse to emulate sysenter in 64 bit mode, on the
ground that the code for that wasn't tested and likely has no users)

However when virtual vmload/vmsave is enabled, the vmload instruction will
update these 32 bit msrs without triggering their msr intercept,
which will lead to having stale values in kvm's shadow copy of these msrs,
which relies on the intercept to be up to date.

Fix/optimize this by doing the following:

1. Enable the MSR intercepts for SYSENTER MSRs iff vendor=GenuineIntel
   (This is both a tiny optimization and also ensures that in case
   the guest cpu vendor is AMD, the msrs will be 32 bit wide as
   AMD defined).

2. Store only high 32 bit part of these msrs on interception and combine
   it with hardware msr value on intercepted read/writes
   iff vendor=GenuineIntel.

3. Disable vmload/vmsave virtualization if vendor=GenuineIntel.
   (It is somewhat insane to set vendor=GenuineIntel and still enable
   SVM for the guest but well whatever).
   Then zero the high 32 bit parts when kvm intercepts and emulates vmload.

Thanks a lot to Paulo Bonzini for helping me with fixing this in the most
correct way.

This patch fixes nested migration of 32 bit nested guests, that was
broken because incorrect cached values of SYSENTER msrs were stored in
the migration stream if L1 changed these msrs with
vmload prior to L2 entry.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401111928.996871-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:59 -04:00
Maxim Levitsky c1df4aac44 KVM: x86: add guest_cpuid_is_intel
This is similar to existing 'guest_cpuid_is_amd_or_hygon'

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401111928.996871-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:58 -04:00
Sean Christopherson eba04b20e4 KVM: x86: Account a variety of miscellaneous allocations
Switch to GFP_KERNEL_ACCOUNT for a handful of allocations that are
clearly associated with a single task/VM.

Note, there are a several SEV allocations that aren't accounted, but
those can (hopefully) be fixed by using the local stack for memory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331023025.2485960-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:58 -04:00
Sean Christopherson 8727906fde KVM: SVM: Do not allow SEV/SEV-ES initialization after vCPUs are created
Reject KVM_SEV_INIT and KVM_SEV_ES_INIT if they are attempted after one
or more vCPUs have been created.  KVM assumes a VM is tagged SEV/SEV-ES
prior to vCPU creation, e.g. init_vmcb() needs to mark the VMCB as SEV
enabled, and svm_create_vcpu() needs to allocate the VMSA.  At best,
creating vCPUs before SEV/SEV-ES init will lead to unexpected errors
and/or behavior, and at worst it will crash the host, e.g.
sev_launch_update_vmsa() will dereference a null svm->vmsa pointer.

Fixes: 1654efcbc4 ("KVM: SVM: Add KVM_SEV_INIT command")
Fixes: ad73109ae7 ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331031936.2495277-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:58 -04:00
Sean Christopherson 9fa1521daa KVM: SVM: Do not set sev->es_active until KVM_SEV_ES_INIT completes
Set sev->es_active only after the guts of KVM_SEV_ES_INIT succeeds.  If
the command fails, e.g. because SEV is already active or there are no
available ASIDs, then es_active will be left set even though the VM is
not fully SEV-ES capable.

Refactor the code so that "es_active" is passed on the stack instead of
being prematurely shoved into sev_info, both to avoid having to unwind
sev_info and so that it's more obvious what actually consumes es_active
in sev_guest_init() and its helpers.

Fixes: ad73109ae7 ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331031936.2495277-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:57 -04:00
Sean Christopherson c36b16d29f KVM: SVM: Use online_vcpus, not created_vcpus, to iterate over vCPUs
Use the kvm_for_each_vcpu() helper to iterate over vCPUs when encrypting
VMSAs for SEV, which effectively switches to use online_vcpus instead of
created_vcpus.  This fixes a possible null-pointer dereference as
created_vcpus does not guarantee a vCPU exists, since it is updated at
the very beginning of KVM_CREATE_VCPU.  created_vcpus exists to allow the
bulk of vCPU creation to run in parallel, while still correctly
restricting the max number of max vCPUs.

Fixes: ad73109ae7 ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331031936.2495277-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:57 -04:00
Sean Christopherson 8f8f52a45d KVM: x86/mmu: Simplify code for aging SPTEs in TDP MMU
Use a basic NOT+AND sequence to clear the Accessed bit in TDP MMU SPTEs,
as opposed to the fancy ffs()+clear_bit() logic that was copied from the
legacy MMU.  The legacy MMU uses clear_bit() because it is operating on
the SPTE itself, i.e. clearing needs to be atomic.  The TDP MMU operates
on a local variable that it later writes to the SPTE, and so doesn't need
to be atomic or even resident in memory.

Opportunistically drop unnecessary initialization of new_spte, it's
guaranteed to be written before being accessed.

Using NOT+AND instead of ffs()+clear_bit() reduces the sequence from:

   0x0000000000058be6 <+134>:	test   %rax,%rax
   0x0000000000058be9 <+137>:	je     0x58bf4 <age_gfn_range+148>
   0x0000000000058beb <+139>:	test   %rax,%rdi
   0x0000000000058bee <+142>:	je     0x58cdc <age_gfn_range+380>
   0x0000000000058bf4 <+148>:	mov    %rdi,0x8(%rsp)
   0x0000000000058bf9 <+153>:	mov    $0xffffffff,%edx
   0x0000000000058bfe <+158>:	bsf    %eax,%edx
   0x0000000000058c01 <+161>:	movslq %edx,%rdx
   0x0000000000058c04 <+164>:	lock btr %rdx,0x8(%rsp)
   0x0000000000058c0b <+171>:	mov    0x8(%rsp),%r15

to:

   0x0000000000058bdd <+125>:	test   %rax,%rax
   0x0000000000058be0 <+128>:	je     0x58beb <age_gfn_range+139>
   0x0000000000058be2 <+130>:	test   %rax,%r8
   0x0000000000058be5 <+133>:	je     0x58cc0 <age_gfn_range+352>
   0x0000000000058beb <+139>:	not    %rax
   0x0000000000058bee <+142>:	and    %r8,%rax
   0x0000000000058bf1 <+145>:	mov    %rax,%r15

thus eliminating several memory accesses, including a locked access.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331004942.2444916-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:57 -04:00
Sean Christopherson 6d9aafb96d KVM: x86/mmu: Remove spurious clearing of dirty bit from TDP MMU SPTE
Don't clear the dirty bit when aging a TDP MMU SPTE (in response to a MMU
notifier event).  Prematurely clearing the dirty bit could cause spurious
PML updates if aging a page happened to coincide with dirty logging.

Note, tdp_mmu_set_spte_no_acc_track() flows into __handle_changed_spte(),
so the host PFN will be marked dirty, i.e. there is no potential for data
corruption.

Fixes: a6a0b05da9 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331004942.2444916-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:56 -04:00
Sean Christopherson 6dfbd6b5d5 KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint
Remove x86's trace_kvm_age_page() tracepoint.  It's mostly redundant with
the common trace_kvm_age_hva() tracepoint, and if there is a need for the
extra details, e.g. gfn, referenced, etc... those details should be added
to the common tracepoint so that all architectures and MMUs benefit from
the info.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:56 -04:00
Sean Christopherson 5f7c292b89 KVM: Move prototypes for MMU notifier callbacks to generic code
Move the prototypes for the MMU notifier callbacks out of arch code and
into common code.  There is no benefit to having each arch replicate the
prototypes since any deviation from the invocation in common code will
explode.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:55 -04:00
Sean Christopherson aaaac889cf KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing SPTE
Use the leaf-only TDP iterator when changing the SPTE in reaction to a
MMU notifier.  Practically speaking, this is a nop since the guts of the
loop explicitly looks for 4k SPTEs, which are always leaf SPTEs.  Switch
the iterator to match age_gfn_range() and test_age_gfn() so that a future
patch can consolidate the core iterating logic.

No real functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:55 -04:00
Sean Christopherson a3f15bda46 KVM: x86/mmu: Pass address space ID to TDP MMU root walkers
Move the address space ID check that is performed when iterating over
roots into the macro helpers to consolidate code.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:55 -04:00
Sean Christopherson 2b9663d8a1 KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range()
Pass the address space ID to TDP MMU's primary "zap gfn range" helper to
allow the MMU notifier paths to iterate over memslots exactly once.
Currently, both the legacy MMU and TDP MMU iterate over memslots when
looking for an overlapping hva range, which can be quite costly if there
are a large number of memslots.

Add a "flush" parameter so that iterating over multiple address spaces
in the caller will continue to do the right thing when yielding while a
flush is pending from a previous address space.

Note, this also has a functional change in the form of coalescing TLB
flushes across multiple address spaces in kvm_zap_gfn_range(), and also
optimizes the TDP MMU to utilize range-based flushing when running as L1
with Hyper-V enlightenments.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-6-seanjc@google.com>
[Keep separate for loops to prepare for other incoming patches. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:54 -04:00
Sean Christopherson 1a61b7db7a KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range zap
Gather pending TLB flushes across both address spaces when zapping a
given gfn range.  This requires feeding "flush" back into subsequent
calls, but on the plus side sets the stage for further batching
between the legacy MMU and TDP MMU.  It also allows refactoring the
address space iteration to cover the legacy and TDP MMUs without
introducing truly ugly code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:54 -04:00
Sean Christopherson 142ccde1f7 KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs
Gather pending TLB flushes across both the legacy and TDP MMUs when
zapping collapsible SPTEs to avoid multiple flushes if both the legacy
MMU (for nested guests) and TDP MMU have mappings for the memslot.

Note, this also optimizes the TDP MMU to flush only the relevant range
when running as L1 with Hyper-V enlightenments.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:54 -04:00
Sean Christopherson 302695a574 KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy MMU
Place the onus on the caller of slot_handle_*() to flush the TLB, rather
than handling the flush in the helper, and rename parameters accordingly.
This will allow future patches to coalesce flushes between address spaces
and between the legacy and TDP MMUs.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:53 -04:00
Sean Christopherson af95b53e56 KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible SPTEs
When zapping collapsible SPTEs across multiple roots, gather pending
flushes and perform a single remote TLB flush at the end, as opposed to
flushing after processing every root.

Note, flush may be cleared by the result of zap_collapsible_spte_range().
This is intended and correct, e.g. yielding may have serviced a prior
pending flush.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:53 -04:00
Vitaly Kuznetsov c28fa560c5 KVM: x86/vPMU: Forbid reading from MSR_F15H_PERF MSRs when guest doesn't have X86_FEATURE_PERFCTR_CORE
MSR_F15H_PERF_CTL0-5, MSR_F15H_PERF_CTR0-5 MSRs have a CPUID bit assigned
to them (X86_FEATURE_PERFCTR_CORE) and when it wasn't exposed to the guest
the correct behavior is to inject #GP an not just return zero.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210329124804.170173-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:53 -04:00
Krish Sadhukhan 9a7de6ecc3 KVM: nSVM: If VMRUN is single-stepped, queue the #DB intercept in nested_svm_vmexit()
According to APM, the #DB intercept for a single-stepped VMRUN must happen
after the completion of that instruction, when the guest does #VMEXIT to
the host. However, in the current implementation of KVM, the #DB intercept
for a single-stepped VMRUN happens after the completion of the instruction
that follows the VMRUN instruction. When the #DB intercept handler is
invoked, it shows the RIP of the instruction that follows VMRUN, instead of
of VMRUN itself. This is an incorrect RIP as far as single-stepping VMRUN
is concerned.

This patch fixes the problem by checking, in nested_svm_vmexit(), for the
condition that the VMRUN instruction is being single-stepped and if so,
queues the pending #DB intercept so that the #DB is accounted for before
we execute L1's next instruction.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oraacle.com>
Message-Id: <20210323175006.73249-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:52 -04:00
Paolo Bonzini 4a38162ee9 KVM: MMU: load PDPTRs outside mmu_lock
On SVM, reading PDPTRs might access guest memory, which might fault
and thus might sleep.  On the other hand, it is not possible to
release the lock after make_mmu_pages_available has been called.

Therefore, push the call to make_mmu_pages_available and the
mmu_lock critical section within mmu_alloc_direct_roots and
mmu_alloc_shadow_roots.

Reported-by: Wanpeng Li <wanpengli@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:30:52 -04:00
Paolo Bonzini d9bd0082e2 Merge remote-tracking branch 'tip/x86/sgx' into kvm-next
Pull generic x86 SGX changes needed to support SGX in virtual machines.
2021-04-17 08:29:47 -04:00
Walter Wu 02c587733c kasan: remove redundant config option
CONFIG_KASAN_STACK and CONFIG_KASAN_STACK_ENABLE both enable KASAN stack
instrumentation, but we should only need one config, so that we remove
CONFIG_KASAN_STACK_ENABLE and make CONFIG_KASAN_STACK workable.  see [1].

When enable KASAN stack instrumentation, then for gcc we could do no
prompt and default value y, and for clang prompt and default value n.

This patch fixes the following compilation warning:

  include/linux/kasan.h:333:30: warning: 'CONFIG_KASAN_STACK' is not defined, evaluates to 0 [-Wundef]

[akpm@linux-foundation.org: fix merge snafu]

Link: https://bugzilla.kernel.org/show_bug.cgi?id=210221 [1]
Link: https://lkml.kernel.org/r/20210226012531.29231-1-walter-zh.wu@mediatek.com
Fixes: d9b571c885 ("kasan: fix KASAN_STACK dependency for HW_TAGS")
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-16 16:10:36 -07:00
Nathan Chancellor 5deac80d45 perf/amd/uncore: Fix sysfs type mismatch
dev_attr_show() calls the __uncore_*_show() functions via an indirect
call but their type does not currently match the type of the show()
member in 'struct device_attribute', resulting in a Control Flow
Integrity violation.

$ cat /sys/devices/amd_l3/format/umask
config:8-15

$ dmesg | grep "CFI failure"
[ 1258.174653] CFI failure (target: __uncore_umask_show...):

Update the type in the DEFINE_UNCORE_FORMAT_ATTR macro to match
'struct device_attribute' so that there is no more CFI violation.

Fixes: 06f2c24584 ("perf/amd/uncore: Prepare to scale for more attributes that vary per family")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210415001112.3024673-2-nathan@kernel.org
2021-04-16 18:58:52 +02:00
Nathan Chancellor de5bc7b425 x86/events/amd/iommu: Fix sysfs type mismatch
dev_attr_show() calls _iommu_event_show() via an indirect call but
_iommu_event_show()'s type does not currently match the type of the
show() member in 'struct device_attribute', resulting in a Control Flow
Integrity violation.

$ cat /sys/devices/amd_iommu_1/events/mem_dte_hit
csource=0x0a

$ dmesg | grep "CFI failure"
[ 3526.735140] CFI failure (target: _iommu_event_show...):

Change _iommu_event_show() and 'struct amd_iommu_event_desc' to
'struct device_attribute' so that there is no more CFI violation.

Fixes: 7be6296fdd ("perf/x86/amd: AMD IOMMU Performance Counter PERF uncore PMU implementation")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210415001112.3024673-1-nathan@kernel.org
2021-04-16 18:58:52 +02:00
Joerg Roedel 49d11527e5 Merge branches 'iommu/fixes', 'arm/mediatek', 'arm/smmu', 'arm/exynos', 'unisoc', 'x86/vt-d', 'x86/amd' and 'core' into next 2021-04-16 17:16:03 +02:00
Kan Liang 46ade4740b perf/x86: Move cpuc->running into P4 specific code
The 'running' variable is only used in the P4 PMU. Current perf sets the
variable in the critical function x86_pmu_start(), which wastes cycles
for everybody not running on P4.

Move cpuc->running into the P4 specific p4_pmu_enable_event().

Add a static per-CPU 'p4_running' variable to replace the 'running'
variable in the struct cpu_hw_events. Saves space for the generic
structure.

The p4_pmu_enable_all() also invokes the p4_pmu_enable_event(), but it
should not set cpuc->running. Factor out __p4_pmu_enable_event() for
p4_pmu_enable_all().

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618410990-21383-1-git-send-email-kan.liang@linux.intel.com
2021-04-16 16:32:42 +02:00
Marco Elver fb6cc127e0 signal: Introduce TRAP_PERF si_code and si_perf to siginfo
Introduces the TRAP_PERF si_code, and associated siginfo_t field
si_perf. These will be used by the perf event subsystem to send signals
(if requested) to the task where an event occurred.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic
Link: https://lkml.kernel.org/r/20210408103605.1676875-6-elver@google.com
2021-04-16 16:32:41 +02:00
Georges Aureau 0b45143b4b x86/platform/uv: Add more to secondary CPU kdump info
Add call to run_crash_ipi_callback() to gather more info of what the
secondary CPUs were doing to help with failure analysis.

Excerpt from Georges:

'It is only changing where crash secondaries will be stalling after
having taken care of properly laying down "crash note regs". Please
note that "crash note regs" are a key piece of data used by crash dump
debuggers to provide a reliable backtrace of running processors.'

Secondary change pursuant to

  a5f526ecb0 ("CodingStyle: Inclusive Terminology"):

change master/slave to main/secondary.

 [ bp: Massage commit message. ]

Signed-off-by: Georges Aureau <georges.aureau@hpe.com>
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lkml.kernel.org/r/20210311151028.82678-1-mike.travis@hpe.com
2021-04-16 12:51:41 +02:00
Mike Travis 26d4be3ea1 x86/platform/uv: Use x2apic enabled bit as set by BIOS to indicate APIC mode
BIOS now sets the x2apic enabled bit (and the ACPI table) for extended
APIC modes. Use that bit to indicate if extended mode is set.

 [ bp: Fixup subject prefix, merge subsequent fix
   https://lkml.kernel.org/r/20210415220626.223955-1-mike.travis@hpe.com ]

Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210408160047.1703-1-mike.travis@hpe.com
2021-04-16 12:51:03 +02:00
Randy Dunlap 1a594f0afa um: elf.h: Fix W=1 warning for empty body in 'do' statement
Use the common kernel style to eliminate a warning:

./arch/x86/um/asm/elf.h:215:32: warning: suggest braces around empty body in ‘do’ statement [-Wempty-body]
 #define SET_PERSONALITY(ex) do ; while(0)
                                ^

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: linux-um@lists.infradead.org
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-04-15 23:10:50 +02:00
Randy Dunlap a730af6e31 um: Add 2 missing libs to fix various build errors
Fix many build errors (at least 18 build error reports) for uml on i386
by adding 2 more library object files. All missing symbols are
either cmpxchg8b_emu or atomic*386.

Here are a few examples of the build errors that are eliminated:

   /usr/bin/ld: core.c:(.text+0xd83): undefined reference to `cmpxchg8b_emu'
   /usr/bin/ld: core.c:(.text+0x2bb2): undefined reference to `atomic64_add_386'
   /usr/bin/ld: core.c:(.text+0x2c5d): undefined reference to `atomic64_xchg_386'
   syscall.c:(.text+0x2f49): undefined reference to `atomic64_set_386'
   /usr/bin/ld: syscall.c:(.text+0x2f54): undefined reference to `atomic64_set_386'
   syscall.c:(.text+0x33a4): undefined reference to `atomic64_inc_386'
   /usr/bin/ld: syscall.c:(.text+0x33ac): undefined reference to `atomic64_inc_386'
   /usr/bin/ld: net/ipv4/inet_timewait_sock.o: in function `inet_twsk_alloc':
   inet_timewait_sock.c:(.text+0x3d1): undefined reference to `atomic64_read_386'
   /usr/bin/ld: inet_timewait_sock.c:(.text+0x3dd): undefined reference to `atomic64_set_386'
   /usr/bin/ld: net/ipv4/inet_connection_sock.o: in function `inet_csk_clone_lock':
   inet_connection_sock.c:(.text+0x1d74): undefined reference to `atomic64_read_386'
   /usr/bin/ld: inet_connection_sock.c:(.text+0x1d80): undefined reference to `atomic64_set_386'
   /usr/bin/ld: net/ipv4/tcp_input.o: in function `inet_reqsk_alloc':
   tcp_input.c:(.text+0xa345): undefined reference to `atomic64_set_386'
   /usr/bin/ld: net/mac80211/wpa.o: in function `ieee80211_crypto_tkip_encrypt':
   wpa.c:(.text+0x739): undefined reference to `atomic64_inc_return_386'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: kbuild-all@lists.01.org
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: linux-um@lists.infradead.org
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-04-15 23:10:40 +02:00
Johannes Berg dc01a3b9db um: Fix tag order in stub_32.h
"static void inline" is the wrong way around, fix that.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 9f0b4807a4 ("um: rework userspace stubs to not hard-code stub location")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2021-04-15 23:06:37 +02:00
Linus Torvalds 7e25f40eab ACPI fix for 5.12-rc8.
Restore the initrd-based ACPI table override functionality broken by
 one of the recent fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmB4YJASHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxVxgP+gPMJ4hpNIj+x6m6Np7wgatwx/3fuWnq
 Qgrpt8ydPXgSfj+B7sMvCmgkMt3f/c5tQa5k+j00tg0n6qHfQa8hISvUBofk6myu
 t9J/zNiH75x/HXOBwHXHVoaNiZ6RtPu7AbmKCNfF0wSwt7CsTTtskplJEMtCtU8/
 WPIbze7DlGXbLTtDZswfT+bu2ntc7sTHVPgFLtJpTuf3YpXvU5HUgA4HwATtpAV/
 7cm3AFJuprSMWjFs+UXDbYB+66QYubhMcX1N6Ws2XVeVKQtkXIFIBe10eHJZGqhk
 IUx/ICq0IGYdZr9RZ2r55mFgVYtthq+sV0APKKjmImcu4IxRQycHfNIYa4Uz3Nxn
 qRroWiBnQDSMwQTR9ylfx/BxW0mG0FSZtmC2fY2fFLzu2NyOohfy3uiGaEepXK/U
 7yFeUu94sIQ8pTuE4K5F55TsZZf2uXPcyug854qEBMHvoUBqwdBbmVWyG4f16Z71
 CtVAEYtgBrV24XaNyEVy8xrjGzKXND45sWVrDgk3qiZSEMDC8XyPJaSHoWAcOrQt
 laEZacqusASBUHfXNabbQseuNNzUwZWXhG9Vwmdb3remYWQs0fR6lJrd9LkvUBIM
 dQ7kLm04BTAnwmPpL7vr72zaidtd/N8C1Wo7gO2WBZozc5qUsyOF20R6JtXWEkSU
 oSm3wAsI0Vfc
 =82xL
 -----END PGP SIGNATURE-----

Merge tag 'acpi-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
 "Restore the initrd-based ACPI table override functionality broken by
  one of the recent fixes"

* tag 'acpi-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: x86: Call acpi_boot_table_init() after acpi_table_upgrade()
2021-04-15 10:53:39 -07:00
Alison Schofield 2c88d45edb x86, sched: Treat Intel SNC topology as default, COD as exception
Commit 1340ccfa9a ("x86,sched: Allow topologies where NUMA nodes
share an LLC") added a vendor and model specific check to never
call topology_sane() for Intel Skylake Server systems where NUMA
nodes share an LLC.

Intel Ice Lake and Sapphire Rapids CPUs also enumerate an LLC that is
shared by multiple NUMA nodes. The LLC on these CPUs is shared for
off-package data access but private to the NUMA node for on-package
access. Rather than managing a list of allowable SNC topologies, make
this SNC topology the default, and treat Intel's Cluster-On-Die (COD)
topology as the exception.

In SNC mode, Sky Lake, Ice Lake, and Sapphire Rapids servers do not
emit this warning:

sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210310190233.31752-1-alison.schofield@intel.com
2021-04-15 18:34:20 +02:00
Linus Torvalds 2558258d78 Fix for a possible out-of-bounds access.
-----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmB2GmYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOOwAf/Qc56PZYWi0iGkoEn57b06Xb8shOC
 85of9i9DN55FTKDkiU3BEz2t4Q89UZJGDEfBN83QPzafem3ihRonBVTU5AYT1yPo
 0Q8cYF9H+/86onZWx7FHlHN2rLBOL9druiXSrbZWe6hVj2sasTHHTAV0DFr3V+UX
 H7dWP9I1V77icZj1M2yDWfg3umE3baiJnylpSduH/1oM9ox5x2en/bAzgtpEKAgl
 vgC9dT4c8zpGXh7hfpOoo8QJo61pjHJC12T2+lieQjmaH9yDh5JNXBcGtm2K2jVQ
 UF6t+aOQVD1Bho18EM6+aYfcnhaLpExpLUX0FA1dJB6fEC+Z8UcLrErb3w==
 =9tir
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fix from Paolo Bonzini:
 "Fix for a possible out-of-bounds access"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: Don't use vcpu->run->internal.ndata as an array index
2021-04-14 08:50:46 -07:00
Christophe Leroy 808094fcbf lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data()
For the same reason as commit e876f0b69d ("lib/vdso: Allow
architectures to provide the vdso data pointer"), powerpc wants to
avoid calculation of relative position to code.

As the timens_vdso_data is next page to vdso_data, provide
vdso_data pointer to __arch_get_timens_vdso_data() in order
to ease the calculation on powerpc in following patches.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/539c4204b1baa77c55f758904a1ea239abbc7a5c.1617209142.git.christophe.leroy@csgroup.eu
2021-04-14 23:04:44 +10:00
Jan Kiszka 16854b567d x86/pat: Do not compile stubbed functions when X86_PAT is off
Those are already provided by linux/io.h as stubs.

The conflict remains invisible until someone would pull linux/io.h into
memtype.c. This fixes a build error when this file is used outside of
the kernel tree.

  [ bp: Massage commit message. ]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/a9351615-7a0d-9d47-af65-d9e2fffe8192@siemens.com
2021-04-14 08:21:41 +02:00
Mike Rapoport c361e5d4d0 x86/setup: Move trim_snb_memory() later in setup_arch() to fix boot hangs
Commit

  a799c2bd29 ("x86/setup: Consolidate early memory reservations")

moved reservation of the memory inaccessible by Sandy Bride integrated
graphics very early, and, as a result, on systems with such devices
the first 1M was reserved by trim_snb_memory() which prevented the
allocation of the real mode trampoline and made the boot hang very
early.

Since the purpose of trim_snb_memory() is to prevent problematic pages
ever reaching the graphics device, it is safe to reserve these pages
after memblock allocations are possible.

Move trim_snb_memory() later in boot so that it will be called after
reserve_real_mode() and make comments describing trim_snb_memory()
operation more elaborate.

 [ bp: Massage a bit. ]

Fixes: a799c2bd29 ("x86/setup: Consolidate early memory reservations")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Hugh Dickins <hughd@google.com>
Link: https://lkml.kernel.org/r/f67d3e03-af90-f790-baf4-8d412fe055af@infradead.org
2021-04-14 08:16:48 +02:00
Reiji Watanabe 04c4f2ee3f KVM: VMX: Don't use vcpu->run->internal.ndata as an array index
__vmx_handle_exit() uses vcpu->run->internal.ndata as an index for
an array access.  Since vcpu->run is (can be) mapped to a user address
space with a writer permission, the 'ndata' could be updated by the
user process at anytime (the user process can set it to outside the
bounds of the array).
So, it is not safe that __vmx_handle_exit() uses the 'ndata' that way.

Fixes: 1aa561b1a4 ("kvm: x86: Add "last CPU" to some KVM_EXIT information")
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210413154739.490299-1-reijiw@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-13 18:23:41 -04:00
Daniel Vetter 213cc929cb Merge drm/drm-fixes into drm-next
msm-next pull request has a baseline with stuff from -fixes, roll
forward first.

Some simple conflicts in amdgpu, ttm and one in i915 where git gets
confused and tries to add the same function twice.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2021-04-13 23:15:09 +02:00
Rafael J. Wysocki 6998a8800d ACPI: x86: Call acpi_boot_table_init() after acpi_table_upgrade()
Commit 1a1c130ab7 ("ACPI: tables: x86: Reserve memory occupied by
ACPI tables") attempted to address an issue with reserving the memory
occupied by ACPI tables, but it broke the initrd-based table override
mechanism relied on by multiple users.

To restore the initrd-based ACPI table override functionality, move
the acpi_boot_table_init() invocation in setup_arch() on x86 after
the acpi_table_upgrade() one.

Fixes: 1a1c130ab7 ("ACPI: tables: x86: Reserve memory occupied by ACPI tables")
Reported-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-13 16:03:01 +02:00
Wei Yongjun 523caed9ef x86/sgx: Mark sgx_vepc_vm_ops static
Fix the following sparse warning:

  arch/x86/kernel/cpu/sgx/virt.c:95:35: warning:
    symbol 'sgx_vepc_vm_ops' was not declared. Should it be static?

This symbol is not used outside of virt.c so mark it static.

 [ bp: Massage commit message. ]

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210412160023.193850-1-weiyongjun1@huawei.com
2021-04-12 19:48:32 +02:00
Jan Kiszka f7b21a0e41 x86/asm: Ensure asm/proto.h can be included stand-alone
Fix:

  ../arch/x86/include/asm/proto.h:14:30: warning: ‘struct task_struct’ declared \
    inside parameter list will not be visible outside of this definition or declaration
  long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2);
                               ^~~~~~~~~~~

  .../arch/x86/include/asm/proto.h:40:34: warning: ‘struct task_struct’ declared \
    inside parameter list will not be visible outside of this definition or declaration
   long do_arch_prctl_common(struct task_struct *task, int option,
                                    ^~~~~~~~~~~

if linux/sched.h hasn't be included previously. This fixes a build error
when this header is used outside of the kernel tree.

 [ bp: Massage commit message. ]

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/b76b4be3-cf66-f6b2-9a6c-3e7ef54f9845@web.de
2021-04-12 13:12:46 +02:00
Linus Torvalds 06f838e02d - Fix the vDSO exception handling return path to disable interrupts
again.
 
 - A fix for the CE collector to return the proper return values to its
 callers which are used to convey what the collector has done with the
 error address.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmByunsACgkQEsHwGGHe
 VUo+rRAAmhs1CRMKMcha3KoQM5e3QUk8dA8xYuuHa9UJve6r2HXzSwldAGpYmSKS
 v3Pcdeue0INovp+HWSe1UJa/U6ugQ6KcjGy+xMx01VHAuWjAv/O7wMDRfxMDOnJI
 XmgXJG6IhjZUlRuD7BNkFRkUnsk5dABFTlm3OXcpmOyXBsvRPm2M6n4/ILjIlYI+
 kZCyPf0wmR2VpmwCAkhye1tdWBBmT3I3DNwgq15bhAGf6Eh7fqcieqRmBgwYpHhJ
 bOKx7WeRJa4VayV7uvRId9MAyhi9MY66Mb+CIsK0sxkcza2KizquwapN5zUNKpu2
 i24huaNDljB8n0EV8ZJZpI9Xs9QJUBYL10w3LvaSwEySwnN7QrTWzEn5/gYAS7+J
 wR4og5eDMGzgojZi56adQdnrg3thkGPviikU2lUbXo0mpeoT5I6zaQYdkbBq9r9/
 g6LhM86dOeXqpFDPwSRKCoUgiARDoj+woi+4GF1Hc+bIaffP46K4FnOEUODePS3c
 EXWEpJC2DGZq+QfXBViJKcrQi+0/n9jDD6hY5N4TBsyxuN4iUX60rLiMwNJiphmI
 xMwd7Gcr92K3yiEd7zkav2ncuqBk/OCSadubaDyMQFb0F95evBv09yQKN/RImmZq
 Ywt83UG4x+OXIlbQpAXkgLGMhFkH1GtQJ2DOssT6zrw2PFpjP5w=
 =aV+H
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Fix the vDSO exception handling return path to disable interrupts
   again.

 - A fix for the CE collector to return the proper return values to its
   callers which are used to convey what the collector has done with the
   error address.

* tag 'x86_urgent_for_v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/traps: Correct exc_general_protection() and math_error() return paths
  RAS/CEC: Correct ce_add_elem()'s returned values
2021-04-11 11:42:18 -07:00
Aditya Srivastava 0d6c8e1e24 x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files
The opening comment mark '/**' is used for highlighting the beginning of
kernel-doc comments.
There are certain files in arch/x86/platform/intel-quark, which follow this
syntax, but the content inside does not comply with kernel-doc.
Such lines were probably not meant for kernel-doc parsing, but are parsed
due to the presence of kernel-doc like comment syntax(i.e, '/**'), which
causes unexpected warnings from kernel-doc.

E.g., presence of kernel-doc like comment in the header lines for
arch/x86/platform/intel-quark/imr.c causes these warnings:
"warning: Function parameter or member 'fmt' not described in 'pr_fmt'"
"warning: expecting prototype for c(). Prototype was for pr_fmt() instead"

Similarly for arch/x86/platform/intel-quark/imr_selftest.c too.

Provide a simple fix by replacing these occurrences with general comment
format, i.e. '/*', to prevent kernel-doc from parsing it.

Signed-off-by: Aditya Srivastava <yashsri421@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20210330213022.28769-1-yashsri421@gmail.com
2021-04-10 13:59:25 +02:00
Andrew Cooper 99cb64de36 x86/cpu: Comment Skylake server stepping too
Further to

  53375a5a21 ("x86/cpu: Resort and comment Intel models"),

CascadeLake and CooperLake are steppings of Skylake, and make up the 1st
to 3rd generation "Xeon Scalable Processor" line.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210409121027.16437-1-andrew.cooper3@citrix.com
2021-04-10 11:14:33 +02:00
Jakub Kicinski 8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
Linus Torvalds adb2c4174f Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "14 patches.

  Subsystems affected by this patch series: mm (kasan, gup, pagecache,
  and kfence), MAINTAINERS, mailmap, nds32, gcov, ocfs2, ia64, and lib"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  lib: fix kconfig dependency on ARCH_WANT_FRAME_POINTERS
  kfence, x86: fix preemptible warning on KPTI-enabled systems
  lib/test_kasan_module.c: suppress unused var warning
  kasan: fix conflict with page poisoning
  fs: direct-io: fix missing sdio->boundary
  ia64: fix user_stack_pointer() for ptrace()
  ocfs2: fix deadlock between setattr and dio_end_io_write
  gcov: re-fix clang-11+ support
  nds32: flush_dcache_page: use page_mapping_file to avoid races with swapoff
  mm/gup: check page posion status for coredump.
  .mailmap: fix old email addresses
  mailmap: update email address for Jordan Crouse
  treewide: change my e-mail address, fix my name
  MAINTAINERS: update CZ.NIC's Turris information
2021-04-09 17:06:32 -07:00
Linus Torvalds 4e04e7513b Networking fixes for 5.12-rc7, including fixes from can, ipsec,
mac80211, wireless, and bpf trees. No scary regressions here
 or in the works, but small fixes for 5.12 changes keep coming.
 
 Current release - regressions:
 
  - virtio: do not pull payload in skb->head
 
  - virtio: ensure mac header is set in virtio_net_hdr_to_skb()
 
  - Revert "net: correct sk_acceptq_is_full()"
 
  - mptcp: revert "mptcp: provide subflow aware release function"
 
  - ethernet: lan743x: fix ethernet frame cutoff issue
 
  - dsa: fix type was not set for devlink port
 
  - ethtool: remove link_mode param and derive link params
             from driver
 
  - sched: htb: fix null pointer dereference on a null new_q
 
  - wireless: iwlwifi: Fix softirq/hardirq disabling in
                       iwl_pcie_enqueue_hcmd()
 
  - wireless: iwlwifi: fw: fix notification wait locking
 
  - wireless: brcmfmac: p2p: Fix deadlock introduced by avoiding
                             the rtnl dependency
 
 Current release - new code bugs:
 
  - napi: fix hangup on napi_disable for threaded napi
 
  - bpf: take module reference for trampoline in module
 
  - wireless: mt76: mt7921: fix airtime reporting and related
                            tx hangs
 
  - wireless: iwlwifi: mvm: rfi: don't lock mvm->mutex when sending
                                 config command
 
 Previous releases - regressions:
 
  - rfkill: revert back to old userspace API by default
 
  - nfc: fix infinite loop, refcount & memory leaks in LLCP sockets
 
  - let skb_orphan_partial wake-up waiters
 
  - xfrm/compat: Cleanup WARN()s that can be user-triggered
 
  - vxlan, geneve: do not modify the shared tunnel info when PMTU
                   triggers an ICMP reply
 
  - can: fix msg_namelen values depending on CAN_REQUIRED_SIZE
 
  - can: uapi: mark union inside struct can_frame packed
 
  - sched: cls: fix action overwrite reference counting
 
  - sched: cls: fix err handler in tcf_action_init()
 
  - ethernet: mlxsw: fix ECN marking in tunnel decapsulation
 
  - ethernet: nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx
 
  - ethernet: i40e: fix receiving of single packets in xsk zero-copy
                    mode
 
  - ethernet: cxgb4: avoid collecting SGE_QBASE regs during traffic
 
 Previous releases - always broken:
 
  - bpf: Refuse non-O_RDWR flags in BPF_OBJ_GET
 
  - bpf: Refcount task stack in bpf_get_task_stack
 
  - bpf, x86: Validate computation of branch displacements
 
  - ieee802154: fix many similar syzbot-found bugs
     - fix NULL dereferences in netlink attribute handling
     - reject unsupported operations on monitor interfaces
     - fix error handling in llsec_key_alloc()
 
  - xfrm: make ipv4 pmtu check honor ip header df
 
  - xfrm: make hash generation lock per network namespace
 
  - xfrm: esp: delete NETIF_F_SCTP_CRC bit from features for esp
               offload
 
  - ethtool: fix incorrect datatype in set_eee ops
 
  - xdp: fix xdp_return_frame() kernel BUG throw for page_pool
         memory model
 
  - openvswitch: fix send of uninitialized stack memory in ct limit
                 reply
 
 Misc:
 
  - udp: add get handling for UDP_GRO sockopt
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmBwyfAACgkQMUZtbf5S
 IruJ/BAAnjghw2kWXRCKK3Tkm0pi0zjaKvTS30AcKCW2+GnqSxTdiWNv+mxqFgnm
 YdduPKiGwLoDkA2i2d4EF8/HK6m+Q6bHcUbZ2npEm1ElkKfxCYGmocor8n2kD+a9
 je94VGYV7zytnxXw85V6/jFLDqOXXwhBfHhlDMVBZP8OyzUfbDKGorWmyGuy9GJp
 81bvzqN2bHUGIM0cDr+ol3eYw2ituGWgiqNfnq7z+/NVcYmD0EPChDRbp0jtH1ng
 dcoONI6YlymDEDpu/9GmyKL1ken9lcWoVdvv/aDGtP62x6SYDt5HKe3wAtJ+Kjbq
 jIPADxPx5BymYIZRBtdNR0rP66LycA7hDtM/C/h1WoihDXwpGeNUU4g0aJ+hsP5Q
 ldwJI1DJo79VbwM2c3Kg73PaphLcPD4RdwF0/ovFsl0+bTDfj8i93ah4Wnzj0Qli
 EMiSDEDNb51e9nkW+xu+FjLWmxHJvLOL/+VgHV5bPJJBob2fqnjAMj2PkPEuEtXY
 TPWEh9y3zaEyp/9tNx0cstGOt6Gf5DQ5Nk6tX6hMpJT/BeL8mju1jm0yPLZhMJjF
 LlTrJgXftfP/cjltdSm4aVqSU5okjHNYDhmHlNgvzih5mt+NVslRJfzwq62Vudqy
 C0kpmVdQNFkOB0UcqQihevZg9mvem3m/dYl+v/MV7Uq6r4s4M2A=
 =SHL0
 -----END PGP SIGNATURE-----

Merge tag 'net-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.12-rc7, including fixes from can, ipsec,
  mac80211, wireless, and bpf trees.

  No scary regressions here or in the works, but small fixes for 5.12
  changes keep coming.

  Current release - regressions:

   - virtio: do not pull payload in skb->head

   - virtio: ensure mac header is set in virtio_net_hdr_to_skb()

   - Revert "net: correct sk_acceptq_is_full()"

   - mptcp: revert "mptcp: provide subflow aware release function"

   - ethernet: lan743x: fix ethernet frame cutoff issue

   - dsa: fix type was not set for devlink port

   - ethtool: remove link_mode param and derive link params from driver

   - sched: htb: fix null pointer dereference on a null new_q

   - wireless: iwlwifi: Fix softirq/hardirq disabling in
     iwl_pcie_enqueue_hcmd()

   - wireless: iwlwifi: fw: fix notification wait locking

   - wireless: brcmfmac: p2p: Fix deadlock introduced by avoiding the
     rtnl dependency

  Current release - new code bugs:

   - napi: fix hangup on napi_disable for threaded napi

   - bpf: take module reference for trampoline in module

   - wireless: mt76: mt7921: fix airtime reporting and related tx hangs

   - wireless: iwlwifi: mvm: rfi: don't lock mvm->mutex when sending
     config command

  Previous releases - regressions:

   - rfkill: revert back to old userspace API by default

   - nfc: fix infinite loop, refcount & memory leaks in LLCP sockets

   - let skb_orphan_partial wake-up waiters

   - xfrm/compat: Cleanup WARN()s that can be user-triggered

   - vxlan, geneve: do not modify the shared tunnel info when PMTU
     triggers an ICMP reply

   - can: fix msg_namelen values depending on CAN_REQUIRED_SIZE

   - can: uapi: mark union inside struct can_frame packed

   - sched: cls: fix action overwrite reference counting

   - sched: cls: fix err handler in tcf_action_init()

   - ethernet: mlxsw: fix ECN marking in tunnel decapsulation

   - ethernet: nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx

   - ethernet: i40e: fix receiving of single packets in xsk zero-copy
     mode

   - ethernet: cxgb4: avoid collecting SGE_QBASE regs during traffic

  Previous releases - always broken:

   - bpf: Refuse non-O_RDWR flags in BPF_OBJ_GET

   - bpf: Refcount task stack in bpf_get_task_stack

   - bpf, x86: Validate computation of branch displacements

   - ieee802154: fix many similar syzbot-found bugs
       - fix NULL dereferences in netlink attribute handling
       - reject unsupported operations on monitor interfaces
       - fix error handling in llsec_key_alloc()

   - xfrm: make ipv4 pmtu check honor ip header df

   - xfrm: make hash generation lock per network namespace

   - xfrm: esp: delete NETIF_F_SCTP_CRC bit from features for esp
     offload

   - ethtool: fix incorrect datatype in set_eee ops

   - xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory
     model

   - openvswitch: fix send of uninitialized stack memory in ct limit
     reply

  Misc:

   - udp: add get handling for UDP_GRO sockopt"

* tag 'net-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (182 commits)
  net: fix hangup on napi_disable for threaded napi
  net: hns3: Trivial spell fix in hns3 driver
  lan743x: fix ethernet frame cutoff issue
  net: ipv6: check for validity before dereferencing cfg->fc_nlinfo.nlh
  net: dsa: lantiq_gswip: Configure all remaining GSWIP_MII_CFG bits
  net: dsa: lantiq_gswip: Don't use PHY auto polling
  net: sched: sch_teql: fix null-pointer dereference
  ipv6: report errors for iftoken via netlink extack
  net: sched: fix err handler in tcf_action_init()
  net: sched: fix action overwrite reference counting
  Revert "net: sched: bump refcount for new action in ACT replace mode"
  ice: fix memory leak of aRFS after resuming from suspend
  i40e: Fix sparse warning: missing error code 'err'
  i40e: Fix sparse error: 'vsi->netdev' could be null
  i40e: Fix sparse error: uninitialized symbol 'ring'
  i40e: Fix sparse errors in i40e_txrx.c
  i40e: Fix parameters in aq_get_phy_register()
  nl80211: fix beacon head validation
  bpf, x86: Validate computation of branch displacements for x86-32
  bpf, x86: Validate computation of branch displacements for x86-64
  ...
2021-04-09 15:26:51 -07:00
Marco Elver 6a77d38efc kfence, x86: fix preemptible warning on KPTI-enabled systems
On systems with KPTI enabled, we can currently observe the following
warning:

  BUG: using smp_processor_id() in preemptible
  caller is invalidate_user_asid+0x13/0x50
  CPU: 6 PID: 1075 Comm: dmesg Not tainted 5.12.0-rc4-gda4a2b1a5479-kfence_1+ #1
  Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012
  Call Trace:
   dump_stack+0x7f/0xad
   check_preemption_disabled+0xc8/0xd0
   invalidate_user_asid+0x13/0x50
   flush_tlb_one_kernel+0x5/0x20
   kfence_protect+0x56/0x80
   ...

While it normally makes sense to require preemption to be off, so that
the expected CPU's TLB is flushed and not another, in our case it really
is best-effort (see comments in kfence_protect_page()).

Avoid the warning by disabling preemption around flush_tlb_one_kernel().

Link: https://lore.kernel.org/lkml/YGIDBAboELGgMgXy@elver.google.com/
Link: https://lkml.kernel.org/r/20210330065737.652669-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Tomi Sarvela <tomi.p.sarvela@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-09 14:54:23 -07:00
Linus Torvalds ccd6c35c72 ACPI fix for 5.12-rc7
Fix a build issue introduced by a previous fix in the ACPI processor
 driver (Vitaly Kuznetsov).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmBwbxsSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxSQ4P/iLqRq8+u/6vNcc7nQXNChGJvybNXrgy
 +3lqc6c0DYUsj4KF0CpZnNMKF6V9ig/nCTYFyWrDxdhaKeQuJMxVncfD3iA9ZZ4q
 BNwvteNDFaUgyGJyrxvtNKFiWxFv454hM6mn1PU6bE5XBpX1++wRRBIKPY75lWOI
 lWgehcwW0lHQUQabvDaC0YYFK3ZxTxz/xiau26ZBtt2QYctC4VkAy3r+RaYnn3ug
 6+85rO5TW9Ul/AT3Csx2Xv5CEs15htzcJe0qoMBmCQHctTpObtzcw4+OihY6gBL5
 AaeJA0fgOS23G2ZjMbxZre8E9J1HsftWKaj4wBcMqYOwzT20FenrMa18beTRZM9F
 n7QHtriR1uaTohA+qMkXn2rOdhYp35jgC8nLfJzmMJWpXxRj4ejtZc+aMS3kuYk+
 YT18SDj8KmIxAIgvlqETkOaKtvjPYYnoMh1DWFOf4uWRPsDkGAdo+oWnh8uteWCa
 Nfc5COaajdoIswS+U1ExHQ7HbbIg+vAtx0/RD6M3JGp/mlvojo33GuuDrShmFXmO
 ZR0DH7GlApvgXAMR+NR60DG8D6xX6Rk7hi6P7hn8b5LOPraspohZehLdFHMlht4n
 js0CAWBJhX6SieyT+gQqY+rIEq4johm23EryA3AWVIc0USfiNTdYryl/GsiZ2W9y
 BkQX07tSorHX
 =Jzzk
 -----END PGP SIGNATURE-----

Merge tag 'acpi-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
 "Fix a build issue introduced by a previous fix in the ACPI processor
  driver (Vitaly Kuznetsov)"

* tag 'acpi-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: processor: Fix build when CONFIG_ACPI_PROCESSOR=m
2021-04-09 09:25:31 -07:00
Thomas Tai 632a1c209b x86/traps: Correct exc_general_protection() and math_error() return paths
Commit

  334872a091 ("x86/traps: Attempt to fixup exceptions in vDSO before signaling")

added return statements which bypass calling cond_local_irq_disable().

According to

  ca4c6a9858 ("x86/traps: Make interrupt enable/disable symmetric in C code"),

cond_local_irq_disable() is needed because the asm return code no longer
disables interrupts. Follow the existing code as an example to use "goto
exit" instead of "return" statement.

 [ bp: Massage commit message. ]

Fixes: 334872a091 ("x86/traps: Attempt to fixup exceptions in vDSO before signaling")
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/1617902914-83245-1-git-send-email-thomas.tai@oracle.com
2021-04-09 13:45:09 +02:00
Linus Torvalds d381b05e86 A lone x86 patch, for a bug found while developing a backport to
stable versions.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmBu7g0UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOb6wf/aKgdBEGlWA1qVci/Z19uAlgr30vN
 IXsDGG7XJWtcjCK18T23o1WOmGhyMzSAic3HjmyZtVKJ/OMXDLOE7yrcOgDMtx7l
 M5kPUiPjbbMFQB2oG/hzafq4FDfqyL8oOJf2+SvElkUNx43nrJ/FuaXKoq3ae5y8
 sQ+JGKnM/FYnP0++buItQ+QN1mcUXq7RmfYguUhjSUzkx1KjVZJuPpdV6VB8pTpD
 FBtOvBomlCSov1wNpsFMFp31VRsu5wGVU0/9CaKpKAvM7ZlEVnLygzIWZHyE6vfl
 VX4snrd1onlgRacthkovLYaArisGLgWPQjHjRa6YE0qm6rUoZM9VWQoprg==
 =Szst
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fix from Paolo Bonzini:
 "A lone x86 patch, for a bug found while developing a backport to
  stable versions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp
2021-04-08 08:54:26 -07:00
Jarkko Sakkinen ae40aaf6bd x86/sgx: Do not update sgx_nr_free_pages in sgx_setup_epc_section()
The commit in Fixes: changed the SGX EPC page sanitization to end up in
sgx_free_epc_page() which puts clean and sanitized pages on the free
list.

This was done for the reason that it is best to keep the logic to assign
available-for-use EPC pages to the correct NUMA lists in a single
location.

sgx_nr_free_pages is also incremented by sgx_free_epc_pages() but those
pages which are being added there per EPC section do not belong to the
free list yet because they haven't been sanitized yet - they land on the
dirty list first and the sanitization happens later when ksgxd starts
massaging them.

So remove that addition there and have sgx_free_epc_page() do that
solely.

 [ bp: Sanitize commit message too. ]

Fixes: 51ab30eb2a ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list")
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210408092924.7032-1-jarkko@kernel.org
2021-04-08 17:24:42 +02:00
Piotr Krysiuk 26f55a59dc bpf, x86: Validate computation of branch displacements for x86-32
The branch displacement logic in the BPF JIT compilers for x86 assumes
that, for any generated branch instruction, the distance cannot
increase between optimization passes.

But this assumption can be violated due to how the distances are
computed. Specifically, whenever a backward branch is processed in
do_jit(), the distance is computed by subtracting the positions in the
machine code from different optimization passes. This is because part
of addrs[] is already updated for the current optimization pass, before
the branch instruction is visited.

And so the optimizer can expand blocks of machine code in some cases.

This can confuse the optimizer logic, where it assumes that a fixed
point has been reached for all machine code blocks once the total
program size stops changing. And then the JIT compiler can output
abnormal machine code containing incorrect branch displacements.

To mitigate this issue, we assert that a fixed point is reached while
populating the output image. This rejects any problematic programs.
The issue affects both x86-32 and x86-64. We mitigate separately to
ease backporting.

Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2021-04-08 16:24:53 +02:00
Piotr Krysiuk e4d4d45643 bpf, x86: Validate computation of branch displacements for x86-64
The branch displacement logic in the BPF JIT compilers for x86 assumes
that, for any generated branch instruction, the distance cannot
increase between optimization passes.

But this assumption can be violated due to how the distances are
computed. Specifically, whenever a backward branch is processed in
do_jit(), the distance is computed by subtracting the positions in the
machine code from different optimization passes. This is because part
of addrs[] is already updated for the current optimization pass, before
the branch instruction is visited.

And so the optimizer can expand blocks of machine code in some cases.

This can confuse the optimizer logic, where it assumes that a fixed
point has been reached for all machine code blocks once the total
program size stops changing. And then the JIT compiler can output
abnormal machine code containing incorrect branch displacements.

To mitigate this issue, we assert that a fixed point is reached while
populating the output image. This rejects any problematic programs.
The issue affects both x86-32 and x86-64. We mitigate separately to
ease backporting.

Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2021-04-08 16:24:36 +02:00
Peter Zijlstra 53375a5a21 x86/cpu: Resort and comment Intel models
The INTEL_FAM6 list has become a mess again. Try and bring some sanity
back into it.

Where previously we had one microarch per year and a number of SKUs
within that, this no longer seems to be the case. We now get different
uarch names that share a 'core' design.

Add the core name starting at skylake and reorder to keep the cores
in chronological order. Furthermore, Intel marketed the names {Amber,
Coffee, Whiskey} Lake, but those are in fact steppings of Kaby Lake, add
comments for them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/YE+HhS8i0gshHD3W@hirez.programming.kicks-ass.net
2021-04-08 14:22:10 +02:00
Kees Cook fe950f6020 x86/entry: Enable random_kstack_offset support
Allow for a randomized stack offset on a per-syscall basis, with roughly
5-6 bits of entropy, depending on compiler and word size. Since the
method of offsetting uses macros, this cannot live in the common entry
code (the stack offset needs to be retained for the life of the syscall,
which means it needs to happen at the actual entry point).

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210401232347.2791257-5-keescook@chromium.org
2021-04-08 14:05:20 +02:00
Paolo Bonzini 315f02c60d KVM: x86/mmu: preserve pending TLB flush across calls to kvm_tdp_mmu_zap_sp
Right now, if a call to kvm_tdp_mmu_zap_sp returns false, the caller
will skip the TLB flush, which is wrong.  There are two ways to fix
it:

- since kvm_tdp_mmu_zap_sp will not yield and therefore will not flush
  the TLB itself, we could change the call to kvm_tdp_mmu_zap_sp to
  use "flush |= ..."

- or we can chain the flush argument through kvm_tdp_mmu_zap_sp down
  to __kvm_tdp_mmu_zap_gfn_range.  Note that kvm_tdp_mmu_zap_sp will
  neither yield nor flush, so flush would never go from true to
  false.

This patch does the former to simplify application to stable kernels,
and to make it further clearer that kvm_tdp_mmu_zap_sp will not flush.

Cc: seanjc@google.com
Fixes: 048f49809c ("KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping")
Cc: <stable@vger.kernel.org> # 5.10.x: 048f49809c: KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
Cc: <stable@vger.kernel.org> # 5.10.x: 33a3164161: KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
Cc: <stable@vger.kernel.org>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-08 07:48:18 -04:00
Zhao Xuehui 3e7bbe15ed x86/msr: Make locally used functions static
The functions msr_read() and msr_write() are not used outside of msr.c,
make them static.

 [ bp: Massage commit message. ]

Signed-off-by: Zhao Xuehui <zhaoxuehui1@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210408095218.152264-1-zhaoxuehui1@huawei.com
2021-04-08 11:57:40 +02:00
Yang Li dda451f391 x86/cacheinfo: Remove unneeded dead-store initialization
$ make CC=clang clang-analyzer

(needs clang-tidy installed on the system too)

on x86_64 defconfig triggers:

  arch/x86/kernel/cpu/cacheinfo.c:880:24: warning: Value stored to 'this_cpu_ci' \
	  during its initialization is never read [clang-analyzer-deadcode.DeadStores]
        struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
                              ^
  arch/x86/kernel/cpu/cacheinfo.c:880:24: note: Value stored to 'this_cpu_ci' \
	during its initialization is never read

So simply remove this unneeded dead-store initialization.

As compilers will detect this unneeded assignment and optimize this
anyway the resulting object code is identical before and after this
change.

No functional change. No change to object code.

 [ bp: Massage commit message. ]

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lkml.kernel.org/r/1617177624-24670-1-git-send-email-yang.lee@linux.alibaba.com
2021-04-07 21:12:12 +02:00
Vitaly Kuznetsov fa26d0c778 ACPI: processor: Fix build when CONFIG_ACPI_PROCESSOR=m
Commit 8cdddd182b ("ACPI: processor: Fix CPU0 wakeup in
acpi_idle_play_dead()") tried to fix CPU0 hotplug breakage by copying
wakeup_cpu0() + start_cpu0() logic from hlt_play_dead()//mwait_play_dead()
into acpi_idle_play_dead(). The problem is that these functions are not
exported to modules so when CONFIG_ACPI_PROCESSOR=m build fails.

The issue could've been fixed by exporting both wakeup_cpu0()/start_cpu0()
(the later from assembly) but it seems putting the whole pattern into a
new function and exporting it instead is better.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 8cdddd182b ("CPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()")
Cc: <stable@vger.kernel.org> # 5.10+
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-07 19:02:43 +02:00
Christoph Hellwig fc1b662050 iommu/amd: Move a few prototypes to include/linux/amd-iommu.h
A few functions that were intentended for the perf events support are
currently declared in arch/x86/events/amd/iommu.h, which mens they are
not in scope for the actual function definition.  Also amdkfd has started
using a few of them using externs in a .c file.  End that misery by
moving the prototypes to the proper header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210402143312.372386-5-hch@lst.de
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-04-07 11:14:55 +02:00
Sean Christopherson b3754e5d3d x86/sgx: Move provisioning device creation out of SGX driver
And extract sgx_set_attribute() out of sgx_ioc_enclave_provision() and
export it as symbol for KVM to use.

The provisioning key is sensitive. The SGX driver only allows to create
an enclave which can access the provisioning key when the enclave
creator has permission to open /dev/sgx_provision. It should apply to
a VM as well, as the provisioning key is platform-specific, thus an
unrestricted VM can also potentially compromise the provisioning key.

Move the provisioning device creation out of sgx_drv_init() to
sgx_init() as a preparation for adding SGX virtualization support,
so that even if the SGX driver is not enabled due to flexible launch
control not being available, SGX virtualization can still be enabled,
and use it to restrict a VM's capability of being able to access the
provisioning key.

 [ bp: Massage commit message. ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/0f4d044d621561f26d5f4ef73e8dc6cd18cc7e79.1616136308.git.kai.huang@intel.com
2021-04-06 19:18:46 +02:00
Sean Christopherson d155030b1e x86/sgx: Add helpers to expose ECREATE and EINIT to KVM
The host kernel must intercept ECREATE to impose policies on guests, and
intercept EINIT to be able to write guest's virtual SGX_LEPUBKEYHASH MSR
values to hardware before running guest's EINIT so it can run correctly
according to hardware behavior.

Provide wrappers around __ecreate() and __einit() to hide the ugliness
of overloading the ENCLS return value to encode multiple error formats
in a single int.  KVM will trap-and-execute ECREATE and EINIT as part
of SGX virtualization, and reflect ENCLS execution result to guest by
setting up guest's GPRs, or on an exception, injecting the correct fault
based on return value of __ecreate() and __einit().

Use host userspace addresses (provided by KVM based on guest physical
address of ENCLS parameters) to execute ENCLS/EINIT when possible.
Accesses to both EPC and memory originating from ENCLS are subject to
segmentation and paging mechanisms.  It's also possible to generate
kernel mappings for ENCLS parameters by resolving PFN but using
__uaccess_xx() is simpler.

 [ bp: Return early if the __user memory accesses fail, use
   cpu_feature_enabled(). ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20e09daf559aa5e9e680a0b4b5fba940f1bad86e.1616136308.git.kai.huang@intel.com
2021-04-06 19:18:27 +02:00
Kai Huang 73916b6a0c x86/sgx: Add helper to update SGX_LEPUBKEYHASHn MSRs
Add a helper to update SGX_LEPUBKEYHASHn MSRs.  SGX virtualization also
needs to update those MSRs based on guest's "virtual" SGX_LEPUBKEYHASHn
before EINIT from guest.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/dfb7cd39d4dd62ea27703b64afdd8bccb579f623.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:42 +02:00
Sean Christopherson a67136b458 x86/sgx: Add encls_faulted() helper
Add a helper to extract the fault indicator from an encoded ENCLS return
value.  SGX virtualization will also need to detect ENCLS faults.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/c1f955898110de2f669da536fc6cf62e003dff88.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:42 +02:00
Sean Christopherson 32ddda8e44 x86/sgx: Add SGX2 ENCLS leaf definitions (EAUG, EMODPR and EMODT)
Define the ENCLS leafs that are available with SGX2, also referred to as
Enclave Dynamic Memory Management (EDMM).  The leafs will be used by KVM
to conditionally expose SGX2 capabilities to guests.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/5f0970c251ebcc6d5add132f0d750cc753b7060f.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:42 +02:00
Sean Christopherson 9c55c78a73 x86/sgx: Move ENCLS leaf definitions to sgx.h
Move the ENCLS leaf definitions to sgx.h so that they can be used by
KVM.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/2e6cd7c5c1ced620cfcd292c3c6c382827fde6b2.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:41 +02:00
Sean Christopherson 8ca52cc38d x86/sgx: Expose SGX architectural definitions to the kernel
Expose SGX architectural structures, as KVM will use many of the
architectural constants and structs to virtualize SGX.

Name the new header file as asm/sgx.h, rather than asm/sgx_arch.h, to
have single header to provide SGX facilities to share with other kernel
componments. Also update MAINTAINERS to include asm/sgx.h.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/6bf47acd91ab4d709e66ad1692c7803e4c9063a0.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:41 +02:00
Kai Huang faa7d3e6f3 x86/sgx: Initialize virtual EPC driver even when SGX driver is disabled
Modify sgx_init() to always try to initialize the virtual EPC driver,
even if the SGX driver is disabled.  The SGX driver might be disabled
if SGX Launch Control is in locked mode, or not supported in the
hardware at all.  This allows (non-Linux) guests that support non-LC
configurations to use SGX.

 [ bp: De-silli-fy the test. ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/d35d17a02bbf8feef83a536cec8b43746d4ea557.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:41 +02:00
Sean Christopherson 332bfc7bec x86/cpu/intel: Allow SGX virtualization without Launch Control support
The kernel will currently disable all SGX support if the hardware does
not support launch control.  Make it more permissive to allow SGX
virtualization on systems without Launch Control support.  This will
allow KVM to expose SGX to guests that have less-strict requirements on
the availability of flexible launch control.

Improve error message to distinguish between three cases.  There are two
cases where SGX support is completely disabled:
1) SGX has been disabled completely by the BIOS
2) SGX LC is locked by the BIOS.  Bare-metal support is disabled because
   of LC unavailability.  SGX virtualization is unavailable (because of
   Kconfig).
One where it is partially available:
3) SGX LC is locked by the BIOS.  Bare-metal support is disabled because
   of LC unavailability.  SGX virtualization is supported.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/b3329777076509b3b601550da288c8f3c406a865.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:41 +02:00
Sean Christopherson 540745ddbc x86/sgx: Introduce virtual EPC for use by KVM guests
Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw"
Enclave Page Cache (EPC) without an associated enclave. The intended
and only known use case for raw EPC allocation is to expose EPC to a
KVM guest, hence the 'vepc' moniker, virt.{c,h} files and X86_SGX_KVM
Kconfig.

The SGX driver uses the misc device /dev/sgx_enclave to support
userspace in creating an enclave. Each file descriptor returned from
opening /dev/sgx_enclave represents an enclave. Unlike the SGX driver,
KVM doesn't control how the guest uses the EPC, therefore EPC allocated
to a KVM guest is not associated with an enclave, and /dev/sgx_enclave
is not suitable for allocating EPC for a KVM guest.

Having separate device nodes for the SGX driver and KVM virtual EPC also
allows separate permission control for running host SGX enclaves and KVM
SGX guests.

To use /dev/sgx_vepc to allocate a virtual EPC instance with particular
size, the hypervisor opens /dev/sgx_vepc, and uses mmap() with the
intended size to get an address range of virtual EPC. Then it may use
the address range to create one KVM memory slot as virtual EPC for
a guest.

Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
/dev/sgx_vepc rather than in KVM. Doing so has two major advantages:

  - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
    just another memory backend for guests.

  - EPC management is wholly contained in the SGX subsystem, e.g. SGX
    does not have to export any symbols, changes to reclaim flows don't
    need to be routed through KVM, SGX's dirty laundry doesn't have to
    get aired out for the world to see, and so on and so forth.

The virtual EPC pages allocated to guests are currently not reclaimable.
Reclaiming an EPC page used by enclave requires a special reclaim
mechanism separate from normal page reclaim, and that mechanism is not
supported for virutal EPC pages. Due to the complications of handling
reclaim conflicts between guest and host, reclaiming virtual EPC pages
is significantly more complex than basic support for SGX virtualization.

 [ bp:
   - Massage commit message and comments
   - use cpu_feature_enabled()
   - vertically align struct members init
   - massage Virtual EPC clarification text
   - move Kconfig prompt to Virtualization ]

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/0c38ced8c8e5a69872db4d6a1c0dabd01e07cad7.1616136308.git.kai.huang@intel.com
2021-04-06 09:43:17 +02:00
Vipin Sharma 7aef27f0b2 svm/sev: Register SEV and SEV-ES ASIDs to the misc controller
Secure Encrypted Virtualization (SEV) and Secure Encrypted
Virtualization - Encrypted State (SEV-ES) ASIDs are used to encrypt KVMs
on AMD platform. These ASIDs are available in the limited quantities on
a host.

Register their capacity and usage to the misc controller for tracking
via cgroups.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-04 13:34:46 -04:00
Linus Torvalds 0a84c2e440 ACPI fixes for 5.12-rc6
- Ensure that the memory occupied by ACPI tables on x86 will always
    be reserved to prevent it from being allocated for other purposes
    which was possible in some cases (Rafael Wysocki).
 
  - Fix the ACPI device enumeration code to prevent it from attempting
    to evaluate the _STA control method for devices with unmet
    dependencies which is likely to fail (Hans de Goede).
 
  - Fix the handling of CPU0 wakeup in the ACPI processor driver to
    prevent CPU0 online failures from occurring (Vitaly Kuznetsov).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmBnNboSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxM8EP/ijQgQURrTha3167d7o1e5tABBP57qaa
 9w8biWSfzDhOY/8KvTfDGV38Hd8jmEoN1s1t6HitXIrzVFnLoI8x/1YrFCRvq9za
 rPpnneROfOSNP3KdrYa4T6IF1O/Zp5hRTpp72n3+iBVukSSbN+p8+u7Q26OW2Vgx
 OWF480ZZVgrKr1p1zjK5GzxVJV6UhM5L6rH5ZoCYGRbSaQOUgewd75/2IVhUOTKC
 Sb4ua1MNa1TXR1YFKr5GYuhrg6B4J78WIXwXgX0HxDOy6fSt7wSUK4u6vLbG8UnU
 uyyNlzhm5LYWOlJlJxfJpfzlNfukeKmONaYROmqTR3D090Zb382jkPYjJIw+VPsx
 EG5CPvqGYDW75x2kDe9p61YfXDgxWu2Qstx0Pek1oPubUXT5/WmuN10CcHm0TF3O
 j3fLwGUGByWRWOChmDVopXHyIcr1lbNm+wTYBts2AcygYfzo85ZuWtQXMUcsO9B5
 ORvz/ejFxOm62HrtN2cn5aIJg2he1dL8DgAUO7nPJsgs0k9d3BgXODNt61d+EnqZ
 4Fxs32s/6wVZQozpfEae+X3sdRpp5bSHOBOnOLTT8NGbBvrtcbrjQ6PaN3mQlbmw
 t6bnaYvO8kPwD/HvAAhmJb01alTtcGCccxReCeZLIVGFS7Cm69Zm9jTLfpaGlffF
 pGJoSYTSMxYP
 =8KTH
 -----END PGP SIGNATURE-----

Merge tag 'acpi-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "These fix an ACPI tables management issue, an issue related to the
  ACPI enumeration of devices and CPU wakeup in the ACPI processor
  driver.

  Specifics:

   - Ensure that the memory occupied by ACPI tables on x86 will always
     be reserved to prevent it from being allocated for other purposes
     which was possible in some cases (Rafael Wysocki).

   - Fix the ACPI device enumeration code to prevent it from attempting
     to evaluate the _STA control method for devices with unmet
     dependencies which is likely to fail (Hans de Goede).

   - Fix the handling of CPU0 wakeup in the ACPI processor driver to
     prevent CPU0 online failures from occurring (Vitaly Kuznetsov)"

* tag 'acpi-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
  ACPI: scan: Fix _STA getting called on devices with unmet dependencies
  ACPI: tables: x86: Reserve memory occupied by ACPI tables
2021-04-02 15:34:17 -07:00
Zheng Yongjun 90b9bfa470 x86/hyperv: remove unused linux/version.h header
That header is not needed in hv_proc.c.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yongjun Zheng <zhengyongjun3@huawei.com>
Link: https://lore.kernel.org/r/20210326064942.3263776-1-zhengyongjun3@huawei.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-04-02 22:07:26 +00:00
Rafael J. Wysocki 91463ebff3 Merge branches 'acpi-tables' and 'acpi-scan'
* acpi-tables:
  ACPI: tables: x86: Reserve memory occupied by ACPI tables

* acpi-scan:
  ACPI: scan: Fix _STA getting called on devices with unmet dependencies
2021-04-02 16:57:56 +02:00
Paolo Bonzini 657f1d86a3 Merge branch 'kvm-tdp-fix-rcu' into HEAD 2021-04-02 07:25:32 -04:00
Paolo Bonzini 57e45ea487 Merge branch 'kvm-tdp-fix-flushes' into HEAD 2021-04-02 07:24:54 -04:00
Peter Zijlstra 9bc0bb5072 objtool/x86: Rewrite retpoline thunk calls
When the compiler emits: "CALL __x86_indirect_thunk_\reg" for an
indirect call, have objtool rewrite it to:

	ALTERNATIVE "call __x86_indirect_thunk_\reg",
		    "call *%reg", ALT_NOT(X86_FEATURE_RETPOLINE)

Additionally, in order to not emit endless identical
.altinst_replacement chunks, use a global symbol for them, see
__x86_indirect_alt_*.

This also avoids objtool from having to do code generation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lkml.kernel.org/r/20210326151300.320177914@infradead.org
2021-04-02 12:47:28 +02:00
Peter Zijlstra 119251855f x86/retpoline: Simplify retpolines
Due to:

  c9c324dc22 ("objtool: Support stack layout changes in alternatives")

it is now possible to simplify the retpolines.

Currently our retpolines consist of 2 symbols:

 - __x86_indirect_thunk_\reg: the compiler target
 - __x86_retpoline_\reg:  the actual retpoline.

Both are consecutive in code and aligned such that for any one register
they both live in the same cacheline:

  0000000000000000 <__x86_indirect_thunk_rax>:
   0:   ff e0                   jmpq   *%rax
   2:   90                      nop
   3:   90                      nop
   4:   90                      nop

  0000000000000005 <__x86_retpoline_rax>:
   5:   e8 07 00 00 00          callq  11 <__x86_retpoline_rax+0xc>
   a:   f3 90                   pause
   c:   0f ae e8                lfence
   f:   eb f9                   jmp    a <__x86_retpoline_rax+0x5>
  11:   48 89 04 24             mov    %rax,(%rsp)
  15:   c3                      retq
  16:   66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)

The thunk is an alternative_2, where one option is a JMP to the
retpoline. This was done so that objtool didn't need to deal with
alternatives with stack ops. But that problem has been solved, so now
it is possible to fold the entire retpoline into the alternative to
simplify and consolidate unused bytes:

  0000000000000000 <__x86_indirect_thunk_rax>:
   0:   ff e0                   jmpq   *%rax
   2:   90                      nop
   3:   90                      nop
   4:   90                      nop
   5:   90                      nop
   6:   90                      nop
   7:   90                      nop
   8:   90                      nop
   9:   90                      nop
   a:   90                      nop
   b:   90                      nop
   c:   90                      nop
   d:   90                      nop
   e:   90                      nop
   f:   90                      nop
  10:   90                      nop
  11:   66 66 2e 0f 1f 84 00 00 00 00 00        data16 nopw %cs:0x0(%rax,%rax,1)
  1c:   0f 1f 40 00             nopl   0x0(%rax)

Notice that since the longest alternative sequence is now:

   0:   e8 07 00 00 00          callq  c <.altinstr_replacement+0xc>
   5:   f3 90                   pause
   7:   0f ae e8                lfence
   a:   eb f9                   jmp    5 <.altinstr_replacement+0x5>
   c:   48 89 04 24             mov    %rax,(%rsp)
  10:   c3                      retq

17 bytes, we have 15 bytes NOP at the end of our 32 byte slot. (IOW, if
we can shrink the retpoline by 1 byte we can pack it more densely).

 [ bp: Massage commit message. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210326151259.506071949@infradead.org
2021-04-02 12:42:04 +02:00
Peter Zijlstra 23c1ad538f x86/alternatives: Optimize optimize_nops()
Currently, optimize_nops() scans to see if the alternative starts with
NOPs. However, the emit pattern is:

  141:	\oldinstr
  142:	.skip (len-(142b-141b)), 0x90

That is, when 'oldinstr' is short, the tail is padded with NOPs. This case
never gets optimized.

Rewrite optimize_nops() to replace any trailing string of NOPs inside
the alternative to larger NOPs. Also run it irrespective of patching,
replacing NOPs in both the original and replaced code.

A direct consequence is that 'padlen' becomes superfluous, so remove it.

 [ bp:
   - Adjust commit message
   - remove a stale comment about needing to pad
   - add a comment in optimize_nops()
   - exit early if the NOP verif. loop catches a mismatch - function
     should not not add NOPs in that case
   - fix the "optimized NOPs" offsets output ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210326151259.442992235@infradead.org
2021-04-02 12:41:17 +02:00
Ingo Molnar b1f480bc06 Merge branch 'x86/cpu' into WIP.x86/core, to merge the NOP changes & resolve a semantic conflict
Conflict-merge this main commit in essence:

  a89dfde3dc3c: ("x86: Remove dynamic NOP selection")

With this upstream commit:

  b90829704780: ("bpf: Use NOP_ATOMIC5 instead of emit_nops(&prog, 5) for BPF_TRAMP_F_CALL_ORIG")

Semantic merge conflict:

  arch/x86/net/bpf_jit_comp.c

  - memcpy(prog, ideal_nops[NOP_ATOMIC5], X86_PATCH_SIZE);
  + memcpy(prog, x86_nops[5], X86_PATCH_SIZE);

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-02 12:36:30 +02:00
Ingo Molnar e855e80d00 Linux 5.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmBhB7AeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGCPUH+KKkSoOlN2YNu1oc
 iy2nznwZoSQTk5ZLz7PypO/WWmmtgzudkObG7yqIURdrncsAkHR17Wu2P7rdBr1j
 Ma+VhF9MQ+xx+r86upH7c3gYfhyfdUMvzuLy0rwLQ1Yrzrb7xFcVkj3BHk54TAQA
 w05sRPuVJ3/c/HPYV2iXkkdnnMbXSTCebeDDwjFb9D3qagr4vcd/PjDHmGbfNF8R
 o6gLpbK5Ly6ww1nth9gGGUjzrW95yVItvcroP6vQWljxhuy+NE1lXRm8LsGhxqtW
 foFFptJup5nhSNJXWtQt/U3huVD6mZ3W3y9cOThPjXZRy2wva3I1IpBKoEFReUpG
 /Tq8EA==
 =tPUY
 -----END PGP SIGNATURE-----

Merge tag 'v5.12-rc5' into WIP.x86/core, to pick up recent NOP related changes

In particular we want to have this upstream commit:

  b90829704780: ("bpf: Use NOP_ATOMIC5 instead of emit_nops(&prog, 5) for BPF_TRAMP_F_CALL_ORIG")

... before merging in x86/cpu changes and the removal of the NOP optimizations, and
applying PeterZ's !retpoline objtool series.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-02 12:33:16 +02:00
Alexander Antonov cface0326a perf/x86/intel/uncore: Enable IIO stacks to PMON mapping for multi-segment SKX
IIO stacks to PMON mapping on Skylake servers is exposed through introduced
early attributes /sys/devices/uncore_iio_<pmu_idx>/dieX, where dieX is a
file which holds "Segment:Root Bus" for PCIe root port which can
be monitored by that IIO PMON block. These sysfs attributes are disabled
for multiple segment topologies except VMD domains which start at 0x10000.
This patch removes the limitation and enables IIO stacks to PMON mapping
for multi-segment Skylake servers by introducing segment-aware
intel_uncore_topology structure and attributing the topology configuration
to the segment in skx_iio_get_topology() function.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://lkml.kernel.org/r/20210323150507.2013-1-alexander.antonov@linux.intel.com
2021-04-02 10:04:55 +02:00
Kan Liang c4c55e362a perf/x86/intel/uncore: Generic support for the MMIO type of uncore blocks
The discovery table provides the generic uncore block information
for the MMIO type of uncore blocks, which is good enough to provide
basic uncore support.

The box control field is composed of the BAR address and box control
offset. When initializing the uncore blocks, perf should ioremap the
address from the box control field.

Implement the generic support for the MMIO type of uncore block.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1616003977-90612-6-git-send-email-kan.liang@linux.intel.com
2021-04-02 10:04:55 +02:00
Kan Liang 42839ef4a2 perf/x86/intel/uncore: Generic support for the PCI type of uncore blocks
The discovery table provides the generic uncore block information
for the PCI type of uncore blocks, which is good enough to provide
basic uncore support.

The PCI BUS and DEVFN information can be retrieved from the box control
field. Introduce the uncore_pci_pmus_register() to register all the
PCICFG type of uncore blocks. The old PCI probe/remove way is dropped.

The PCI BUS and DEVFN information are different among dies. Add box_ctls
to store the box control field of each die.

Add a new BUS notifier for the PCI type of uncore block to support the
hotplug. If the device is "hot remove", the corresponding registered PMU
has to be unregistered. Perf cannot locate the PMU by searching a const
pci_device_id table, because the discovery tables don't provide such
information. Introduce uncore_pci_find_dev_pmu_from_types() to search
the whole uncore_pci_uncores for the PMU.

Implement generic support for the PCI type of uncore block.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1616003977-90612-5-git-send-email-kan.liang@linux.intel.com
2021-04-02 10:04:55 +02:00
Kan Liang 6477dc3934 perf/x86/intel/uncore: Rename uncore_notifier to uncore_pci_sub_notifier
Perf will use a similar method to the PCI sub driver to register
the PMUs for the PCI type of uncore blocks. The method requires a BUS
notifier to support hotplug. The current BUS notifier cannot be reused,
because it searches a const id_table for the corresponding registered
PMU. The PCI type of uncore blocks in the discovery tables doesn't
provide an id_table.

Factor out uncore_bus_notify() and add the pointer of an id_table as a
parameter. The uncore_bus_notify() will be reused in the following
patch.

The current BUS notifier is only used by the PCI sub driver. Its name is
too generic. Rename it to uncore_pci_sub_notifier, which is specific for
the PCI sub driver.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1616003977-90612-4-git-send-email-kan.liang@linux.intel.com
2021-04-02 10:04:54 +02:00
Kan Liang d6c7541304 perf/x86/intel/uncore: Generic support for the MSR type of uncore blocks
The discovery table provides the generic uncore block information for
the MSR type of uncore blocks, e.g., the counter width, the number of
counters, the location of control/counter registers, which is good
enough to provide basic uncore support. It can be used as a fallback
solution when the kernel doesn't support a platform.

The name of the uncore box cannot be retrieved from the discovery table.
uncore_type_&typeID_&boxID will be used as its name. Save the type ID
and the box ID information in the struct intel_uncore_type.
Factor out uncore_get_pmu_name() to handle different naming methods.

Implement generic support for the MSR type of uncore block.

Some advanced features, such as filters and constraints, cannot be
retrieved from discovery tables. Features that rely on that
information are not be supported here.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1616003977-90612-3-git-send-email-kan.liang@linux.intel.com
2021-04-02 10:04:54 +02:00
Kan Liang edae1f06c2 perf/x86/intel/uncore: Parse uncore discovery tables
A self-describing mechanism for the uncore PerfMon hardware has been
introduced with the latest Intel platforms. By reading through an MMIO
page worth of information, perf can 'discover' all the standard uncore
PerfMon registers in a machine.

The discovery mechanism relies on BIOS's support. With a proper BIOS,
a PCI device with the unique capability ID 0x23 can be found on each
die. Perf can retrieve the information of all available uncore PerfMons
from the device via MMIO. The information is composed of one global
discovery table and several unit discovery tables.
- The global discovery table includes global uncore information of the
  die, e.g., the address of the global control register, the offset of
  the global status register, the number of uncore units, the offset of
  unit discovery tables, etc.
- The unit discovery table includes generic uncore unit information,
  e.g., the access type, the counter width, the address of counters,
  the address of the counter control, the unit ID, the unit type, etc.
  The unit is also called "box" in the code.
Perf can provide basic uncore support based on this information
with the following patches.

To locate the PCI device with the discovery tables, check the generic
PCI ID first. If it doesn't match, go through the entire PCI device tree
and locate the device with the unique capability ID.

The uncore information is similar among dies. To save parsing time and
space, only completely parse and store the discovery tables on the first
die and the first box of each die. The parsed information is stored in
an
RB tree structure, intel_uncore_discovery_type. The size of the stored
discovery tables varies among platforms. It's around 4KB for a Sapphire
Rapids server.

If a BIOS doesn't support the 'discovery' mechanism, the uncore driver
will exit with -ENODEV. There is nothing changed.

Add a module parameter to disable the discovery feature. If a BIOS gets
the discovery tables wrong, users can have an option to disable the
feature. For the current patchset, the uncore driver will exit with
-ENODEV. In the future, it may fall back to the hardcode uncore driver
on a known platform.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1616003977-90612-2-git-send-email-kan.liang@linux.intel.com
2021-04-02 10:04:54 +02:00
Arnd Bergmann 8d195e7a8a crypto: poly1305 - fix poly1305_core_setkey() declaration
gcc-11 points out a mismatch between the declaration and the definition
of poly1305_core_setkey():

lib/crypto/poly1305-donna32.c:13:67: error: argument 2 of type ‘const u8[16]’ {aka ‘const unsigned char[16]’} with mismatched bound [-Werror=array-parameter=]
   13 | void poly1305_core_setkey(struct poly1305_core_key *key, const u8 raw_key[16])
      |                                                          ~~~~~~~~~^~~~~~~~~~~
In file included from lib/crypto/poly1305-donna32.c:11:
include/crypto/internal/poly1305.h:21:68: note: previously declared as ‘const u8 *’ {aka ‘const unsigned char *’}
   21 | void poly1305_core_setkey(struct poly1305_core_key *key, const u8 *raw_key);

This is harmless in principle, as the calling conventions are the same,
but the more specific prototype allows better type checking in the
caller.

Change the declaration to match the actual function definition.
The poly1305_simd_init() is a bit suspicious here, as it previously
had a 32-byte argument type, but looks like it needs to take the
16-byte POLY1305_BLOCK_SIZE array instead.

Fixes: 1c08a10436 ("crypto: poly1305 - add new 32 and 64-bit generic versions")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-04-02 18:28:12 +11:00
Vitaly Kuznetsov 8cdddd182b ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
Commit 496121c021 ("ACPI: processor: idle: Allow probing on platforms
with one ACPI C-state") broke CPU0 hotplug on certain systems, e.g.
I'm observing the following on AWS Nitro (e.g r5b.xlarge but other
instance types are affected as well):

 # echo 0 > /sys/devices/system/cpu/cpu0/online
 # echo 1 > /sys/devices/system/cpu/cpu0/online
 <10 seconds delay>
 -bash: echo: write error: Input/output error

In fact, the above mentioned commit only revealed the problem and did
not introduce it. On x86, to wakeup CPU an NMI is being used and
hlt_play_dead()/mwait_play_dead() loops are prepared to handle it:

	/*
	 * If NMI wants to wake up CPU0, start CPU0.
	 */
	if (wakeup_cpu0())
		start_cpu0();

cpuidle_play_dead() -> acpi_idle_play_dead() (which is now being called on
systems where it wasn't called before the above mentioned commit) serves
the same purpose but it doesn't have a path for CPU0. What happens now on
wakeup is:
 - NMI is sent to CPU0
 - wakeup_cpu0_nmi() works as expected
 - we get back to while (1) loop in acpi_idle_play_dead()
 - safe_halt() puts CPU0 to sleep again.

The straightforward/minimal fix is add the special handling for CPU0 on x86
and that's what the patch is doing.

Fixes: 496121c021 ("ACPI: processor: idle: Allow probing on platforms with one ACPI C-state")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-01 13:37:55 +02:00
Paolo Bonzini cb9b6a1b19 Merge branch 'kvm-fix-svm-races' into HEAD 2021-04-01 05:19:48 -04:00
Vitaly Kuznetsov 77fcbe823f KVM: x86: Prevent 'hv_clock->system_time' from going negative in kvm_guest_time_update()
When guest time is reset with KVM_SET_CLOCK(0), it is possible for
'hv_clock->system_time' to become a small negative number. This happens
because in KVM_SET_CLOCK handling we set 'kvm->arch.kvmclock_offset' based
on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
kvm_guest_time_update() does (masterclock in use case):

hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;

And 'master_kernel_ns' represents the last time when masterclock
got updated, it can precede KVM_SET_CLOCK() call. Normally, this is not a
problem, the difference is very small, e.g. I'm observing
hv_clock.system_time = -70 ns. The issue comes from the fact that
'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in
compute_tsc_page_parameters() becomes a very big number.

Use 'master_kernel_ns' instead of get_kvmclock_ns() when masterclock is in
use and get_kvmclock_base_ns() when it's not to prevent 'system_time' from
going negative.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210331124130.337992-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-01 05:14:19 -04:00
Paolo Bonzini a83829f56c KVM: x86: disable interrupts while pvclock_gtod_sync_lock is taken
pvclock_gtod_sync_lock can be taken with interrupts disabled if the
preempt notifier calls get_kvmclock_ns to update the Xen
runstate information:

   spin_lock include/linux/spinlock.h:354 [inline]
   get_kvmclock_ns+0x25/0x390 arch/x86/kvm/x86.c:2587
   kvm_xen_update_runstate+0x3d/0x2c0 arch/x86/kvm/xen.c:69
   kvm_xen_update_runstate_guest+0x74/0x320 arch/x86/kvm/xen.c:100
   kvm_xen_runstate_set_preempted arch/x86/kvm/xen.h:96 [inline]
   kvm_arch_vcpu_put+0x2d8/0x5a0 arch/x86/kvm/x86.c:4062

So change the users of the spinlock to spin_lock_irqsave and
spin_unlock_irqrestore.

Reported-by: syzbot+b282b65c2c68492df769@syzkaller.appspotmail.com
Fixes: 30b5c851af ("KVM: x86/xen: Add support for vCPU runstate information")
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-01 05:14:19 -04:00
Paolo Bonzini c2c647f91a KVM: x86: reduce pvclock_gtod_sync_lock critical sections
There is no need to include changes to vcpu->requests into
the pvclock_gtod_sync_lock critical section.  The changes to
the shared data structures (in pvclock_update_vm_gtod_copy)
already occur under the lock.

Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-01 05:14:19 -04:00
Paolo Bonzini 6ebae23c07 Merge branch 'kvm-fix-svm-races' into kvm-master 2021-04-01 05:14:05 -04:00
Paolo Bonzini 3c346c0c60 KVM: SVM: ensure that EFER.SVME is set when running nested guest or on nested vmexit
Fixing nested_vmcb_check_save to avoid all TOC/TOU races
is a bit harder in released kernels, so do the bare minimum
by avoiding that EFER.SVME is cleared.  This is problematic
because svm_set_efer frees the data structures for nested
virtualization if EFER.SVME is cleared.

Also check that EFER.SVME remains set after a nested vmexit;
clearing it could happen if the bit is zero in the save area
that is passed to KVM_SET_NESTED_STATE (the save area of the
nested state corresponds to the nested hypervisor's state
and is restored on the next nested vmexit).

Cc: stable@vger.kernel.org
Fixes: 2fcf4876ad ("KVM: nSVM: implement on demand allocation of the nested state")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-01 05:11:35 -04:00
Paolo Bonzini a58d9166a7 KVM: SVM: load control fields from VMCB12 before checking them
Avoid races between check and use of the nested VMCB controls.  This
for example ensures that the VMRUN intercept is always reflected to the
nested hypervisor, instead of being processed by the host.  Without this
patch, it is possible to end up with svm->nested.hsave pointing to
the MSR permission bitmap for nested guests.

This bug is CVE-2021-29657.

Reported-by: Felix Wilhelm <fwilhelm@google.com>
Cc: stable@vger.kernel.org
Fixes: 2fcf4876ad ("KVM: nSVM: implement on demand allocation of the nested state")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-01 05:09:31 -04:00
Borislav Petkov f2ac256b9a Merge 'x86/alternatives'
Pick up dependent changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
2021-03-31 18:04:19 +02:00
Peter Zijlstra 52fa82c21f x86: Add insn_decode_kernel()
Add a helper to decode kernel instructions; there's no point in
endlessly repeating those last two arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210326151259.379242587@infradead.org
2021-03-31 16:20:22 +02:00
Paolo Bonzini 825e34d3c9 Merge commit 'kvm-tdp-fix-flushes' into kvm-master 2021-03-31 07:45:41 -04:00
Sean Christopherson 33a3164161 KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
Prevent the TDP MMU from yielding when zapping a gfn range during NX
page recovery.  If a flush is pending from a previous invocation of the
zapping helper, either in the TDP MMU or the legacy MMU, but the TDP MMU
has not accumulated a flush for the current invocation, then yielding
will release mmu_lock with stale TLB entries.

That being said, this isn't technically a bug fix in the current code, as
the TDP MMU will never yield in this case.  tdp_mmu_iter_cond_resched()
will yield if and only if it has made forward progress, as defined by the
current gfn vs. the last yielded (or starting) gfn.  Because zapping a
single shadow page is guaranteed to (a) find that page and (b) step
sideways at the level of the shadow page, the TDP iter will break its loop
before getting a chance to yield.

But that is all very, very subtle, and will break at the slightest sneeze,
e.g. zapping while holding mmu_lock for read would break as the TDP MMU
wouldn't be guaranteed to see the present shadow page, and thus could step
sideways at a lower level.

Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210325200119.1359384-4-seanjc@google.com>
[Add lockdep assertion. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-30 13:19:56 -04:00
Sean Christopherson 048f49809c KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
Honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(), which
does the flush itself if and only if it yields (which it will never do in
this particular scenario), and otherwise expects the caller to do the
flush.  If pages are zapped from the TDP MMU but not the legacy MMU, then
no flush will occur.

Fixes: 29cf0f5007 ("kvm: x86/mmu: NX largepage recovery for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210325200119.1359384-3-seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-30 13:19:55 -04:00
Sean Christopherson a835429cda KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap
When flushing a range of GFNs across multiple roots, ensure any pending
flush from a previous root is honored before yielding while walking the
tables of the current root.

Note, kvm_tdp_mmu_zap_gfn_range() now intentionally overwrites its local
"flush" with the result to avoid redundant flushes.  zap_gfn_range()
preserves and return the incoming "flush", unless of course the flush was
performed prior to yielding and no new flush was triggered.

Fixes: 1af4a96025 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Cc: stable@vger.kernel.org
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210325200119.1359384-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-30 13:19:55 -04:00
Siddharth Chandrasekaran 6fb3084ab5 KVM: make: Fix out-of-source module builds
Building kvm module out-of-source with,

    make -C $SRC O=$BIN M=arch/x86/kvm

fails to find "irq.h" as the include dir passed to cflags-y does not
prefix the source dir. Fix this by prefixing $(srctree) to the include
dir path.

Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <20210324124347.18336-1-sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-30 13:07:10 -04:00
Vitaly Kuznetsov 1973cadd4c KVM: x86/vPMU: Forbid writing to MSR_F15H_PERF MSRs when guest doesn't have X86_FEATURE_PERFCTR_CORE
MSR_F15H_PERF_CTL0-5, MSR_F15H_PERF_CTR0-5 MSRs are only available when
X86_FEATURE_PERFCTR_CORE CPUID bit was exposed to the guest. KVM, however,
allows these MSRs unconditionally because kvm_pmu_is_valid_msr() ->
amd_msr_idx_to_pmc() check always passes and because kvm_pmu_set_msr() ->
amd_pmu_set_msr() doesn't fail.

In case of a counter (CTRn), no big harm is done as we only increase
internal PMC's value but in case of an eventsel (CTLn), we go deep into
perf internals with a non-existing counter.

Note, kvm_get_msr_common() just returns '0' when these MSRs don't exist
and this also seems to contradict architectural behavior which is #GP
(I did check one old Opteron host) but changing this status quo is a bit
scarier.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210323084515.1346540-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-30 13:07:10 -04:00
Dongli Zhang ecaf088f53 KVM: x86: remove unused declaration of kvm_write_tsc()
kvm_write_tsc() was renamed and made static since commit 0c899c25d7
("KVM: x86: do not attempt TSC synchronization on guest writes"). Remove
its unused declaration.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Message-Id: <20210326070334.12310-1-dongli.zhang@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-30 13:07:09 -04:00
Haiwei Li d632826f26 KVM: clean up the unused argument
kvm_msr_ignored_check function never uses vcpu argument. Clean up the
function and invokers.

Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <20210313051032.4171-1-lihaiwei.kernel@gmail.com>
Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-30 13:07:09 -04:00
Thomas Gleixner 9a98bc2cf0 x86/vector: Add a sanity check to prevent IRQ2 allocations
To prevent another incidental removal of the IRQ2 ignore logic in the
IO/APIC code going unnoticed add a sanity check. Add some commentry at the
other place which ignores IRQ2 while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210318192819.795280387@linutronix.de
2021-03-30 00:39:12 +02:00
Rafael J. Wysocki 1a1c130ab7 ACPI: tables: x86: Reserve memory occupied by ACPI tables
The following problem has been reported by George Kennedy:

 Since commit 7fef431be9 ("mm/page_alloc: place pages to tail
 in __free_pages_core()") the following use after free occurs
 intermittently when ACPI tables are accessed.

 BUG: KASAN: use-after-free in ibft_init+0x134/0xc49
 Read of size 4 at addr ffff8880be453004 by task swapper/0/1
 CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc1-7a7fd0d #1
 Call Trace:
  dump_stack+0xf6/0x158
  print_address_description.constprop.9+0x41/0x60
  kasan_report.cold.14+0x7b/0xd4
  __asan_report_load_n_noabort+0xf/0x20
  ibft_init+0x134/0xc49
  do_one_initcall+0xc4/0x3e0
  kernel_init_freeable+0x5af/0x66b
  kernel_init+0x16/0x1d0
  ret_from_fork+0x22/0x30

 ACPI tables mapped via kmap() do not have their mapped pages
 reserved and the pages can be "stolen" by the buddy allocator.

Apparently, on the affected system, the ACPI table in question is
not located in "reserved" memory, like ACPI NVS or ACPI Data, that
will not be used by the buddy allocator, so the memory occupied by
that table has to be explicitly reserved to prevent the buddy
allocator from using it.

In order to address this problem, rearrange the initialization of the
ACPI tables on x86 to locate the initial tables earlier and reserve
the memory occupied by them.

The other architectures using ACPI should not be affected by this
change.

Link: https://lore.kernel.org/linux-acpi/1614802160-29362-1-git-send-email-george.kennedy@oracle.com/
Reported-by: George Kennedy <george.kennedy@oracle.com>
Tested-by: George Kennedy <george.kennedy@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
2021-03-29 19:26:04 +02:00
Ingo Molnar feecb81732 Linux 5.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmBhB7AeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGCPUH+KKkSoOlN2YNu1oc
 iy2nznwZoSQTk5ZLz7PypO/WWmmtgzudkObG7yqIURdrncsAkHR17Wu2P7rdBr1j
 Ma+VhF9MQ+xx+r86upH7c3gYfhyfdUMvzuLy0rwLQ1Yrzrb7xFcVkj3BHk54TAQA
 w05sRPuVJ3/c/HPYV2iXkkdnnMbXSTCebeDDwjFb9D3qagr4vcd/PjDHmGbfNF8R
 o6gLpbK5Ly6ww1nth9gGGUjzrW95yVItvcroP6vQWljxhuy+NE1lXRm8LsGhxqtW
 foFFptJup5nhSNJXWtQt/U3huVD6mZ3W3y9cOThPjXZRy2wva3I1IpBKoEFReUpG
 /Tq8EA==
 =tPUY
 -----END PGP SIGNATURE-----

Merge tag 'v5.12-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-29 15:56:48 +02:00
Fenghua Yu ebb1064e7c x86/traps: Handle #DB for bus lock
Bus locks degrade performance for the whole system, not just for the CPU
that requested the bus lock. Two CPU features "#AC for split lock" and
"#DB for bus lock" provide hooks so that the operating system may choose
one of several mitigation strategies.

#AC for split lock is already implemented. Add code to use the #DB for
bus lock feature to cover additional situations with new options to
mitigate.

split_lock_detect=
		#AC for split lock		#DB for bus lock

off		Do nothing			Do nothing

warn		Kernel OOPs			Warn once per task and
		Warn once per task and		and continues to run.
		disable future checking
	 	When both features are
		supported, warn in #AC

fatal		Kernel OOPs			Send SIGBUS to user.
		Send SIGBUS to user
		When both features are
		supported, fatal in #AC

ratelimit:N	Do nothing			Limit bus lock rate to
						N per second in the
						current non-root user.

Default option is "warn".

Hardware only generates #DB for bus lock detect when CPL>0 to avoid
nested #DB from multiple bus locks while the first #DB is being handled.
So no need to handle #DB for bus lock detected in the kernel.

#DB for bus lock is enabled by bus lock detection bit 2 in DEBUGCTL MSR
while #AC for split lock is enabled by split lock detection bit 29 in
TEST_CTRL MSR.

Both breakpoint and bus lock in the same instruction can trigger one #DB.
The bus lock is handled before the breakpoint in the #DB handler.

Delivery of #DB for bus lock in userspace clears DR6[11], which is set by
the #DB handler right after reading DR6.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210322135325.682257-3-fenghua.yu@intel.com
2021-03-28 22:52:15 +02:00
Fenghua Yu f21d4d3b97 x86/cpufeatures: Enumerate #DB for bus lock detection
A bus lock is acquired through either a split locked access to writeback
(WB) memory or any locked access to non-WB memory. This is typically >1000
cycles slower than an atomic operation within a cache line. It also
disrupts performance on other cores.

Some CPUs have the ability to notify the kernel by a #DB trap after a user
instruction acquires a bus lock and is executed. This allows the kernel to
enforce user application throttling or mitigation. Both breakpoint and bus
lock can trigger the #DB trap in the same instruction and the ordering of
handling them is the kernel #DB handler's choice.

The CPU feature flag to be shown in /proc/cpuinfo will be "bus_lock_detect".

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210322135325.682257-2-fenghua.yu@intel.com
2021-03-28 22:52:14 +02:00
Lai Jiangshan 1591584e2e x86/process/64: Move cpu_current_top_of_stack out of TSS
cpu_current_top_of_stack is currently stored in TSS.sp1. TSS is exposed
through the cpu_entry_area which is visible with user CR3 when PTI is
enabled and active.

This makes it a coveted fruit for attackers.  An attacker can fetch the
kernel stack top from it and continue next steps of actions based on the
kernel stack.

But it is actualy not necessary to be stored in the TSS.  It is only
accessed after the entry code switched to kernel CR3 and kernel GS_BASE
which means it can be in any regular percpu variable.

The reason why it is in TSS is historical (pre PTI) because TSS is also
used as scratch space in SYSCALL_64 and therefore cache hot.

A syscall also needs the per CPU variable current_task and eventually
__preempt_count, so placing cpu_current_top_of_stack next to them makes it
likely that they end up in the same cache line which should avoid
performance regressions. This is not enforced as the compiler is free to
place these variables, so these entry relevant variables should move into
a data structure to make this enforceable.

The seccomp_benchmark doesn't show any performance loss in the "getpid
native" test result.  Actually, the result changes from 93ns before to 92ns
with this change when KPTI is disabled. The test is very stable and
although the test doesn't show a higher degree of precision it gives enough
confidence that moving cpu_current_top_of_stack does not cause a
regression.

[ tglx: Removed unneeded export. Massaged changelog ]

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210125173444.22696-2-jiangshanlai@gmail.com
2021-03-28 22:40:10 +02:00
Linus Torvalds 36a14638f7 Two fixes:
- Fix build failure on Ubuntu with new GCC packages that turn on -fcf-protection
 
  - Fix SME memory encryption PTE encoding bug - AFAICT the code worked on
    4K page sizes (level 1) but had the wrong shift at higher page level orders
    (level 2 and higher).
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmBgXdERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gJWxAAgAOWwGY3yq3kUEtIExosXZPlHCFjal3N
 UXoVpQde4aBeZ9A4flMjkSZmTF5PVEN2npMz8ltnxU8NUJg4QR68UYiIE8BReARg
 +JuyNXGdAu1XyT+dWdTFqL9xgA9t8dG13o4WbBqGDZagnLNuvjYhzJtsgw9FbNWZ
 a1abBbcxpoZvSyQBHyqtuwoiWeeeFJiQZ02wZwxtonYHWVbBXEN5WhFL9Tc2kDJc
 /Ic09O+FDhpe3I/PvCiMrkpVJuBnaDdve5zDPDzR+FRMeAj4AhNLIJiMFj17bJWD
 eR6vCDoFz3EsbSdJz0XvHIZOSZnaiiC0ybTEv5nJTiRgDk+s6JDXWwDcJG+3yKJR
 Fm5TLlnaU++E9lYLpyCbgrWkrv0F2u3GmnieFnOOyzRv8NlkZqrThApf3xGsavy+
 qJZnXe5ftWp+mmIDw4DZDBVsJ8rBIflvURQxfG3SHkUc0iVsyUCrAK2eKYewk/dN
 eC6FVPkCdx4Ys50wb+OR9Enhq3yKFyRuZ2zIeguUX30sqoapJL85M1vglS5DFoX/
 pHcigRzBzFQOZhOh8Kq3VREOx0F+ioUfcZzmYdzjWSfXfpvqWFcLAIFgOv1hDfms
 XQ60X/voG0tWd0ODKXqyx6oa0rqamigPjLJp/gtDKpQHORFaabvnTJTLwN6n8N1Q
 syTWRiHMhi0=
 =tM9n
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2021-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
 "Two fixes:

   - Fix build failure on Ubuntu with new GCC packages that turn
     on -fcf-protection

   - Fix SME memory encryption PTE encoding bug - AFAICT the code
     worked on 4K page sizes (level 1) but had the wrong shift at
     higher page level orders (level 2 and higher)"

* tag 'x86-urgent-2021-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/build: Turn off -fcf-protection for realmode targets
  x86/mem_encrypt: Correct physical address calculation in __set_clr_pte_enc()
2021-03-28 12:19:16 -07:00
Alexey Makhalov 0b4a285e2c x86/vmware: Avoid TSC recalibration when frequency is known
When the TSC frequency is known because it is retrieved from the
hypervisor, skip TSC refined calibration by setting X86_FEATURE_TSC_KNOWN_FREQ.

Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210105004752.131069-1-amakhalov@vmware.com
2021-03-28 21:11:43 +02:00
Martin KaFai Lau 797b84f727 bpf: Support kernel function call in x86-32
This patch adds kernel function call support to the x86-32 bpf jit.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015149.1545267-1-kafai@fb.com
2021-03-26 20:41:51 -07:00
Martin KaFai Lau e6ac2450d6 bpf: Support bpf program calling kernel function
This patch adds support to BPF verifier to allow bpf program calling
kernel function directly.

The use case included in this set is to allow bpf-tcp-cc to directly
call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
functions have already been used by some kernel tcp-cc implementations.

This set will also allow the bpf-tcp-cc program to directly call the
kernel tcp-cc implementation,  For example, a bpf_dctcp may only want to
implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
from the kernel tcp_dctcp.c instead of reimplementing (or
copy-and-pasting) them.

The tcp-cc kernel functions mentioned above will be white listed
for the struct_ops bpf-tcp-cc programs to use in a later patch.
The white listed functions are not bounded to a fixed ABI contract.
Those functions have already been used by the existing kernel tcp-cc.
If any of them has changed, both in-tree and out-of-tree kernel tcp-cc
implementations have to be changed.  The same goes for the struct_ops
bpf-tcp-cc programs which have to be adjusted accordingly.

This patch is to make the required changes in the bpf verifier.

First change is in btf.c, it adds a case in "btf_check_func_arg_match()".
When the passed in "btf->kernel_btf == true", it means matching the
verifier regs' states with a kernel function.  This will handle the
PTR_TO_BTF_ID reg.  It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET,
and PTR_TO_TCP_SOCK to its kernel's btf_id.

In the later libbpf patch, the insn calling a kernel function will
look like:

insn->code == (BPF_JMP | BPF_CALL)
insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
insn->imm == func_btf_id /* btf_id of the running kernel */

[ For the future calling function-in-kernel-module support, an array
  of module btf_fds can be passed at the load time and insn->off
  can be used to index into this array. ]

At the early stage of verifier, the verifier will collect all kernel
function calls into "struct bpf_kfunc_desc".  Those
descriptors are stored in "prog->aux->kfunc_tab" and will
be available to the JIT.  Since this "add" operation is similar
to the current "add_subprog()" and looking for the same insn->code,
they are done together in the new "add_subprog_and_kfunc()".

In the "do_check()" stage, the new "check_kfunc_call()" is added
to verify the kernel function call instruction:
1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE.
   A new bpf_verifier_ops "check_kfunc_call" is added to do that.
   The bpf-tcp-cc struct_ops program will implement this function in
   a later patch.
2. Call "btf_check_kfunc_args_match()" to ensure the regs can be
   used as the args of a kernel function.
3. Mark the regs' type, subreg_def, and zext_dst.

At the later do_misc_fixups() stage, the new fixup_kfunc_call()
will replace the insn->imm with the function address (relative
to __bpf_call_base).  If needed, the jit can find the btf_func_model
by calling the new bpf_jit_find_kfunc_model(prog, insn).
With the imm set to the function address, "bpftool prog dump xlated"
will be able to display the kernel function calls the same way as
it displays other bpf helper calls.

gpl_compatible program is required to call kernel function.

This feature currently requires JIT.

The verifier selftests are adjusted because of the changes in
the verbose log in add_subprog_and_kfunc().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015142.1544736-1-kafai@fb.com
2021-03-26 20:41:51 -07:00
Sean Christopherson 231d3dbdda x86/sgx: Add SGX_CHILD_PRESENT hardware error code
SGX driver can accurately track how enclave pages are used.  This
enables SECS to be specifically targeted and EREMOVE'd only after all
child pages have been EREMOVE'd.  This ensures that SGX driver will
never encounter SGX_CHILD_PRESENT in normal operation.

Virtual EPC is different.  The host does not track how EPC pages are
used by the guest, so it cannot guarantee EREMOVE success.  It might,
for instance, encounter a SECS with a non-zero child count.

Add a definition of SGX_CHILD_PRESENT.  It will be used exclusively by
the SGX virtualization driver to handle recoverable EREMOVE errors when
saniziting EPC pages after they are freed.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/050b198e882afde7e6eba8e6a0d4da39161dbb5a.1616136308.git.kai.huang@intel.com
2021-03-26 22:51:36 +01:00
Kai Huang b0c7459be0 x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
EREMOVE takes a page and removes any association between that page and
an enclave. It must be run on a page before it can be added into another
enclave. Currently, EREMOVE is run as part of pages being freed into the
SGX page allocator. It is not expected to fail, as it would indicate a
use-after-free of EPC pages. Rather than add the page back to the pool
of available EPC pages, the kernel intentionally leaks the page to avoid
additional errors in the future.

However, KVM does not track how guest pages are used, which means that
SGX virtualization use of EREMOVE might fail. Specifically, it is
legitimate that EREMOVE returns SGX_CHILD_PRESENT for EPC assigned to
KVM guest, because KVM/kernel doesn't track SECS pages.

To allow SGX/KVM to introduce a more permissive EREMOVE helper and
to let the SGX virtualization code use the allocator directly, break
out the EREMOVE call from the SGX page allocator. Rename the original
sgx_free_epc_page() to sgx_encl_free_epc_page(), indicating that
it is used to free an EPC page assigned to a host enclave. Replace
sgx_free_epc_page() with sgx_encl_free_epc_page() in all call sites so
there's no functional change.

At the same time, improve the error message when EREMOVE fails, and
add documentation to explain to the user what that failure means and
to suggest to the user what to do when this bug happens in the case it
happens.

 [ bp: Massage commit message, fix typos and sanitize text, simplify. ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20210325093057.122834-1-kai.huang@intel.com
2021-03-26 22:51:23 +01:00
Linus Torvalds 6c20f6df61 xen: branch for v5.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYF37OgAKCRCAXGG7T9hj
 vp8hAP4h7mvjfkntbFXagrJK9pi2xVC9d/YO5nfa8/K3LcGVnQD/fKcU9ggPN9vI
 GLnhyprGLcCA4aTL6Ogb37o9fDd4Yws=
 =joIg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.12b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
 "This contains a small series with a more elegant fix of a problem
  which was originally fixed in rc2"

* tag 'for-linus-5.12b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"
  xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on MEMORY_HOTPLUG
2021-03-26 11:15:25 -07:00
Nathan Chancellor d5cbd80e30 x86/boot: Add $(CLANG_FLAGS) to compressed KBUILD_CFLAGS
When cross compiling x86 on an ARM machine with clang, there are several
errors along the lines of:

  arch/x86/include/asm/string_64.h:27:10: error: invalid output constraint '=&c' in asm

This happens because the compressed boot Makefile reassigns KBUILD_CFLAGS
and drops the clang flags that set the target architecture ('--target=')
and the path to the GNU cross tools ('--prefix='), meaning that the host
architecture is targeted.

These flags are available as $(CLANG_FLAGS) from the main Makefile so
add them to the compressed boot folder's KBUILD_CFLAGS so that cross
compiling works as expected.

Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lkml.kernel.org/r/20210326000435.4785-3-nathan@kernel.org
2021-03-26 11:32:55 +01:00
John Millikin 8abe7fc26a x86/build: Propagate $(CLANG_FLAGS) to $(REALMODE_FLAGS)
When cross-compiling with Clang, the `$(CLANG_FLAGS)' variable
contains additional flags needed to build C and assembly sources
for the target platform. Normally this variable is automatically
included in `$(KBUILD_CFLAGS)' via the top-level Makefile.

The x86 real-mode makefile builds `$(REALMODE_CFLAGS)' from a
plain assignment and therefore drops the Clang flags. This causes
Clang to not recognize x86-specific assembler directives:

  arch/x86/realmode/rm/header.S:36:1: error: unknown directive
  .type real_mode_header STT_OBJECT ; .size real_mode_header, .-real_mode_header
  ^

Explicit propagation of `$(CLANG_FLAGS)' to `$(REALMODE_CFLAGS)',
which is inherited by real-mode make rules, fixes cross-compilation
with Clang for x86 targets.

Relevant flags:

* `--target' sets the target architecture when cross-compiling. This
  flag must be set for both compilation and assembly (`KBUILD_AFLAGS')
  to support architecture-specific assembler directives.

* `-no-integrated-as' tells clang to assemble with GNU Assembler
  instead of its built-in LLVM assembler. This flag is set by default
  unless `LLVM_IAS=1' is set, because the LLVM assembler can't yet
  parse certain GNU extensions.

Signed-off-by: John Millikin <john@john-millikin.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Link: https://lkml.kernel.org/r/20210326000435.4785-2-nathan@kernel.org
2021-03-26 11:32:47 +01:00
Sean Christopherson b8921dccf3 x86/cpufeatures: Add SGX1 and SGX2 sub-features
Add SGX1 and SGX2 feature flags, via CPUID.0x12.0x0.EAX, as scattered
features, since adding a new leaf for only two bits would be wasteful.
As part of virtualizing SGX, KVM will expose the SGX CPUID leafs to its
guest, and to do so correctly needs to query hardware and kernel support
for SGX1 and SGX2.

Suppress both SGX1 and SGX2 from /proc/cpuinfo. SGX1 basically means
SGX, and for SGX2 there is no concrete use case of using it in
/proc/cpuinfo.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/d787827dbfca6b3210ac3e432e3ac1202727e786.1616136308.git.kai.huang@intel.com
2021-03-25 17:33:11 +01:00
Kai Huang e9a15a40e8 x86/cpufeatures: Make SGX_LC feature bit depend on SGX bit
Move SGX_LC feature bit to CPUID dependency table to make clearing all
SGX feature bits easier. Also remove clear_sgx_caps() since it is just
a wrapper of setup_clear_cpu_cap(X86_FEATURE_SGX) now.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/5d4220fd0a39f52af024d3fa166231c1d498dd10.1616136308.git.kai.huang@intel.com
2021-03-25 17:33:11 +01:00
Masahiro Yamada 7dfe553aff x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL()
Building kernel/sys_ni.c with W=1 emits tons of -Wmissing-prototypes warnings:

  $ make W=1 kernel/sys_ni.o
    [ snip ]
    CC      kernel/sys_ni.o
     ./arch/x86/include/asm/syscall_wrapper.h:83:14: warning: no previous prototype for '__ia32_sys_io_setup' [-Wmissing-prototypes]
     ...

The problem is in __COND_SYSCALL(), the __SYS_STUB0() and __SYS_STUBx() macros
defined a few lines above already have forward declarations.

Let's do likewise for __COND_SYSCALL() to fix the warnings.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Mickaël Salaün <mic@linux.microsoft.com>
Link: https://lore.kernel.org/r/20210301131533.64671-2-masahiroy@kernel.org
2021-03-25 16:20:41 +01:00
Wei Yongjun 2304d14db6 x86/kprobes: Move 'inline' to the beginning of the kprobe_is_ss() declaration
Address this GCC warning:

  arch/x86/kernel/kprobes/core.c:940:1:
   warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
    940 | static int nokprobe_inline kprobe_is_ss(struct kprobe_ctlblk *kcb)
        | ^~~~~~

[ mingo: Tidied up the changelog. ]

Fixes: 6256e668b7af: ("x86/kprobes: Use int3 instead of debug trap for single-step")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210324144502.1154883-1-weiyongjun1@huawei.com
2021-03-25 13:07:58 +01:00
Masami Hiramatsu 2f706e0e5e x86/kprobes: Fix to identify indirect jmp and others using range case
Fix can_boost() to identify indirect jmp and others using range case
correctly.

Since the condition in switch statement is opcode & 0xf0, it can not
evaluate to 0xff case. This should be under the 0xf0 case. However,
there is no reason to use the conbinations of the bit-masked condition
and lower bit checking.

Use range case to clean up the switch statement too.

Fixes: 6256e668b7 ("x86/kprobes: Use int3 instead of debug trap for single-step")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/161666692308.1120877.4675552834049546493.stgit@devnote2
2021-03-25 11:37:22 +01:00
Masami Hiramatsu 6dd3b8c9f5 x86/kprobes: Fix to check non boostable prefixes correctly
There are 2 bugs in the can_boost() function because of using
x86 insn decoder. Since the insn->opcode never has a prefix byte,
it can not find CS override prefix in it. And the insn->attr is
the attribute of the opcode, thus inat_is_address_size_prefix(
insn->attr) always returns false.

Fix those by checking each prefix bytes with for_each_insn_prefix
loop and getting the correct attribute for each prefix byte.
Also, this removes unlikely, because this is a slow path.

Fixes: a8d11cd071 ("kprobes/x86: Consolidate insn decoder users for copying code")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/161666691162.1120877.2808435205294352583.stgit@devnote2
2021-03-25 11:37:21 +01:00
Ira Weiny 633b0616cf x86/sgx: Remove unnecessary kmap() from sgx_ioc_enclave_init()
kmap() is inefficient and is being replaced by kmap_local_page(), if
possible. There is no readily apparent reason why initp_page needs to be
allocated and kmap'ed() except that 'sigstruct' needs to be page-aligned
and 'token' 512 byte-aligned.

Rather than change it to kmap_local_page(), use kmalloc() instead
because kmalloc() can give this alignment when allocating PAGE_SIZE
bytes.

Remove the alloc_page()/kmap() and replace with kmalloc(PAGE_SIZE, ...)
to get a page aligned kernel address.

In addition, add a comment to document the alignment requirements so that
others don't attempt to 'fix' this again.

 [ bp: Massage commit message. ]

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210324182246.2484875-1-ira.weiny@intel.com
2021-03-25 09:50:32 +01:00
Linus Torvalds e138138003 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Various fixes, all over:

   1) Fix overflow in ptp_qoriq_adjfine(), from Yangbo Lu.

   2) Always store the rx queue mapping in veth, from Maciej
      Fijalkowski.

   3) Don't allow vmlinux btf in map_create, from Alexei Starovoitov.

   4) Fix memory leak in octeontx2-af from Colin Ian King.

   5) Use kvalloc in bpf x86 JIT for storing jit'd addresses, from
      Yonghong Song.

   6) Fix tx ptp stats in mlx5, from Aya Levin.

   7) Check correct ip version in tun decap, fropm Roi Dayan.

   8) Fix rate calculation in mlx5 E-Switch code, from arav Pandit.

   9) Work item memork leak in mlx5, from Shay Drory.

  10) Fix ip6ip6 tunnel crash with bpf, from Daniel Borkmann.

  11) Lack of preemptrion awareness in macvlan, from Eric Dumazet.

  12) Fix data race in pxa168_eth, from Pavel Andrianov.

  13) Range validate stab in red_check_params(), from Eric Dumazet.

  14) Inherit vlan filtering setting properly in b53 driver, from
      Florian Fainelli.

  15) Fix rtnl locking in igc driver, from Sasha Neftin.

  16) Pause handling fixes in igc driver, from Muhammad Husaini
      Zulkifli.

  17) Missing rtnl locking in e1000_reset_task, from Vitaly Lifshits.

  18) Use after free in qlcnic, from Lv Yunlong.

  19) fix crash in fritzpci mISDN, from Tong Zhang.

  20) Premature rx buffer reuse in igb, from Li RongQing.

  21) Missing termination of ip[a driver message handler arrays, from
      Alex Elder.

  22) Fix race between "x25_close" and "x25_xmit"/"x25_rx" in hdlc_x25
      driver, from Xie He.

  23) Use after free in c_can_pci_remove(), from Tong Zhang.

  24) Uninitialized variable use in nl80211, from Jarod Wilson.

  25) Off by one size calc in bpf verifier, from Piotr Krysiuk.

  26) Use delayed work instead of deferrable for flowtable GC, from
      Yinjun Zhang.

  27) Fix infinite loop in NPC unmap of octeontx2 driver, from
      Hariprasad Kelam.

  28) Fix being unable to change MTU of dwmac-sun8i devices due to lack
      of fifo sizes, from Corentin Labbe.

  29) DMA use after free in r8169 with WoL, fom Heiner Kallweit.

  30) Mismatched prototypes in isdn-capi, from Arnd Bergmann.

  31) Fix psample UAPI breakage, from Ido Schimmel"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (171 commits)
  psample: Fix user API breakage
  math: Export mul_u64_u64_div_u64
  ch_ktls: fix enum-conversion warning
  octeontx2-af: Fix memory leak of object buf
  ptp_qoriq: fix overflow in ptp_qoriq_adjfine() u64 calcalation
  net: bridge: don't notify switchdev for local FDB addresses
  net/sched: act_ct: clear post_ct if doing ct_clear
  net: dsa: don't assign an error value to tag_ops
  isdn: capi: fix mismatched prototypes
  net/mlx5: SF, do not use ecpu bit for vhca state processing
  net/mlx5e: Fix division by 0 in mlx5e_select_queue
  net/mlx5e: Fix error path for ethtool set-priv-flag
  net/mlx5e: Offload tuple rewrite for non-CT flows
  net/mlx5e: Allow to match on MPLS parameters only for MPLS over UDP
  net/mlx5: Add back multicast stats for uplink representor
  net: ipconfig: ic_dev can be NULL in ic_close_devs
  MAINTAINERS: Combine "QLOGIC QLGE 10Gb ETHERNET DRIVER" sections into one
  docs: networking: Fix a typo
  r8169: fix DMA being used after buffer free if WoL is enabled
  net: ipa: fix init header command validation
  ...
2021-03-24 18:16:04 -07:00
Roger Pau Monne af44a387e7 Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"
This partially reverts commit 882213990d ("xen: fix p2m size in dom0
for disabled memory hotplug case")

There's no need to special case XEN_UNPOPULATED_ALLOC anymore in order
to correctly size the p2m. The generic memory hotplug option has
already been tied together with the Xen hotplug limit, so enabling
memory hotplug should already trigger a properly sized p2m on Xen PV.

Note that XEN_UNPOPULATED_ALLOC depends on ZONE_DEVICE which pulls in
MEMORY_HOTPLUG.

Leave the check added to __set_phys_to_machine and the adjusted
comment about EXTRA_MEM_RATIO.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210324122424.58685-3-roger.pau@citrix.com

[boris: fixed formatting issues]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2021-03-24 18:33:36 -05:00
Roger Pau Monne 2b514ec727 xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on MEMORY_HOTPLUG
The Xen memory hotplug limit should depend on the memory hotplug
generic option, rather than the Xen balloon configuration. It's
possible to have a kernel with generic memory hotplug enabled, but
without Xen balloon enabled, at which point memory hotplug won't work
correctly due to the size limitation of the p2m.

Rename the option to XEN_MEMORY_HOTPLUG_LIMIT since it's no longer
tied to ballooning.

Fixes: 9e2369c06c ("xen: add helpers to allocate unpopulated memory")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210324122424.58685-2-roger.pau@citrix.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2021-03-24 18:33:11 -05:00
Sunil Muthuswamy 6dc2a774cb x86/Hyper-V: Support for free page reporting
Linux has support for free page reporting now (36e66c554b) for
virtualized environment. On Hyper-V when virtually backed VMs are
configured, Hyper-V will advertise cold memory discard capability,
when supported. This patch adds the support to hook into the free
page reporting infrastructure and leverage the Hyper-V cold memory
discard hint hypercall to report/free these pages back to the host.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Tested-by: Matheus Castello <matheus@castello.eng.br>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/SN4PR2101MB0880121FA4E2FEC67F35C1DCC0649@SN4PR2101MB0880.namprd21.prod.outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-03-24 11:35:24 +00:00
Xu Yihang 1b60280834 x86/hyperv: Fix unused variable 'hi' warning in hv_apic_read
Fixes the following W=1 kernel build warning(s):
arch/x86/hyperv/hv_apic.c:58:15: warning: variable ‘hi’ set but not used [-Wunused-but-set-variable]

Compiled with CONFIG_HYPERV enabled:
make allmodconfig ARCH=x86_64 CROSS_COMPILE=x86_64-linux-gnu-
make W=1 arch/x86/hyperv/hv_apic.o ARCH=x86_64 CROSS_COMPILE=x86_64-linux-gnu-

HV_X64_MSR_EOI occupies bit 31:0 and HV_X64_MSR_TPR occupies bit 7:0,
which means the higher 32 bits are not really used. Cast the variable hi
to void to silence this warning.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Xu Yihang <xuyihang@huawei.com>
Link: https://lore.kernel.org/r/20210323025013.191533-1-xuyihang@huawei.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-03-24 11:32:56 +00:00
Xu Yihang 13c4d4626a x86/hyperv: Fix unused variable 'msr_val' warning in hv_qlock_wait
Fixes the following W=1 kernel build warning(s):
arch/x86/hyperv/hv_spinlock.c:28:16: warning: variable ‘msr_val’ set but not used [-Wunused-but-set-variable]
  unsigned long msr_val;

As Hypervisor Top-Level Functional Specification states in chapter 7.5
Virtual Processor Idle Sleep State, "A partition which possesses the
AccessGuestIdleMsr privilege (refer to section 4.2.2) may trigger entry
into the virtual processor idle sleep state through a read to the
hypervisor-defined MSR HV_X64_MSR_GUEST_IDLE".

That means only a read of the MSR is necessary. The returned value
msr_val is not used. Cast it to void to silence this warning.

Reference:
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Xu Yihang <xuyihang@huawei.com>
Link: https://lore.kernel.org/r/20210323024302.174434-1-xuyihang@huawei.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2021-03-24 11:31:04 +00:00
Borislav Petkov 2ffdc2c344 x86/mce/inject: Add IPID for injection too
Add an injection file in order to specify the IPID too when injecting
an error. One use case example is using the machinery to decode MCEs
collected from other machines.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210314201806.12798-1-bp@alien8.de
2021-03-24 00:04:45 +01:00
Mike Rapoport 4c674481dc x86/setup: Merge several reservations of start of memory
Currently, the first several pages are reserved both to avoid leaking
their contents on systems with L1TF and to avoid corrupting BIOS memory.

Merge the two memory reservations.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210302100406.22059-3-rppt@kernel.org
2021-03-23 17:17:36 +01:00
Mike Rapoport a799c2bd29 x86/setup: Consolidate early memory reservations
The early reservations of memory areas used by the firmware, bootloader,
kernel text and data are spread over setup_arch(). Moreover, some of them
happen *after* memblock allocations, e.g trim_platform_memory_ranges() and
trim_low_memory_range() are called after reserve_real_mode() that allocates
memory.

There was no corruption of these memory regions because memblock always
allocates memory either from the end of memory (in top-down mode) or above
the kernel image (in bottom-up mode). However, the bottom up mode is going
to be updated to span the entire memory [1] to avoid limitations caused by
KASLR.

Consolidate early memory reservations in a dedicated function to improve
robustness against future changes. Having the early reservations in one
place also makes it clearer what memory must be reserved before memblock
allocations are allowed.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Link: [1] https://lore.kernel.org/lkml/20201217201214.3414100-2-guro@fb.com
Link: https://lkml.kernel.org/r/20210302100406.22059-2-rppt@kernel.org
2021-03-23 17:13:17 +01:00
Arnd Bergmann 9fcb51c14d x86/build: Turn off -fcf-protection for realmode targets
The new Ubuntu GCC packages turn on -fcf-protection globally,
which causes a build failure in the x86 realmode code:

  cc1: error: ‘-fcf-protection’ is not compatible with this target

Turn it off explicitly on compilers that understand this option.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210323124846.1584944-1-arnd@kernel.org
2021-03-23 16:36:01 +01:00
Masami Hiramatsu 6256e668b7 x86/kprobes: Use int3 instead of debug trap for single-step
Use int3 instead of debug trap exception for single-stepping the
probed instructions. Some instructions which change the ip
registers or modify IF flags are emulated because those are not
able to be single-stepped by int3 or may allow the interrupt
while single-stepping.

This actually changes the kprobes behavior.

- kprobes can not probe following instructions; int3, iret,
  far jmp/call which get absolute address as immediate,
  indirect far jmp/call, indirect near jmp/call with addressing
  by memory (register-based indirect jmp/call are OK), and
  vmcall/vmlaunch/vmresume/vmxoff.

- If the kprobe post_handler doesn't set before registering,
  it may not be called in some case even if you set it afterwards.
  (IOW, kprobe booster is enabled at registration, user can not
   change it)

But both are rare issue, unsupported instructions will not be
used in the kernel (or rarely used), and post_handlers are
rarely used (I don't see it except for the test code).

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/161469874601.49483.11985325887166921076.stgit@devnote2
2021-03-23 16:07:56 +01:00
Masami Hiramatsu a194acd316 x86/kprobes: Identify far indirect JMP correctly
Since Grp5 far indirect JMP is FF "mod 101 r/m", it should be
(modrm & 0x38) == 0x28, and near indirect JMP is also 0x38 == 0x20.
So we can mask modrm with 0x30 and check 0x20.
This is actually what the original code does, it also doesn't care
the last bit. So the result code is same.

Thus, I think this is just a cosmetic cleanup.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/161469873475.49483.13257083019966335137.stgit@devnote2
2021-03-23 16:07:56 +01:00
Masami Hiramatsu d60ad3d46f x86/kprobes: Retrieve correct opcode for group instruction
Since the opcodes start from 0xff are group5 instruction group which is
not 2 bytes opcode but the extended opcode determined by the MOD/RM byte.

The commit abd82e533d ("x86/kprobes: Do not decode opcode in resume_execution()")
used insn->opcode.bytes[1], but that is not correct. We have to refer
the insn->modrm.bytes[1] instead.

Fixes: abd82e533d ("x86/kprobes: Do not decode opcode in resume_execution()")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/161469872400.49483.18214724458034233166.stgit@devnote2
2021-03-23 16:07:55 +01:00
Isaku Yamahata 8249d17d31 x86/mem_encrypt: Correct physical address calculation in __set_clr_pte_enc()
The pfn variable contains the page frame number as returned by the
pXX_pfn() functions, shifted to the right by PAGE_SHIFT to remove the
page bits. After page protection computations are done to it, it gets
shifted back to the physical address using page_level_shift().

That is wrong, of course, because that function determines the shift
length based on the level of the page in the page table but in all the
cases, it was shifted by PAGE_SHIFT before.

Therefore, shift it back using PAGE_SHIFT to get the correct physical
address.

 [ bp: Rewrite commit message. ]

Fixes: dfaaec9033 ("x86: Add support for changing memory encryption attribute in early boot")
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/81abbae1657053eccc535c16151f63cd049dcb97.1616098294.git.isaku.yamahata@intel.com
2021-03-23 11:59:45 +01:00
Arnd Bergmann e14cfb3bdd x86/boot/compressed: Avoid gcc-11 -Wstringop-overread warning
GCC gets confused by the comparison of a pointer to an integer literal,
with the assumption that this is an offset from a NULL pointer and that
dereferencing it is invalid:

  In file included from arch/x86/boot/compressed/misc.c:18:
  In function ‘parse_elf’,
      inlined from ‘extract_kernel’ at arch/x86/boot/compressed/misc.c:442:2:
  arch/x86/boot/compressed/../string.h:15:23: error: ‘__builtin_memcpy’ reading 64 bytes from a region of size 0 [-Werror=stringop-overread]
     15 | #define memcpy(d,s,l) __builtin_memcpy(d,s,l)
        |                       ^~~~~~~~~~~~~~~~~~~~~~~
  arch/x86/boot/compressed/misc.c:283:9: note: in expansion of macro ‘memcpy’
    283 |         memcpy(&ehdr, output, sizeof(ehdr));
        |         ^~~~~~

I could not find any good workaround for this, but as this is only
a warning for a failure during early boot, removing the line entirely
works around the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Sebor <msebor@gmail.com>
Link: https://lore.kernel.org/r/20210322160253.4032422-2-arnd@kernel.org
2021-03-23 00:16:25 +01:00
Arnd Bergmann cdc34cb8f2 x86/boot/tboot: Avoid Wstringop-overread-warning
gcc-11 warns about using string operations on pointers that are
defined at compile time as offsets from a NULL pointer. Unfortunately
that also happens on the result of fix_to_virt(), which is a
compile-time constant for a constant input:

  arch/x86/kernel/tboot.c: In function 'tboot_probe':
  arch/x86/kernel/tboot.c:70:13: error: '__builtin_memcmp_eq' specified bound 16 exceeds source size 0 [-Werror=stringop-overread]
     70 |         if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) {
        |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I hope this can get addressed in gcc-11 before the release.

As a workaround, split up the tboot_probe() function in two halves
to separate the pointer generation from the usage. This is a bit
ugly, and hopefully gcc understands that the code is actually correct
before it learns to peek into the noinline function.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Sebor <msebor@gmail.com>
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578
Link: https://lore.kernel.org/r/20210322160253.4032422-3-arnd@kernel.org
2021-03-23 00:16:13 +01:00
Arnd Bergmann 279d56abc6 x86/fpu/math-emu: Fix function cast warning
Building with 'make W=1', gcc points out that casting between
incompatible function types can be dangerous:

  arch/x86/math-emu/fpu_trig.c:1638:60: error: cast between incompatible function types from ‘int (*)(FPU_REG *, u_char)’ {aka ‘int (*)(struct fpu__reg *, unsigned char)’} to ‘void (*)(FPU_REG *, u_char)’ {aka ‘void (*)(struct fpu__reg *, unsigned char)’} [-Werror=cast-function-type]
   1638 |         fprem, fyl2xp1, fsqrt_, fsincos, frndint_, fscale, (FUNC_ST0) fsin, fcos
        |                                                            ^

This one seems harmless, but it is easy enough to work around it by
adding an intermediate function that adjusts the return type.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210322214824.974323-1-arnd@kernel.org
2021-03-23 00:08:02 +01:00
Otavio Pontes 7189b3c119 x86/microcode: Check for offline CPUs before requesting new microcode
Currently, the late microcode loading mechanism checks whether any CPUs
are offlined, and, in such a case, aborts the load attempt.

However, this must be done before the kernel caches new microcode from
the filesystem. Otherwise, when offlined CPUs are onlined later, those
cores are going to be updated through the CPU hotplug notifier callback
with the new microcode, while CPUs previously onine will continue to run
with the older microcode.

For example:

Turn off one core (2 threads):

  echo 0 > /sys/devices/system/cpu/cpu3/online
  echo 0 > /sys/devices/system/cpu/cpu1/online

Install the ucode fails because a primary SMT thread is offline:

  cp intel-ucode/06-8e-09 /lib/firmware/intel-ucode/
  echo 1 > /sys/devices/system/cpu/microcode/reload
  bash: echo: write error: Invalid argument

Turn the core back on

  echo 1 > /sys/devices/system/cpu/cpu3/online
  echo 1 > /sys/devices/system/cpu/cpu1/online
  cat /proc/cpuinfo |grep microcode
  microcode : 0x30
  microcode : 0xde
  microcode : 0x30
  microcode : 0xde

The rationale for why the update is aborted when at least one primary
thread is offline is because even if that thread is soft-offlined
and idle, it will still have to participate in broadcasted MCE's
synchronization dance or enter SMM, and in both examples it will execute
instructions so it better have the same microcode revision as the other
cores.

 [ bp: Heavily edit and extend commit message with the reasoning behind all
   this. ]

Fixes: 30ec26da99 ("x86/microcode: Do not upload microcode if CPUs are offline")
Signed-off-by: Otavio Pontes <otavio.pontes@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/20210319165515.9240-2-otavio.pontes@intel.com
2021-03-22 22:29:40 +01:00
Arnd Bergmann 396a66aa11 x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes
gcc-11 warns about mismatched prototypes here:

  arch/x86/lib/msr-smp.c:255:51: error: argument 2 of type ‘u32 *’ {aka ‘unsigned int *’} declared as a pointer [-Werror=array-parameter=]
    255 | int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 *regs)
        |                                              ~~~~~^~~~
  arch/x86/include/asm/msr.h:347:50: note: previously declared as an array ‘u32[8]’ {aka ‘unsigned int[8]’}

GCC is right here - fix up the types.

[ mingo: Twiddled the changelog. ]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210322164541.912261-1-arnd@kernel.org
2021-03-22 21:37:03 +01:00
Ingo Molnar 163b099146 x86: Fix various typos in comments, take #2
Fix another ~42 single-word typos in arch/x86/ code comments,
missed a few in the first pass, in particular in .S files.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
2021-03-21 23:50:28 +01:00
Ingo Molnar c681df88dc x86: Remove unusual Unicode characters from comments
We've accumulated a few unusual Unicode characters in arch/x86/
over the years, substitute them with their proper ASCII equivalents.

A few of them were a whitespace equivalent: ' ' - the use was harmless.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
2021-03-21 23:50:07 +01:00
Ingo Molnar ca8778c45e Merge branch 'linus' into x86/cleanups, to resolve conflict
Conflicts:
	arch/x86/kernel/kprobes/ftrace.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-21 22:16:08 +01:00
Linus Torvalds 1c74516c2d Boundary condition fixes for bugs unearthed by the perf fuzzer.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmBXJeQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iE/BAAsawZZH9GFsnwv7HraBl0jKvftp3/xPh6
 WL/RKGhGfu3f9MOrcM+dflggJEfnvz6/Tfm7/XKWHlIW3nHrQcn+lQtddoTwb2wp
 CmpAUYGtGWr7tr/B5vQcIg+yyYsVGtfyEmVro+TfzYCl/e21zATqEKtgSGclCcXg
 g0u5ZJsL8AOPSk2cR/ABrpI0MUlKHjUSJJ3V9j69OqSLhfc+GCn6ifTC1XK05MyR
 JX1kNaVTpVSGk650+oCUOP2rNaSk/G2wVZtp/LB9O1N0b9Zot2hQYbx1cEGFRNOy
 Q2FeMcw3V2t26Xk2q9AFGlOS0IeasO/NKK/urotRS2/rXdcr8QMUHTZdmr85UVQJ
 oohM+/DqoCAY5TeC4+d+tL5i+DLVGkrdbHX8IKkzYmejhE9DMQ5+a16O7ZcGoVv4
 oFG8RYHsUHPjEqPgC9vxS8Iy3n2yk34TIKQg/DJBdNhkQPnNup/zAInCEs6WqWN7
 OZulpWGK2yEV3mJpX2ayAMxym3hGAk/pBGAEcFI1DTXVBlGlOTvr6J0S3O54efTH
 +hrx+V+bYKHZPk3gK9mjN8rzC/u2pFbFZpf0cC2+G9XhFctmx8sIiWZ8kHrftp8b
 OaKXxo9lhfZhDuBj0zl3Yz59bFzS5VKPrHCGJ43pZRsxZsv8PXJnhmAsKT197fv+
 xQOrdN+L7Cc=
 =bCUA
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
 "Boundary condition fixes for bugs unearthed by the perf fuzzer"

* tag 'perf-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix unchecked MSR access error caused by VLBR_EVENT
  perf/x86/intel: Fix a crash caused by zero PEBS status
2021-03-21 11:26:21 -07:00
Linus Torvalds 5e3ddf96e7 - Add the arch-specific mapping between physical and logical CPUs to fix
devicetree-node lookups.
 
 - Restore the IRQ2 ignore logic
 
 - Fix get_nr_restart_syscall() to return the correct restart syscall number.
 Split in a 4-patches set to avoid kABI breakage when backporting to dead
 kernels.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmBXJu0ACgkQEsHwGGHe
 VUrCkQ/9Et5W76HMQfHccluks2i2yNXgd7nROhIt0iMS1Ph86AWYJZmMZ2dbaqW8
 nORU20ziHme+9PScmcJb2LdJxIRDtYNs1J811IYeKNpvj8KHXtV2VYCVG9UcL21E
 FmUlZf5oINiDMzu3q4SuqHw9t7X6RCItolQIRmQHDXqPraFhBxji2VOFXDIg+qhf
 a4sBz6UfxA4a/b7d/KxHxNvuQE5Cluc9gninhtaYh1b7OQZJX4+vTa3W5V4kK0df
 ohOH5pnJp9V7qH2CmB3UcGWJTxHeLbm4E0KYkyasnKG9M0KmIvJ6jNARlRAo3hAF
 hn9D4xLtsnIWjtO6xEVdF7kSizkYZRPay5kX88quvlSa0FkkPnsUvFtW79Yi3ZNy
 vL2NAu2biqNQyo7ZWVffJns2DrJwYZ6KOGA6oUBwTUBfieF9KMdDew8IXRUMYNdO
 LzW87Irf9eZj9c+b7Rtr0VofmKgRYwy1Lo8eVT+VGkV+nOTOB9rlAll2lYBq3aNA
 W6ei0S5/1zaRF5aU6Qmnap4eb1X/tp845q6CPYa9kIsZwVyGFOa7iLeYcNn9qHdB
 G6RW6CUh97A7wwxUYt5VGUscjYV2V9Ycv9HvIwrG/T7aezWnhI9ODtggzDgCnbls
 og6N/+heLZ9G/DyxAEmHuazV2ItDPJq69gag/POHhXJaSUGbdbA=
 =WfC4
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "The freshest pile of shiny x86 fixes for 5.12:

   - Add the arch-specific mapping between physical and logical CPUs to
     fix devicetree-node lookups

   - Restore the IRQ2 ignore logic

   - Fix get_nr_restart_syscall() to return the correct restart syscall
     number. Split in a 4-patches set to avoid kABI breakage when
     backporting to dead kernels"

* tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic/of: Fix CPU devicetree-node lookups
  x86/ioapic: Ignore IRQ2 again
  x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTART
  x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()
  x86: Move TS_COMPAT back to asm/thread_info.h
  kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()
2021-03-21 11:04:20 -07:00
Linus Torvalds 812da4d394 RISC-V Fixes for 5.12-rc4
I have handful of fixes for 5.12:
 
 * A fix to the SBI remote fence numbers for hypervisor fences, which had
   been transcribed in the wrong order in Linux.  These fences are only
   used with the KVM patches applied.
 * A whole host of build warnings have been fixed, these should have no
   functional change.
 * A fix to init_resources() that prevents an off-by-one error from
   causing an out-of-bounds array reference.  This is manifesting during
   boot on vexriscv.
 * A fix to ensure the KASAN mappings are visible before proceeding to
   use them.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmBVgV4THHBhbG1lckBk
 YWJiZWx0LmNvbQAKCRAuExnzX7sYiTOWD/4l+uRCwTelZqm/G0yKSSAevAv5Crsc
 Nzsa1uq7dOC+JLZ5y96SUng825WdGX+HiIf7QyUFPzpnqyYc4+ROwNb80ObPWQZU
 dctatP2g9Jk2ImmJbGQVeDXKAiqrMM3hf1bOF3N3VV9DpqID0z/S8l8H9mz7x9yl
 opd6kXxCPFKLgmAbMxcsytUduxZrJEcCpy3jPpIvjJ3BrzaGZlgjytqc2tYvbv/L
 9i//evmGTCNXfQPrWEcMpBPbMf+aSzb/9Im8THB42jpJVQ7kx3txVg6d+wb73oGf
 XHkm5mwrESAcnVGfxY5xRaaSK/L2k5Lg98J1K/BIHIKskjCTg5FdyrgeGwdtLg6T
 FuXEvK29FJgfMb7k2Mf25l/Lglzi4q4LxBO4wcAUb1OpaVeK2kgYJr1eniSKrE/v
 NF5/bD9h7sD1qbZLfk+lsTggBGfMBmthwp59jNb7V4cLkIFXwopgx2h/73jm6kn8
 8fMCTlwOoktewbv0DdWCy0Sfaa0iCXMSJy+Y13GWlcEMvQn1VLtX7RbQzZq9X+tV
 C/qkp1SdXfPG3vJbkNnZh/eS12F6vDauYJ814s3VAeJKOoMJWABB6Jm2SoBwFM6v
 kpIRNzDyJ1oKhF4PxIrmGkv6PvRM/j5akspOwy/zdHB3FBVCGmyuoB9GE8Bg1Rw7
 xyfdZthPDdvGyQ==
 =XhDE
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:
 "A handful of fixes for 5.12:

   - fix the SBI remote fence numbers for hypervisor fences, which had
     been transcribed in the wrong order in Linux. These fences are only
     used with the KVM patches applied.

   - fix a whole host of build warnings, these should have no functional
     change.

   - fix init_resources() to prevent an off-by-one error from causing an
     out-of-bounds array reference. This was manifesting during boot on
     vexriscv.

   - ensure the KASAN mappings are visible before proceeding to use
     them"

* tag 'riscv-for-linus-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Correct SPARSEMEM configuration
  RISC-V: kasan: Declare kasan_shallow_populate() static
  riscv: Ensure page table writes are flushed when initializing KASAN vmalloc
  RISC-V: Fix out-of-bounds accesses in init_resources()
  riscv: Fix compilation error with Canaan SoC
  ftrace: Fix spelling mistake "disabed" -> "disabled"
  riscv: fix bugon.cocci warnings
  riscv: process: Fix no prototype for arch_dup_task_struct
  riscv: ftrace: Use ftrace_get_regs helper
  riscv: process: Fix no prototype for show_regs
  riscv: syscall_table: Reduce W=1 compilation warnings noise
  riscv: time: Fix no prototype for time_init
  riscv: ptrace: Fix no prototype warnings
  riscv: sbi: Fix comment of __sbi_set_timer_v01
  riscv: irq: Fix no prototype warning
  riscv: traps: Fix no prototype warnings
  RISC-V: correct enum sbi_ext_rfence_fid
2021-03-20 11:01:54 -07:00
Tony Luck a331f5fdd3 x86/mce: Add Xeon Sapphire Rapids to list of CPUs that support PPIN
New CPU model, same MSRs to control and read the inventory number.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210319173919.291428-1-tony.luck@intel.com
2021-03-20 12:12:10 +01:00
Stanislav Fomichev b908297047 bpf: Use NOP_ATOMIC5 instead of emit_nops(&prog, 5) for BPF_TRAMP_F_CALL_ORIG
__bpf_arch_text_poke does rewrite only for atomic nop5, emit_nops(xxx, 5)
emits non-atomic one which breaks fentry/fexit with k8 atomics:

P6_NOP5 == P6_NOP5_ATOMIC (0f1f440000 == 0f1f440000)
K8_NOP5 != K8_NOP5_ATOMIC (6666906690 != 6666666690)

Can be reproduced by doing "ideal_nops = k8_nops" in "arch_init_ideal_nops()
and running fexit_bpf2bpf selftest.

Fixes: e21aa34178 ("bpf: Fix fexit trampoline.")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210320000001.915366-1-sdf@google.com
2021-03-19 19:25:39 -07:00
Johan Hovold dd926880da x86/apic/of: Fix CPU devicetree-node lookups
Architectures that describe the CPU topology in devicetree and do not have
an identity mapping between physical and logical CPU ids must override the
default implementation of arch_match_cpu_phys_id().

Failing to do so breaks CPU devicetree-node lookups using of_get_cpu_node()
and of_cpu_device_node_get() which several drivers rely on. It also causes
the CPU struct devices exported through sysfs to point to the wrong
devicetree nodes.

On x86, CPUs are described in devicetree using their APIC ids and those
do not generally coincide with the logical ids, even if CPU0 typically
uses APIC id 0.

Add the missing implementation of arch_match_cpu_phys_id() so that CPU-node
lookups work also with SMP.

Apart from fixing the broken sysfs devicetree-node links this likely does
not affect current users of mainline kernels on x86.

Fixes: 4e07db9c8d ("x86/devicetree: Use CPU description from Device Tree")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210312092033.26317-1-johan@kernel.org
2021-03-19 23:01:49 +01:00
Linus Torvalds ecd8ee7f9c x86:
* new selftests
 * fixes for migration with HyperV re-enlightenment enabled
 * fix RCU/SRCU usage
 * fixes for local_irq_restore misuse false positive
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmBUpO8UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPj6Af+LSkDniR08Eh/x4GHdX+ZSA9EhNuP
 PMqL+nDYvLXqc0XaErbZQpQbSP4aK7Tjly0LguZmNkBk17pnbjLb5Vv9hqJ30pM/
 pI8bGgdh+KDO9LClfrgsaYgC+B4R+fwqqTIvtBYMilVZ96JwixFiODB4ntRQmZgd
 xJS99jwjD8TO9pTYskKPf8y8yv5W9RH+wVQGXwc+T/sSzK/rcL4Jwt/ibO2FLcJK
 gBRXJDVjMIlpxPrqqoejVB2FHQQe36Bns85QU3dz0QuXfDuuEvbShY/f4R1z32fT
 RaccrvdMQtvgwS0l9Ij06PT0BdiG0EdZv/gOBUq5gVgx4XZyJTleJaVURw==
 =WZP4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Fixes for kvm on x86:

   - new selftests

   - fixes for migration with HyperV re-enlightenment enabled

   - fix RCU/SRCU usage

   - fixes for local_irq_restore misuse false positive"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  documentation/kvm: additional explanations on KVM_SET_BOOT_CPU_ID
  x86/kvm: Fix broken irq restoration in kvm_wait
  KVM: X86: Fix missing local pCPU when executing wbinvd on all dirty pCPUs
  KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish
  selftests: kvm: add set_boot_cpu_id test
  selftests: kvm: add _vm_ioctl
  selftests: kvm: add get_msr_index_features
  selftests: kvm: Add basic Hyper-V clocksources tests
  KVM: x86: hyper-v: Don't touch TSC page values when guest opted for re-enlightenment
  KVM: x86: hyper-v: Track Hyper-V TSC page status
  KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs
  KVM: x86: hyper-v: Limit guest to writing zero to HV_X64_MSR_TSC_EMULATION_STATUS
  KVM: x86/mmu: Store the address space ID in the TDP iterator
  KVM: x86/mmu: Factor out tdp_iter_return_to_root
  KVM: x86/mmu: Fix RCU usage when atomically zapping SPTEs
  KVM: x86/mmu: Fix RCU usage in handle_removed_tdp_mmu_page
2021-03-19 14:10:07 -07:00
Jarkko Sakkinen 901ddbb9ec x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()
Background
==========

SGX enclave memory is enumerated by the processor in contiguous physical
ranges called Enclave Page Cache (EPC) sections.  Currently, there is a
free list per section, but allocations simply target the lowest-numbered
sections.  This is functional, but has no NUMA awareness.

Fortunately, EPC sections are covered by entries in the ACPI SRAT table.
These entries allow each EPC section to be associated with a NUMA node,
just like normal RAM.

Solution
========

Implement a NUMA-aware enclave page allocator.  Mirror the buddy allocator
and maintain a list of enclave pages for each NUMA node.  Attempt to
allocate enclave memory first from local nodes, then fall back to other
nodes.

Note that the fallback is not as sophisticated as the buddy allocator
and is itself not aware of NUMA distances.  When a node's free list is
empty, it searches for the next-highest node with enclave pages (and
will wrap if necessary).  This could be improved in the future.

Other
=====

NUMA_KEEP_MEMINFO dependency is required for phys_to_target_node().

 [ Kai Huang: Do not return NULL from __sgx_alloc_epc_page() because
   callers do not expect that and that leads to a NULL ptr deref. ]

 [ dhansen: Fix an uninitialized 'nid' variable in
   __sgx_alloc_epc_page() as

   Reported-by: kernel test robot <lkp@intel.com>

   to avoid any potential allocations from the wrong NUMA node or even
   premature allocation failures. ]

Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/lkml/158188326978.894464.217282995221175417.stgit@dwillia2-desk3.amr.corp.intel.com/
Link: https://lkml.kernel.org/r/20210319040602.178558-1-kai.huang@intel.com
Link: https://lkml.kernel.org/r/20210318214933.29341-1-dave.hansen@intel.com
Link: https://lkml.kernel.org/r/20210317235332.362001-2-jarkko.sakkinen@intel.com
2021-03-19 19:16:51 +01:00
Joerg Roedel 799de1baaf x86/sev-es: Optimize __sev_es_ist_enter() for better readability
Reorganize the code and improve the comments to make the function more
readable and easier to understand.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210303141716.29223-4-joro@8bytes.org
2021-03-19 13:37:22 +01:00
Jiapeng Chong 21d6a7dcbf x86/kaslr: Return boolean values from a function returning bool
Fix the following coccicheck warnings:

  ./arch/x86/boot/compressed/kaslr.c:642:10-11: WARNING: return of 0/1 in
  function 'process_mem_region' with return type bool.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1615283963-67277-1-git-send-email-jiapeng.chong@linux.alibaba.com
2021-03-19 13:25:07 +01:00
Thomas Gleixner a501b048a9 x86/ioapic: Ignore IRQ2 again
Vitaly ran into an issue with hotplugging CPU0 on an Amazon instance where
the matrix allocator claimed to be out of vectors. He analyzed it down to
the point that IRQ2, the PIC cascade interrupt, which is supposed to be not
ever routed to the IO/APIC ended up having an interrupt vector assigned
which got moved during unplug of CPU0.

The underlying issue is that IRQ2 for various reasons (see commit
af174783b9 ("x86: I/O APIC: Never configure IRQ2" for details) is treated
as a reserved system vector by the vector core code and is not accounted as
a regular vector. The Amazon BIOS has an routing entry of pin2 to IRQ2
which causes the IO/APIC setup to claim that interrupt which is granted by
the vector domain because there is no sanity check. As a consequence the
allocation counter of CPU0 underflows which causes a subsequent unplug to
fail with:

  [ ... ] CPU 0 has 4294967295 vectors, 589 available. Cannot disable CPU

There is another sanity check missing in the matrix allocator, but the
underlying root cause is that the IO/APIC code lost the IRQ2 ignore logic
during the conversion to irqdomains.

For almost 6 years nobody complained about this wreckage, which might
indicate that this requirement could be lifted, but for any system which
actually has a PIC IRQ2 is unusable by design so any routing entry has no
effect and the interrupt cannot be connected to a device anyway.

Due to that and due to history biased paranoia reasons restore the IRQ2
ignore logic and treat it as non existent despite a routing entry claiming
otherwise.

Fixes: d32932d02e ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210318192819.636943062@linutronix.de
2021-03-19 12:43:41 +01:00
Ingo Molnar 01438749e3 Merge branch 'locking/urgent' into locking/core, to pick up dependent commits
We are applying further, lower-prio fixes on top of two ww_mutex fixes in locking/urgent.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-19 12:10:49 +01:00
Joerg Roedel f15a0a732a x86/sev-es: Replace open-coded hlt-loops with sev_es_terminate()
There are a few places left in the SEV-ES C code where hlt loops and/or
terminate requests are implemented. Replace them all with calls to
sev_es_terminate().

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210312123824.306-9-joro@8bytes.org
2021-03-18 23:04:12 +01:00
Joerg Roedel fef81c8626 x86/boot/compressed/64: Check SEV encryption in the 32-bit boot-path
Check whether the hypervisor reported the correct C-bit when running
as an SEV guest. Using a wrong C-bit position could be used to leak
sensitive data from the guest to the hypervisor.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210312123824.306-8-joro@8bytes.org
2021-03-18 23:04:12 +01:00
Joerg Roedel e927e62d8e x86/boot/compressed/64: Add CPUID sanity check to 32-bit boot-path
The 32-bit #VC handler has no GHCB and can only handle CPUID exit codes.
It is needed by the early boot code to handle #VC exceptions raised in
verify_cpu() and to get the position of the C-bit.

But the CPUID information comes from the hypervisor which is untrusted
and might return results which trick the guest into the no-SEV boot path
with no C-bit set in the page-tables. All data written to memory would
then be unencrypted and could leak sensitive data to the hypervisor.

Add sanity checks to the 32-bit boot #VC handler to make sure the
hypervisor does not pretend that SEV is not enabled.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210312123824.306-7-joro@8bytes.org
2021-03-18 23:04:12 +01:00
Joerg Roedel 1ccdbf748d x86/boot/compressed/64: Add 32-bit boot #VC handler
Add a #VC exception handler which is used when the kernel still executes
in protected mode. This boot-path already uses CPUID, which will cause #VC
exceptions in an SEV-ES guest.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210312123824.306-6-joro@8bytes.org
2021-03-18 23:03:43 +01:00
Wanpeng Li f4e61f0c9a x86/kvm: Fix broken irq restoration in kvm_wait
After commit 997acaf6b4 (lockdep: report broken irq restoration), the guest
splatting below during boot:

 raw_local_irq_restore() called with IRQs enabled
 WARNING: CPU: 1 PID: 169 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x26/0x30
 Modules linked in: hid_generic usbhid hid
 CPU: 1 PID: 169 Comm: systemd-udevd Not tainted 5.11.0+ #25
 RIP: 0010:warn_bogus_irq_restore+0x26/0x30
 Call Trace:
  kvm_wait+0x76/0x90
  __pv_queued_spin_lock_slowpath+0x285/0x2e0
  do_raw_spin_lock+0xc9/0xd0
  _raw_spin_lock+0x59/0x70
  lockref_get_not_dead+0xf/0x50
  __legitimize_path+0x31/0x60
  legitimize_root+0x37/0x50
  try_to_unlazy_next+0x7f/0x1d0
  lookup_fast+0xb0/0x170
  path_openat+0x165/0x9b0
  do_filp_open+0x99/0x110
  do_sys_openat2+0x1f1/0x2e0
  do_sys_open+0x5c/0x80
  __x64_sys_open+0x21/0x30
  do_syscall_64+0x32/0x50
  entry_SYSCALL_64_after_hwframe+0x44/0xae

The new consistency checking,  expects local_irq_save() and
local_irq_restore() to be paired and sanely nested, and therefore expects
local_irq_restore() to be called with irqs disabled.
The irqflags handling in kvm_wait() which ends up doing:

	local_irq_save(flags);
	safe_halt();
	local_irq_restore(flags);

instead triggers it.  This patch fixes it by using
local_irq_disable()/enable() directly.

Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1615791328-2735-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-18 13:58:14 -04:00
Wanpeng Li c2162e13d6 KVM: X86: Fix missing local pCPU when executing wbinvd on all dirty pCPUs
In order to deal with noncoherent DMA, we should execute wbinvd on
all dirty pCPUs when guest wbinvd exits to maintain data consistency.
smp_call_function_many() does not execute the provided function on the
local core, therefore replace it by on_each_cpu_mask().

Reported-by: Nadav Amit <namit@vmware.com>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1615517151-7465-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-18 13:55:34 -04:00
Sean Christopherson b318e8decf KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish
Fix a plethora of issues with MSR filtering by installing the resulting
filter as an atomic bundle instead of updating the live filter one range
at a time.  The KVM_X86_SET_MSR_FILTER ioctl() isn't truly atomic, as
the hardware MSR bitmaps won't be updated until the next VM-Enter, but
the relevant software struct is atomically updated, which is what KVM
really needs.

Similar to the approach used for modifying memslots, make arch.msr_filter
a SRCU-protected pointer, do all the work configuring the new filter
outside of kvm->lock, and then acquire kvm->lock only when the new filter
has been vetted and created.  That way vCPU readers either see the old
filter or the new filter in their entirety, not some half-baked state.

Yuan Yao pointed out a use-after-free in ksm_msr_allowed() due to a
TOCTOU bug, but that's just the tip of the iceberg...

  - Nothing is __rcu annotated, making it nigh impossible to audit the
    code for correctness.
  - kvm_add_msr_filter() has an unpaired smp_wmb().  Violation of kernel
    coding style aside, the lack of a smb_rmb() anywhere casts all code
    into doubt.
  - kvm_clear_msr_filter() has a double free TOCTOU bug, as it grabs
    count before taking the lock.
  - kvm_clear_msr_filter() also has memory leak due to the same TOCTOU bug.

The entire approach of updating the live filter is also flawed.  While
installing a new filter is inherently racy if vCPUs are running, fixing
the above issues also makes it trivial to ensure certain behavior is
deterministic, e.g. KVM can provide deterministic behavior for MSRs with
identical settings in the old and new filters.  An atomic update of the
filter also prevents KVM from getting into a half-baked state, e.g. if
installing a filter fails, the existing approach would leave the filter
in a half-baked state, having already committed whatever bits of the
filter were already processed.

[*] https://lkml.kernel.org/r/20210312083157.25403-1-yaoyuan0329os@gmail.com

Fixes: 1a155254ff ("KVM: x86: Introduce MSR filtering")
Cc: stable@vger.kernel.org
Cc: Alexander Graf <graf@amazon.com>
Reported-by: Yuan Yao <yaoyuan0329os@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210316184436.2544875-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-18 13:55:14 -04:00
Joerg Roedel 79419e13e8 x86/boot/compressed/64: Setup IDT in startup_32 boot path
This boot path needs exception handling when it is used with SEV-ES.
Setup an IDT and provide a helper function to write IDT entries for
use in 32-bit protected mode.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210312123824.306-5-joro@8bytes.org
2021-03-18 16:44:46 +01:00
Joerg Roedel 0c289ff81c x86/boot/compressed/64: Reload CS in startup_32
Exception handling in the startup_32 boot path requires the CS
selector to be correctly set up. Reload it from the current GDT.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210312123824.306-4-joro@8bytes.org
2021-03-18 16:44:43 +01:00
Joerg Roedel eab696d8e8 x86/sev: Do not require Hypervisor CPUID bit for SEV guests
A malicious hypervisor could disable the CPUID intercept for an SEV or
SEV-ES guest and trick it into the no-SEV boot path, where it could
potentially reveal secrets. This is not an issue for SEV-SNP guests,
as the CPUID intercept can't be disabled for those.

Remove the Hypervisor CPUID bit check from the SEV detection code to
protect against this kind of attack and add a Hypervisor bit equals zero
check to the SME detection path to prevent non-encrypted guests from
trying to enable SME.

This handles the following cases:

	1) SEV(-ES) guest where CPUID intercept is disabled. The guest
	   will still see leaf 0x8000001f and the SEV bit. It can
	   retrieve the C-bit and boot normally.

	2) Non-encrypted guests with intercepted CPUID will check
	   the SEV_STATUS MSR and find it 0 and will try to enable SME.
	   This will fail when the guest finds MSR_K8_SYSCFG to be zero,
	   as it is emulated by KVM. But we can't rely on that, as there
	   might be other hypervisors which return this MSR with bit
	   23 set. The Hypervisor bit check will prevent that the guest
	   tries to enable SME in this case.

	3) Non-encrypted guests on SEV capable hosts with CPUID intercept
	   disabled (by a malicious hypervisor) will try to boot into
	   the SME path. This will fail, but it is also not considered
	   a problem because non-encrypted guests have no protection
	   against the hypervisor anyway.

 [ bp: s/non-SEV/non-encrypted/g ]

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20210312123824.306-3-joro@8bytes.org
2021-03-18 16:44:40 +01:00