commit d4dc4bf282 upstream.
Fixes off-by-one bugs in the macro assignments for the crashlog control
bits. Was initially tested on emulation but bug revealed after testing on
silicon.
Fixes: 5ef9998c96 ("platform/x86: Intel PMT Crashlog capability driver")
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: https://lore.kernel.org/r/20210317024455.3071477-2-david.e.box@linux.intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 501bb68a66 upstream.
Initialize the struct resource in intel_pmt_dev_register to zero to avoid a
fault should the char *name field be non-zero.
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: https://lore.kernel.org/r/20210317024455.3071477-1-david.e.box@linux.intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit aa47ad3f85 upstream.
Adds PMT Telemetry aggregator support for the DG1 graphics PCIe card. The
device does not have the DVSEC region in its PCI config space so hard
code the discovery table data in the driver. Also requires a fix for DG1
in the Telemetry driver for how the ACCESS_TYPE field is used.
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit a1a5c1c3df upstream.
Some products will be available that have PMT capabilities that are not
supported. Remove the warnings in this instance to avoid nuisance messages
and confusion.
Also return an error code for capabilities that are disabled by quirk to
prevent them from keeping the driver loaded if only disabled capabilities
are found.
Fixes: 4f8217d5b0 ("mfd: Intel Platform Monitoring Technology support")
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit b85a0425d8 upstream.
Add AVX version of the Vector Neural Network (VNNI) Instructions.
A processor supports AVX VNNI instructions if CPUID.0x07.0x1:EAX[4] is
present. The following instructions are available when this feature is
present.
1. VPDPBUS: Multiply and Add Unsigned and Signed Bytes
2. VPDPBUSDS: Multiply and Add Unsigned and Signed Bytes with Saturation
3. VPDPWSSD: Multiply and Add Signed Word Integers
4. VPDPWSSDS: Multiply and Add Signed Integers with Saturation
The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx_vnni" in /proc/cpuinfo.
This instruction is currently documented in the latest "extensions"
manual (ISE). It will appear in the "main" manual (SDM) in the future.
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20210105004909.42000-2-yang.zhong@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit fdd3feb37e upstream.
All devices that expose Intel Platform Monitoring Technology (PMT)
crashlog are currently owned by the intel_pmt MFD driver. Therefore make
the crashlog driver depend on the MFD driver for build.
Fixes: 5ef9998c96 ("platform/x86: Intel PMT Crashlog capability driver")
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: https://lore.kernel.org/r/20210126205508.30907-3-david.e.box@linux.intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit f3f6da5014 upstream.
All devices that expose Intel Platform Monitoring Technology (PMT)
telemetry are currently owned by the intel_pmt MFD driver. Therefore make
the telemetry driver depend on the MFD driver for build.
Fixes: 68fe8e6e2c ("platform/x86: Intel PMT Telemetry capability driver")
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: https://lore.kernel.org/r/20210126205508.30907-2-david.e.box@linux.intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 35d8a973fe upstream.
Fix error in Kconfig that exposed INTEL_PMT_CLASS as a user selectable
option. It is already selected by INTEL_PMT_TELEMETRY and
INTEL_PMT_CRASHLOG which are user selectable.
Fixes: e2729113ce ("platform/x86: Intel PMT class driver")
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: https://lore.kernel.org/r/20210126205508.30907-1-david.e.box@linux.intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit e1b35da5e6 upstream.
Enumerate AVX512 Half-precision floating point (FP16) CPUID feature
flag. Compared with using FP32, using FP16 cut the number of bits
required for storage in half, reducing the exponent from 8 bits to 5,
and the mantissa from 23 bits to 10. Using FP16 also enables developers
to train and run inference on deep learning models fast when all
precision or magnitude (FP32) is not needed.
A processor supports AVX512 FP16 if CPUID.(EAX=7,ECX=0):EDX[bit 23]
is present. The AVX512 FP16 requires AVX512BW feature be implemented
since the instructions for manipulating 32bit masks are associated with
AVX512BW.
The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx512_fp16" in /proc/cpuinfo.
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Message-Id: <20201208033441.28207-2-kyung.min.park@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 75b6f6487c upstream.
Intel NTB device has custom LTR management that is not compliant with the
PCIe standard. Add support to set LTR status triggered by link status
change.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit d3d73d25e0 upstream.
The "ns->attr_grp" pointer can be NULL so this error handling code needs
to check for that to avoid an Oops.
Fixes: e2729113ce ("platform/x86: Intel PMT class driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David E. Box <david.e.box@linux.intel.com>
Link: https://lore.kernel.org/r/20201117072251.GC1111239@mwanda
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 5ef9998c96 upstream.
Add support for the Intel Platform Monitoring Technology crashlog
interface. This interface provides a few sysfs values to allow for
controlling the crashlog telemetry interface as well as a character
driver to allow for mapping the crashlog memory region so that it can be
accessed after a crashlog has been recorded.
This driver is meant to only support the server version of the crashlog
which is identified as crash_type 1 with a version of zero. Currently no
other types are supported.
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 68fe8e6e2c upstream.
PMT Telemetry is a capability of the Intel Platform Monitoring Technology.
The Telemetry capability provides access to device telemetry metrics that
provide hardware performance data to users from read-only register spaces.
With this driver present the intel_pmt directory can be populated with
telem<x> devices. These devices will contain the standard intel_pmt sysfs
data and a "telem" binary sysfs attribute which can be used to access the
telemetry data.
Also create a PCI device id list for early telemetry hardware that require
workarounds for known issues.
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Co-developed-by: David E. Box <david.e.box@linux.intel.com>
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit e2729113ce upstream.
Intel Platform Monitoring Technology is meant to provide a common way to
access telemetry and system metrics.
Register mappings are not provided by the driver. Instead, a GUID is read
from a header for each endpoint. The GUID identifies the device and is to
be used with an XML, provided by the vendor, to discover the available set
of metrics and their register mapping. This allows firmware updates to
modify the register space without needing to update the driver every time
with new mappings. Firmware writes a new GUID in this case to specify the
new mapping. Software tools with access to the associated XML file can
then interpret the changes.
The module manages access to all Intel PMT endpoints on a system,
independent of the device exporting them. It creates an intel_pmt class
to manage the devices. For each telemetry endpoint, sysfs files provide
GUID and size information as well as a pointer to the parent device the
telemetry came from. Software may discover the association between
endpoints and devices by iterating through the list in sysfs, or by looking
for the existence of the class folder under the device of interest. A
binary sysfs attribute of the same name allows software to then read or map
the telemetry space for direct access.
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 4f8217d5b0 upstream.
Intel Platform Monitoring Technology (PMT) is an architecture for
enumerating and accessing hardware monitoring facilities. PMT supports
multiple types of monitoring capabilities. This driver creates platform
devices for each type so that they may be managed by capability specific
drivers (to be introduced). Capabilities are discovered using PCIe DVSEC
ids. Support is included for the 3 current capability types, Telemetry,
Watcher, and Crashlog. The features are available on new Intel platforms
starting from Tiger Lake for which support is added. This patch adds
support for Tiger Lake (TGL), Alder Lake (ADL), and Out-of-Band Management
Services Module (OOBMSM).
Also add a quirk mechanism for several early hardware differences and bugs.
For Tiger Lake and Alder Lake, do not support Watcher and Crashlog
capabilities since they will not be compatible with future product. Also,
fix use a quirk to fix the discovery table offset.
Co-developed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 1dc2da5cd5 upstream.
Add PCIe Designated Vendor-Specific Extended Capability (DVSEC) and defines
for the header offsets. Defined in PCIe r5.0, sec 7.9.6.
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit e7af1ed3fa upstream.
Initial support for models recently added to intel-family.h.
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 23274faf96 upstream.
From a turbostat point of view, Ice Lake server looks like Sky Lake server.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 4bf7132a0a upstream.
From a turbostat point of view, Tiger Lake looks like Ice Lake.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 18ec63faef upstream.
Intel TSX suspend load tracking instructions aim to give a way to choose
which memory accesses do not need to be tracked in the TSX read set. Add
TSX suspend load tracking CPUID feature flag TSXLDTRK for enumeration.
A processor supports Intel TSX suspend load address tracking if
CPUID.0x07.0x0:EDX[16] is present. Two instructions XSUSLDTRK, XRESLDTRK
are available when this feature is present.
The CPU feature flag is shown as "tsxldtrk" in /proc/cpuinfo.
Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1598316478-23337-2-git-send-email-cathy.zhang@intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit bcfd218b66 upstream.
Intel SPR platform uses fixed 16 bit energy unit for DRAM RAPL domain,
and fixed 0 bit energy unit for Psys RAPL domain.
After this, on SPR platform the energy counters appear in perf list.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Acked-by: Len Brown <len.brown@intel.com>
Link: https://lore.kernel.org/r/20200811153149.12242-4-rui.zhang@intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit be25d1b5ea upstream.
Latest edition (039) of "Intel Architecture Instruction Set Extensions
and Future Features Programming Reference" includes three new CPU model
numbers. Linux already has the two Ice Lake server ones. Add the new
model number for Sapphire Rapids.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200603173352.15506-1-tony.luck@intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 74f41adab0 upstream.
There will be more platforms with different fixed energy units.
Enhance the code to support different RAPL unit quirks for different
platforms.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Link: https://lore.kernel.org/r/20200811153149.12242-3-rui.zhang@intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 4c953f8794 upstream.
This patch modifies perf_probe_msr() by allowing passing of
struct perf_msr array where some entries are not populated, i.e.,
they have either an msr address of 0 or no attribute_group pointer.
This helps with certain call paths, e.g., RAPL.
In case the grp is NULL, the default sysfs visibility rule
applies which is to make the group visible. Without the patch,
you would get a kernel crash with a NULL group.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200527224659.206129-5-eranian@google.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 2a3e3f73a2 upstream.
This patch modifies the default visibility of the attribute_group
for each RAPL event. By default if the grp.is_visible field is NULL,
sysfs considers that it must display the attribute group.
If the field is not NULL (callback function), then the return value
of the callback determines the visibility (0 = not visible). The RAPL
attribute groups had the field set to NULL, meaning that unless they
failed the probing from perf_msr_probe(), they would be visible. We want
to avoid having to specify attribute groups that are not supported by the HW
in the rapl_msrs[] array, they don't have an MSR address to begin with.
Therefore, we intialize the visible field of all RAPL attribute groups
to a callback that returns 0. If the RAPL msr goes through probing
and succeeds the is_visible field will be set back to NULL (visible).
If the probing fails the field is set to a callback that return 0 (not visible).
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200527224659.206129-4-eranian@google.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 5c95c68949 upstream.
This patch modifies the rapl_model struct to include architecture specific
knowledge in this previously Intel specific structure, and in particular
it adds the MSR for POWER_UNIT and the rapl_msrs array.
No functional changes.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200527224659.206129-3-eranian@google.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit f69ca629d8 upstream.
Instead of having #ifdef/#endif blocks inside sync_core() for X86_64 and
X86_32, implement the new function iret_to_self() with two versions.
In this manner, avoid having to use even more more #ifdef/#endif blocks
when adding support for SERIALIZE in sync_core().
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200727043132.15082-4-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 9998a9832c upstream.
Having sync_core() in processor.h is problematic since it is not possible
to check for hardware capabilities via the *cpu_has() family of macros.
The latter needs the definitions in processor.h.
It also looks more intuitive to relocate the function to sync_core.h.
This changeset does not make changes in functionality.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200727043132.15082-3-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 85b23fbc7d upstream.
The Intel architecture defines a set of Serializing Instructions (a
detailed definition can be found in Vol.3 Section 8.3 of the Intel "main"
manual, SDM). However, these instructions do more than what is required,
have side effects and/or may be rather invasive. Furthermore, some of
these instructions are only available in kernel mode or may cause VMExits.
Thus, software using these instructions only to serialize execution (as
defined in the manual) must handle the undesired side effects.
As indicated in the name, SERIALIZE is a new Intel architecture
Serializing Instruction. Crucially, it does not have any of the mentioned
side effects. Also, it does not cause VMExit and can be used in user mode.
This new instruction is currently documented in the latest "extensions"
manual (ISE). It will appear in the "main" manual in the future.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20200727043132.15082-2-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit c085fb8774 upstream.
Reading LBR registers in a perf NMI handler for a non-PEBS event
causes a high overhead because the number of LBR registers is huge.
To reduce the overhead, the XSAVES instruction should be used to replace
the LBR registers' reading method.
The XSAVES buffer used for LBR read has to be per-CPU because the NMI
handler invoked the lbr_read(). The existing task_ctx_data buffer
cannot be used which is per-task and only be allocated for the LBR call
stack mode. A new lbr_xsave pointer is introduced in the cpu_hw_events
as an XSAVES buffer for LBR read.
The XSAVES buffer should be allocated only when LBR is used by a
non-PEBS event on the CPU because the total size of the lbr_xsave is
not small (~1.4KB).
The XSAVES buffer is allocated when a non-PEBS event is added, but it
is lazily released in x86_release_hardware() when perf releases the
entire PMU hardware resource, because perf may frequently schedule the
event, e.g. high context switch. The lazy release method reduces the
overhead of frequently allocate/free the buffer.
If the lbr_xsave fails to be allocated, roll back to normal Arch LBR
lbr_read().
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-24-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit ce711ea3ca upstream.
In the LBR call stack mode, LBR information is used to reconstruct a
call stack. To get the complete call stack, perf has to save/restore
all LBR registers during a context switch. Due to a large number of the
LBR registers, this process causes a high CPU overhead. To reduce the
CPU overhead during a context switch, use the XSAVES/XRSTORS
instructions.
Every XSAVE area must follow a canonical format: the legacy region, an
XSAVE header and the extended region. Although the LBR information is
only kept in the extended region, a space for the legacy region and
XSAVE header is still required. Add a new dedicated structure for LBR
XSAVES support.
Before enabling XSAVES support, the size of the LBR state has to be
sanity checked, because:
- the size of the software structure is calculated from the max number
of the LBR depth, which is enumerated by the CPUID leaf for Arch LBR.
The size of the LBR state is enumerated by the CPUID leaf for XSAVE
support of Arch LBR. If the values from the two CPUID leaves are not
consistent, it may trigger a buffer overflow. For example, a hypervisor
may unconsciously set inconsistent values for the two emulated CPUID.
- unlike other state components, the size of an LBR state depends on the
max number of LBRs, which may vary from generation to generation.
Expose the function xfeature_size() for the sanity check.
The LBR XSAVES support will be disabled if the size of the LBR state
enumerated by CPUID doesn't match with the size of the software
structure.
The XSAVE instruction requires 64-byte alignment for state buffers. A
new macro is added to reflect the alignment requirement. A 64-byte
aligned kmem_cache is created for architecture LBR.
Currently, the structure for each state component is maintained in
fpu/types.h. The structure for the new LBR state component should be
maintained in the same place. Move structure lbr_entry to fpu/types.h as
well for broader sharing.
Add dedicated lbr_save/lbr_restore functions for LBR XSAVES support,
which invokes the corresponding xstate helpers to XSAVES/XRSTORS LBR
information at the context switch when the call stack mode is enabled.
Since the XSAVES/XRSTORS instructions will be eventually invoked, the
dedicated functions is named with '_xsaves'/'_xrstors' postfix.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lkml.kernel.org/r/1593780569-62993-23-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 5a09928d33 upstream.
A new kmem_cache method has replaced the kzalloc() to allocate the PMU
specific data. The task_ctx_size is not required anymore.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-19-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 33cad28449 upstream.
A new kmem_cache method is introduced to allocate the PMU specific data
task_ctx_data, which requires the PMU specific code to create a
kmem_cache.
Currently, the task_ctx_data is only used by the Intel LBR call stack
feature, which is introduced since Haswell. The kmem_cache should be
only created for Haswell and later platforms. There is no alignment
requirement for the existing platforms.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-18-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 217c2a633e upstream.
Currently, the PMU specific data task_ctx_data is allocated by the
function kzalloc() in the perf generic code. When there is no specific
alignment requirement for the task_ctx_data, the method works well for
now. However, there will be a problem once a specific alignment
requirement is introduced in future features, e.g., the Architecture LBR
XSAVE feature requires 64-byte alignment. If the specific alignment
requirement is not fulfilled, the XSAVE family of instructions will fail
to save/restore the xstate to/from the task_ctx_data.
The function kzalloc() itself only guarantees a natural alignment. A
new method to allocate the task_ctx_data has to be introduced, which
has to meet the requirements as below:
- must be a generic method can be used by different architectures,
because the allocation of the task_ctx_data is implemented in the
perf generic code;
- must be an alignment-guarantee method (The alignment requirement is
not changed after the boot);
- must be able to allocate/free a buffer (smaller than a page size)
dynamically;
- should not cause extra CPU overhead or space overhead.
Several options were considered as below:
- One option is to allocate a larger buffer for task_ctx_data. E.g.,
ptr = kmalloc(size + alignment, GFP_KERNEL);
ptr &= ~(alignment - 1);
This option causes space overhead.
- Another option is to allocate the task_ctx_data in the PMU specific
code. To do so, several function pointers have to be added. As a
result, both the generic structure and the PMU specific structure
will become bigger. Besides, extra function calls are added when
allocating/freeing the buffer. This option will increase both the
space overhead and CPU overhead.
- The third option is to use a kmem_cache to allocate a buffer for the
task_ctx_data. The kmem_cache can be created with a specific alignment
requirement by the PMU at boot time. A new pointer for kmem_cache has
to be added in the generic struct pmu, which would be used to
dynamically allocate a buffer for the task_ctx_data at run time.
Although the new pointer is added to the struct pmu, the existing
variable task_ctx_size is not required anymore. The size of the
generic structure is kept the same.
The third option which meets all the aforementioned requirements is used
to replace kzalloc() for the PMU specific data allocation. A later patch
will remove the kzalloc() method and the related variables.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-17-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit ff9ff92688 upstream.
The method to allocate/free the task_ctx_data is going to be changed in
the following patch. Currently, the task_ctx_data is allocated/freed in
several different places. To avoid repeatedly modifying the same codes
in several different places, alloc_task_ctx_data() and
free_task_ctx_data() are factored out to allocate/free the
task_ctx_data. The modification only needs to be applied once.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-16-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 47125db27e upstream.
Last Branch Records (LBR) enables recording of software path history by
logging taken branches and other control flows within architectural
registers now. Intel CPUs have had model-specific LBR for quite some
time, but this evolves them into an architectural feature now.
The main improvements of Architectural LBR implemented includes:
- Linux kernel can support the LBR features without knowing the model
number of the current CPU.
- Architectural LBR capabilities can be enumerated by CPUID. The
lbr_ctl_map is based on the CPUID Enumeration.
- The possible LBR depth can be retrieved from CPUID enumeration. The
max value is written to the new MSR_ARCH_LBR_DEPTH as the number of
LBR entries.
- A new IA32_LBR_CTL MSR is introduced to enable and configure LBRs,
which replaces the IA32_DEBUGCTL[bit 0] and the LBR_SELECT MSR.
- Each LBR record or entry is still comprised of three MSRs,
IA32_LBR_x_FROM_IP, IA32_LBR_x_TO_IP and IA32_LBR_x_TO_IP.
But they become the architectural MSRs.
- Architectural LBR is stack-like now. Entry 0 is always the youngest
branch, entry 1 the next youngest... The TOS MSR has been removed.
The way to enable/disable Architectural LBR is similar to the previous
model-specific LBR. __intel_pmu_lbr_enable/disable() can be reused, but
some modifications are required, which include:
- MSR_ARCH_LBR_CTL is used to enable and configure the Architectural
LBR.
- When checking the value of the IA32_DEBUGCTL MSR, ignoring the
DEBUGCTLMSR_LBR (bit 0) for Architectural LBR, which has no meaning
and always return 0.
- The FREEZE_LBRS_ON_PMI has to be explicitly set/clear, because
MSR_IA32_DEBUGCTLMSR is not touched in __intel_pmu_lbr_disable() for
Architectural LBR.
- Only MSR_ARCH_LBR_CTL is cleared in __intel_pmu_lbr_disable() for
Architectural LBR.
Some Architectural LBR dedicated functions are implemented to
reset/read/save/restore LBR.
- For reset, writing to the ARCH_LBR_DEPTH MSR clears all Arch LBR
entries, which is a lot faster and can improve the context switch
latency.
- For read, the branch type information can be retrieved from
the MSR_ARCH_LBR_INFO_*. But it's not fully compatible due to
OTHER_BRANCH type. The software decoding is still required for the
OTHER_BRANCH case.
LBR records are stored in the age order as well. Reuse
intel_pmu_store_lbr(). Check the CPUID enumeration before accessing
the corresponding bits in LBR_INFO.
- For save/restore, applying the fast reset (writing ARCH_LBR_DEPTH).
Reading 'lbr_from' of entry 0 instead of the TOS MSR to check if the
LBR registers are reset in the deep C-state. If 'the deep C-state
reset' bit is not set in CPUID enumeration, ignoring the check.
XSAVE support for Architectural LBR will be implemented later.
The number of LBR entries cannot be hardcoded anymore, which should be
retrieved from CPUID enumeration. A new structure
x86_perf_task_context_arch_lbr is introduced for Architectural LBR.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-15-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 631618a0dc upstream.
The way to store the LBR information from a PEBS LBR record can be
reused in Architecture LBR, because
- The LBR information is stored like a stack. Entry 0 is always the
youngest branch.
- The layout of the LBR INFO MSR is similar.
The LBR information may be retrieved from either the LBR registers
(non-PEBS event) or a buffer (PEBS event). Extend rdlbr_*() to support
both methods.
Explicitly check the invalid entry (0s), which can avoid unnecessary MSR
access if using a non-PEBS event. For a PEBS event, the check should
slightly improve the performance as well. The invalid entries are cut.
The intel_pmu_lbr_filter() doesn't need to check and filter them out.
Cannot share the function with current model-specific LBR read, because
the direction of the LBR growth is opposite.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-14-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit fda1f99f34 upstream.
The previous model-specific LBR and Architecture LBR (legacy way) use a
similar method to save/restore the LBR information, which directly
accesses the LBR registers. The codes which read/write a set of LBR
registers can be shared between them.
Factor out two functions which are used to read/write a set of LBR
registers.
Add lbr_info into structure x86_pmu, and use it to replace the hardcoded
LBR INFO MSR, because the LBR INFO MSR address of the previous
model-specific LBR is different from Architecture LBR. The MSR address
should be assigned at boot time. For now, only Sky Lake and later
platforms have the LBR INFO MSR.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-13-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 020d91e5f3 upstream.
The {rd,wr}lbr_{to,from} wrappers are invoked in hot paths, e.g. context
switch and NMI handler. They should be always inline to achieve better
performance. However, the CONFIG_OPTIMIZE_INLINING allows the compiler
to uninline functions marked 'inline'.
Mark the {rd,wr}lbr_{to,from} wrappers as __always_inline to force
inline the wrappers.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-12-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 5624986dc6 upstream.
Current LBR information in the structure x86_perf_task_context is stored
in a different format from the PEBS LBR record and Architecture LBR,
which prevents the sharing of the common codes.
Use the format of the PEBS LBR record as a unified format. Use a generic
name lbr_entry to replace pebs_lbr_entry.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-11-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit db278b90c3 upstream.
For Intel LBR, the LBR Top-of-Stack (TOS) information is the HW index of
raw branch record for the most recent branch.
For non-adaptive PEBS and non-PEBS, the TOS information can be directly
retrieved from TOS MSR read in intel_pmu_lbr_read().
For adaptive PEBS, the LBR information stored in PEBS record doesn't
include the TOS information. For single PEBS, TOS can be directly read
from MSR, because the PMI is triggered immediately after PEBS is
written. TOS MSR is still unchanged.
For large PEBS, TOS MSR has stale value. Set -1ULL to indicate that the
TOS information is not available.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200127165355.27495-3-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit bbfd5e4fab upstream.
The low level index is the index in the underlying hardware buffer of
the most recently captured taken branch which is always saved in
branch_entries[0]. It is very useful for reconstructing the call stack.
For example, in Intel LBR call stack mode, the depth of reconstructed
LBR call stack limits to the number of LBR registers. With the low level
index information, perf tool may stitch the stacks of two samples. The
reconstructed LBR call stack can break the HW limitation.
Add a new branch sample type to retrieve low level index of raw branch
records. The low level index is between -1 (unknown) and max depth which
can be retrieved in /sys/devices/cpu/caps/branches.
Only when the new branch sample type is set, the low level index
information is dumped into the PERF_SAMPLE_BRANCH_STACK output.
Perf tool should check the attr.branch_sample_type, and apply the
corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
Otherwise, some user case may be broken. For example, users may parse a
perf.data, which include the new branch sample type, with an old version
perf tool (without the check). Users probably get incorrect information
without any warning.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200127165355.27495-2-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 49d8184f20 upstream.
An IA32_LBR_CTL is introduced for Architecture LBR to enable and config
LBR registers to replace the previous LBR_SELECT.
All the related members in struct cpu_hw_events and struct x86_pmu
have to be renamed.
Some new macros are added to reflect the layout of LBR_CTL.
The mapping from PERF_SAMPLE_BRANCH_* to the corresponding bits in
LBR_CTL MSR is saved in lbr_ctl_map now, which is not a const value.
The value relies on the CPUID enumeration.
For the previous model-specific LBR, most of the bits in LBR_SELECT
operate in the suppressed mode. For the bits in LBR_CTL, the polarity is
inverted.
For the previous model-specific LBR format 5 (LBR_FORMAT_INFO), if the
NO_CYCLES and NO_FLAGS type are set, the flag LBR_NO_INFO will be set to
avoid the unnecessary LBR_INFO MSR read. Although Architecture LBR also
has a dedicated LBR_INFO MSR, perf doesn't need to check and set the
flag LBR_NO_INFO. For Architecture LBR, XSAVES instruction will be used
as the default way to read the LBR MSRs all together. The overhead which
the flag tries to avoid doesn't exist anymore. Dropping the flag can
save the extra check for the flag in the lbr_read() later, and make the
code cleaner.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-10-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit af6cf12970 upstream.
The LBR capabilities of Architecture LBR are retrieved from the CPUID
enumeration once at boot time. The capabilities have to be saved for
future usage.
Several new fields are added into structure x86_pmu to indicate the
capabilities. The fields will be used in the following patches.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-9-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit d6a162a41b upstream.
Add Arch LBR related MSRs and the new LBR INFO bits in MSR-index.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-8-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit f42be8651a upstream.
The type of task_ctx is hardcoded as struct x86_perf_task_context,
which doesn't apply for Architecture LBR. For example, Architecture LBR
doesn't have the TOS MSR. The number of LBR entries is variable. A new
struct will be introduced for Architecture LBR. Perf has to determine
the type of task_ctx at run time.
The type of task_ctx pointer is changed to 'void *', which will be
determined at run time.
The generic LBR optimization can be shared between Architecture LBR and
model-specific LBR. Both need to access the structure for the generic
LBR optimization. A helper task_context_opt() is introduced to retrieve
the pointer of the structure at run time.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-7-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 530bfff648 upstream.
To reduce the overhead of a context switch with LBR enabled, some
generic optimizations were introduced, e.g. avoiding restore LBR if no
one else touched them. The generic optimizations can also be used by
Architecture LBR later. Currently, the fields for the generic
optimizations are part of structure x86_perf_task_context, which will be
deprecated by Architecture LBR. A new structure should be introduced
for the common fields of generic optimization, which can be shared
between Architecture LBR and model-specific LBR.
Both 'valid_lbrs' and 'tos' are also used by the generic optimizations,
but they are not moved into the new structure, because Architecture LBR
is stack-like. The 'valid_lbrs' which records the index of the valid LBR
is not required anymore. The TOS MSR will be removed.
LBR registers may be cleared in the deep Cstate. If so, the generic
optimizations should not be applied. Perf has to unconditionally
restore the LBR registers. A generic function is required to detect the
reset due to the deep Cstate. lbr_is_reset_in_cstate() is introduced.
Currently, for the model-specific LBR, the TOS MSR is used to detect the
reset. There will be another method introduced for Architecture LBR
later.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-6-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>
commit 799571bf38 upstream.
The MSRs of Architectural LBR are different from previous model-specific
LBR. Perf has to implement different functions to save and restore them.
The function pointers for LBR save and restore are introduced. Perf
should initialize the corresponding functions at boot time.
The generic optimizations, e.g. avoiding restore LBR if no one else
touched them, still apply for Architectural LBRs. The related codes are
not moved to model-specific functions.
Current model-specific LBR functions are set as default.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-5-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Xinghui Li <korantli@tencent.com>