perf tools changes for v6.0: 1st batch

- Introduce 'perf lock contention' subtool, using new lock contention tracepoints and using BPF for in kernel aggregation and then userspace processing using the perf tooling infrastructure for resolving symbols, target specification, etc. Since the new lock contention tracepoints don't provide lock names, get up to 8 stack traces and display the first non-lock function symbol name as a caller: $ perf lock report -F acquired,contended,avg_wait,wait_total Name acquired contended avg wait total wait update_blocked_a... 40 40 3.61 us 144.45 us kernfs_fop_open+... 5 5 3.64 us 18.18 us _nohz_idle_balance 3 3 2.65 us 7.95 us tick_do_update_j... 1 1 6.04 us 6.04 us ep_scan_ready_list 1 1 3.93 us 3.93 us Supports the usual 'perf record' + 'perf report' workflow as well as a BCC/bpftrace like mode where you start the tool and then press control+C to get results: $ sudo perf lock contention -b ^C contended total wait max wait avg wait type caller 42 192.67 us 13.64 us 4.59 us spinlock queue_work_on+0x20 23 85.54 us 10.28 us 3.72 us spinlock worker_thread+0x14a 6 13.92 us 6.51 us 2.32 us mutex kernfs_iop_permission+0x30 3 11.59 us 10.04 us 3.86 us mutex kernfs_dop_revalidate+0x3c 1 7.52 us 7.52 us 7.52 us spinlock kthread+0x115 1 7.24 us 7.24 us 7.24 us rwlock:W sys_epoll_wait+0x148 2 7.08 us 3.99 us 3.54 us spinlock delayed_work_timer_fn+0x1b 1 6.41 us 6.41 us 6.41 us spinlock idle_balance+0xa06 2 2.50 us 1.83 us 1.25 us mutex kernfs_iop_lookup+0x2f 1 1.71 us 1.71 us 1.71 us mutex kernfs_iop_getattr+0x2c ... - Add new 'perf kwork' tool to trace time properties of kernel work (such as softirq, and workqueue), uses eBPF skeletons to collect info in kernel space, aggregating data that then gets processed by the userspace tool, e.g.: # perf kwork report Kwork Name | Cpu | Total Runtime | Count | Max runtime | Max runtime start | Max runtime end | ---------------------------------------------------------------------------------------------------- nvme0q5:130 | 004 | 1.101 ms | 49 | 0.051 ms | 26035.056403 s | 26035.056455 s | amdgpu:162 | 002 | 0.176 ms | 9 | 0.046 ms | 26035.268020 s | 26035.268066 s | nvme0q24:149 | 023 | 0.161 ms | 55 | 0.009 ms | 26035.655280 s | 26035.655288 s | nvme0q20:145 | 019 | 0.090 ms | 33 | 0.014 ms | 26035.939018 s | 26035.939032 s | nvme0q31:156 | 030 | 0.075 ms | 21 | 0.010 ms | 26035.052237 s | 26035.052247 s | nvme0q8:133 | 007 | 0.062 ms | 12 | 0.021 ms | 26035.416840 s | 26035.416861 s | nvme0q6:131 | 005 | 0.054 ms | 22 | 0.010 ms | 26035.199919 s | 26035.199929 s | nvme0q19:144 | 018 | 0.052 ms | 14 | 0.010 ms | 26035.110615 s | 26035.110625 s | nvme0q7:132 | 006 | 0.049 ms | 13 | 0.007 ms | 26035.125180 s | 26035.125187 s | nvme0q18:143 | 017 | 0.033 ms | 14 | 0.007 ms | 26035.169698 s | 26035.169705 s | nvme0q17:142 | 016 | 0.013 ms | 1 | 0.013 ms | 26035.565147 s | 26035.565160 s | enp5s0-rx-0:164 | 006 | 0.004 ms | 4 | 0.002 ms | 26035.928882 s | 26035.928884 s | enp5s0-tx-0:166 | 008 | 0.003 ms | 3 | 0.002 ms | 26035.870923 s | 26035.870925 s | -------------------------------------------------------------------------------------------------------- See commit log messages for more examples with extra options to limit the events time window, etc. - Add support for new AMD IBS (Instruction Based Sampling) features: With the DataSrc extensions, the source of data can be decoded among: - Local L3 or other L1/L2 in CCX. - A peer cache in a near CCX. - Data returned from DRAM. - A peer cache in a far CCX. - DRAM address map with "long latency" bit set. - Data returned from MMIO/Config/PCI/APIC. - Extension Memory (S-Link, GenZ, etc - identified by the CS target and/or address map at DF's choice). - Peer Agent Memory. - Support hardware tracing with Intel PT on guest machines, combining the traces with the ones in the host machine. - Add a "-m" option to 'perf buildid-list' to show kernel and modules build-ids, to display all of the information needed to do external symbolization of kernel stack traces, such as those collected by bpf_get_stackid(). - Add arch TSC frequency information to perf.data file headers. - Handle changes in the binutils disassembler function signatures in perf, bpftool and bpf_jit_disasm (Acked by the bpftool maintainer). - Fix building the perf perl binding with the newest gcc in distros such as fedora rawhide, where some new warnings were breaking the build as perf uses -Werror. - Add 'perf test' entry for branch stack sampling. - Add ARM SPE system wide 'perf test' entry. - Add user space counter reading tests to 'perf test'. - Build with python3 by default, if available. - Add python converter script for the vendor JSON event files. - Update vendor JSON files for alderlake, bonnell, broadwell, broadwellde, broadwellx, cascadelakex, elkhartlake, goldmont, goldmontplus, haswell, haswellx, icelake, icelakex, ivybridge, ivytown, jaketown, knightslanding, nehalemep, nehalemex, sandybridge, sapphirerapids, silvermont, skylake, skylakex, snowridgex, tigerlake, westmereep-dp, westmereep-sp and westmereex. - Add vendor JSON File for Intel meteorlake. - Add Arm Cortex-A78C and X1C JSON vendor event files. - Add workaround to symbol address reading from ELF files without phdr, falling back to the previoous equation. - Convert legacy map definition to BTF-defined in the perf BPF script test. - Rework prologue generation code to stop using libbpf deprecated APIs. - Add default hybrid events for 'perf stat' on x86. - Add topdown metrics in the default 'perf stat' on the hybrid machines (big/little cores). - Prefer sampled CPU when exporting JSON in 'perf data convert' - Fix ('perf stat CSV output linter') and ("Check branch stack sampling") 'perf test' entries on s390. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCYuw6gwAKCRCyPKLppCJ+ J5+iAP0RL6sKMhzdkRjRYfG8CluJ401YaPHadzv5jxP8gOZz2gEAsuYDrMF9t1zB 4DqORfobdX9UQEJjP9oRltU73GM0swI= =2/M0 -----END PGP SIGNATURE----- Merge tag 'perf-tools-for-v6.0-2022-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux Pull perf tools updates from Arnaldo Carvalho de Melo: - Introduce 'perf lock contention' subtool, using new lock contention tracepoints and using BPF for in kernel aggregation and then userspace processing using the perf tooling infrastructure for resolving symbols, target specification, etc. Since the new lock contention tracepoints don't provide lock names, get up to 8 stack traces and display the first non-lock function symbol name as a caller: $ perf lock report -F acquired,contended,avg_wait,wait_total Name acquired contended avg wait total wait update_blocked_a... 40 40 3.61 us 144.45 us kernfs_fop_open+... 5 5 3.64 us 18.18 us _nohz_idle_balance 3 3 2.65 us 7.95 us tick_do_update_j... 1 1 6.04 us 6.04 us ep_scan_ready_list 1 1 3.93 us 3.93 us Supports the usual 'perf record' + 'perf report' workflow as well as a BCC/bpftrace like mode where you start the tool and then press control+C to get results: $ sudo perf lock contention -b ^C contended total wait max wait avg wait type caller 42 192.67 us 13.64 us 4.59 us spinlock queue_work_on+0x20 23 85.54 us 10.28 us 3.72 us spinlock worker_thread+0x14a 6 13.92 us 6.51 us 2.32 us mutex kernfs_iop_permission+0x30 3 11.59 us 10.04 us 3.86 us mutex kernfs_dop_revalidate+0x3c 1 7.52 us 7.52 us 7.52 us spinlock kthread+0x115 1 7.24 us 7.24 us 7.24 us rwlock:W sys_epoll_wait+0x148 2 7.08 us 3.99 us 3.54 us spinlock delayed_work_timer_fn+0x1b 1 6.41 us 6.41 us 6.41 us spinlock idle_balance+0xa06 2 2.50 us 1.83 us 1.25 us mutex kernfs_iop_lookup+0x2f 1 1.71 us 1.71 us 1.71 us mutex kernfs_iop_getattr+0x2c ... - Add new 'perf kwork' tool to trace time properties of kernel work (such as softirq, and workqueue), uses eBPF skeletons to collect info in kernel space, aggregating data that then gets processed by the userspace tool, e.g.: # perf kwork report Kwork Name | Cpu | Total Runtime | Count | Max runtime | Max runtime start | Max runtime end | ---------------------------------------------------------------------------------------------------- nvme0q5:130 | 004 | 1.101 ms | 49 | 0.051 ms | 26035.056403 s | 26035.056455 s | amdgpu:162 | 002 | 0.176 ms | 9 | 0.046 ms | 26035.268020 s | 26035.268066 s | nvme0q24:149 | 023 | 0.161 ms | 55 | 0.009 ms | 26035.655280 s | 26035.655288 s | nvme0q20:145 | 019 | 0.090 ms | 33 | 0.014 ms | 26035.939018 s | 26035.939032 s | nvme0q31:156 | 030 | 0.075 ms | 21 | 0.010 ms | 26035.052237 s | 26035.052247 s | nvme0q8:133 | 007 | 0.062 ms | 12 | 0.021 ms | 26035.416840 s | 26035.416861 s | nvme0q6:131 | 005 | 0.054 ms | 22 | 0.010 ms | 26035.199919 s | 26035.199929 s | nvme0q19:144 | 018 | 0.052 ms | 14 | 0.010 ms | 26035.110615 s | 26035.110625 s | nvme0q7:132 | 006 | 0.049 ms | 13 | 0.007 ms | 26035.125180 s | 26035.125187 s | nvme0q18:143 | 017 | 0.033 ms | 14 | 0.007 ms | 26035.169698 s | 26035.169705 s | nvme0q17:142 | 016 | 0.013 ms | 1 | 0.013 ms | 26035.565147 s | 26035.565160 s | enp5s0-rx-0:164 | 006 | 0.004 ms | 4 | 0.002 ms | 26035.928882 s | 26035.928884 s | enp5s0-tx-0:166 | 008 | 0.003 ms | 3 | 0.002 ms | 26035.870923 s | 26035.870925 s | -------------------------------------------------------------------------------------------------------- See commit log messages for more examples with extra options to limit the events time window, etc. - Add support for new AMD IBS (Instruction Based Sampling) features: With the DataSrc extensions, the source of data can be decoded among: - Local L3 or other L1/L2 in CCX. - A peer cache in a near CCX. - Data returned from DRAM. - A peer cache in a far CCX. - DRAM address map with "long latency" bit set. - Data returned from MMIO/Config/PCI/APIC. - Extension Memory (S-Link, GenZ, etc - identified by the CS target and/or address map at DF's choice). - Peer Agent Memory. - Support hardware tracing with Intel PT on guest machines, combining the traces with the ones in the host machine. - Add a "-m" option to 'perf buildid-list' to show kernel and modules build-ids, to display all of the information needed to do external symbolization of kernel stack traces, such as those collected by bpf_get_stackid(). - Add arch TSC frequency information to perf.data file headers. - Handle changes in the binutils disassembler function signatures in perf, bpftool and bpf_jit_disasm (Acked by the bpftool maintainer). - Fix building the perf perl binding with the newest gcc in distros such as fedora rawhide, where some new warnings were breaking the build as perf uses -Werror. - Add 'perf test' entry for branch stack sampling. - Add ARM SPE system wide 'perf test' entry. - Add user space counter reading tests to 'perf test'. - Build with python3 by default, if available. - Add python converter script for the vendor JSON event files. - Update vendor JSON files for most Intel cores. - Add vendor JSON File for Intel meteorlake. - Add Arm Cortex-A78C and X1C JSON vendor event files. - Add workaround to symbol address reading from ELF files without phdr, falling back to the previoous equation. - Convert legacy map definition to BTF-defined in the perf BPF script test. - Rework prologue generation code to stop using libbpf deprecated APIs. - Add default hybrid events for 'perf stat' on x86. - Add topdown metrics in the default 'perf stat' on the hybrid machines (big/little cores). - Prefer sampled CPU when exporting JSON in 'perf data convert' - Fix ('perf stat CSV output linter') and ("Check branch stack sampling") 'perf test' entries on s390. * tag 'perf-tools-for-v6.0-2022-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (169 commits) perf stat: Refactor __run_perf_stat() common code perf lock: Print the number of lost entries for BPF perf lock: Add --map-nr-entries option perf lock: Introduce struct lock_contention perf scripting python: Do not build fail on deprecation warnings genelf: Use HAVE_LIBCRYPTO_SUPPORT, not the never defined HAVE_LIBCRYPTO perf build: Suppress openssl v3 deprecation warnings in libcrypto feature test perf parse-events: Break out tracepoint and printing perf parse-events: Don't #define YY_EXTRA_TYPE tools bpftool: Don't display disassembler-four-args feature test tools bpftool: Fix compilation error with new binutils tools bpf_jit_disasm: Don't display disassembler-four-args feature test tools bpf_jit_disasm: Fix compilation error with new binutils tools perf: Fix compilation error with new binutils tools include: add dis-asm-compat.h to handle version differences tools build: Don't display disassembler-four-args feature test tools build: Add feature test for init_disassemble_info API changes perf test: Add ARM SPE system wide test perf tools: Rework prologue generation code perf bpf: Convert legacy map definition to BTF-defined ...
2022-08-06 09:36:08 -07:00 · 2022-08-06 09:36:08 -07:00 · 48a577dc1b
parent 033a94412b bb8bc52e75
commit 48a577dc1b
390 changed files with 98561 additions and 12149 deletions
--- a/tools/arch/x86/include/asm/amd-ibs.h
+++ b/tools/arch/x86/include/asm/amd-ibs.h
@ -29,7 +29,10 @@ union ibs_fetch_ctl {
 			rand_en:1,	/* 57: random tagging enable */
 			fetch_l2_miss:1,/* 58: L2 miss for sampled fetch
 					 *      (needs IbsFetchComp) */
-			reserved:5;	/* 59-63: reserved */
+			l3_miss_only:1,	/* 59: Collect L3 miss samples only */
+			fetch_oc_miss:1,/* 60: Op cache miss for the sampled fetch */
+			fetch_l3_miss:1,/* 61: L3 cache miss for the sampled fetch */
+			reserved:2;	/* 62-63: reserved */
 	};
 };

@ -38,14 +41,14 @@ union ibs_op_ctl {
 	__u64 val;
 	struct {
 		__u64	opmaxcnt:16,	/* 0-15: periodic op max. count */
-			reserved0:1,	/* 16: reserved */
+			l3_miss_only:1,	/* 16: Collect L3 miss samples only */
 			op_en:1,	/* 17: op sampling enable */
 			op_val:1,	/* 18: op sample valid */
 			cnt_ctl:1,	/* 19: periodic op counter control */
 			opmaxcnt_ext:7,	/* 20-26: upper 7 bits of periodic op maximum count */
-			reserved1:5,	/* 27-31: reserved */
+			reserved0:5,	/* 27-31: reserved */
 			opcurcnt:27,	/* 32-58: periodic op counter current count */
-			reserved2:5;	/* 59-63: reserved */
+			reserved1:5;	/* 59-63: reserved */
 	};
 };

@ -71,11 +74,12 @@ union ibs_op_data {
 union ibs_op_data2 {
 	__u64 val;
 	struct {
-		__u64	data_src:3,	/* 0-2: data source */
+		__u64	data_src_lo:3,	/* 0-2: data source low */
 			reserved0:1,	/* 3: reserved */
 			rmt_node:1,	/* 4: destination node */
 			cache_hit_st:1,	/* 5: cache hit state */
-			reserved1:57;	/* 5-63: reserved */
+			data_src_hi:2,	/* 6-7: data source high */
+			reserved1:56;	/* 8-63: reserved */
 	};
 };

--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@ -34,8 +34,8 @@ else
 endif

 FEATURE_USER = .bpf
-FEATURE_TESTS = libbfd disassembler-four-args
-FEATURE_DISPLAY = libbfd disassembler-four-args
+FEATURE_TESTS = libbfd disassembler-four-args disassembler-init-styled
+FEATURE_DISPLAY = libbfd

 check_feat := 1
 NON_CHECK_FEAT_TARGETS := clean bpftool_clean runqslower_clean resolve_btfids_clean
@ -56,6 +56,9 @@ endif
 ifeq ($(feature-disassembler-four-args), 1)
 CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
 endif
+ifeq ($(feature-disassembler-init-styled), 1)
+CFLAGS += -DDISASM_INIT_STYLED
+endif

 $(OUTPUT)%.yacc.c: $(srctree)/tools/bpf/%.y
 	$(QUIET_BISON)$(YACC) -o $@ -d $<
--- a/tools/bpf/bpf_jit_disasm.c
+++ b/tools/bpf/bpf_jit_disasm.c
@ -28,6 +28,7 @@
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <limits.h>
+#include <tools/dis-asm-compat.h>

 #define CMD_ACTION_SIZE_BUFFER		10
 #define CMD_ACTION_READ_ALL		3
@ -64,7 +65,9 @@ static void get_asm_insns(uint8_t *image, size_t len, int opcodes)
 	assert(bfdf);
 	assert(bfd_check_format(bfdf, bfd_object));

-	init_disassemble_info(&info, stdout, (fprintf_ftype) fprintf);
+	init_disassemble_info_compat(&info, stdout,
+				     (fprintf_ftype) fprintf,
+				     fprintf_styled);
 	info.arch = bfd_get_arch(bfdf);
 	info.mach = bfd_get_mach(bfdf);
 	info.buffer = image;
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@ -93,8 +93,9 @@ INSTALL ?= install
 RM ?= rm -f

 FEATURE_USER = .bpftool
-FEATURE_TESTS = libbfd disassembler-four-args libcap clang-bpf-co-re
-FEATURE_DISPLAY = libbfd disassembler-four-args libcap clang-bpf-co-re
+FEATURE_TESTS = libbfd disassembler-four-args disassembler-init-styled libcap \
+	clang-bpf-co-re
+FEATURE_DISPLAY = libbfd libcap clang-bpf-co-re

 check_feat := 1
 NON_CHECK_FEAT_TARGETS := clean uninstall doc doc-clean doc-install doc-uninstall
@ -115,6 +116,9 @@ endif
 ifeq ($(feature-disassembler-four-args), 1)
 CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
 endif
+ifeq ($(feature-disassembler-init-styled), 1)
+    CFLAGS += -DDISASM_INIT_STYLED
+endif

 LIBS = $(LIBBPF) -lelf -lz
 LIBS_BOOTSTRAP = $(LIBBPF_BOOTSTRAP) -lelf -lz
--- a/tools/bpf/bpftool/jit_disasm.c
+++ b/tools/bpf/bpftool/jit_disasm.c
@ -24,6 +24,7 @@
 #include <sys/stat.h>
 #include <limits.h>
 #include <bpf/libbpf.h>
+#include <tools/dis-asm-compat.h>

 #include "json_writer.h"
 #include "main.h"
@ -39,15 +40,12 @@ static void get_exec_path(char *tpath, size_t size)
 }

 static int oper_count;
-static int fprintf_json(void *out, const char *fmt, ...)
+static int printf_json(void *out, const char *fmt, va_list ap)
 {
-	va_list ap;
 	char *s;
 	int err;

-	va_start(ap, fmt);
 	err = vasprintf(&s, fmt, ap);
-	va_end(ap);
 	if (err < 0)
 		return -1;

@ -73,6 +71,32 @@ static int fprintf_json(void *out, const char *fmt, ...)
 	return 0;
 }

+static int fprintf_json(void *out, const char *fmt, ...)
+{
+	va_list ap;
+	int r;
+
+	va_start(ap, fmt);
+	r = printf_json(out, fmt, ap);
+	va_end(ap);
+
+	return r;
+}
+
+static int fprintf_json_styled(void *out,
+			       enum disassembler_style style __maybe_unused,
+			       const char *fmt, ...)
+{
+	va_list ap;
+	int r;
+
+	va_start(ap, fmt);
+	r = printf_json(out, fmt, ap);
+	va_end(ap);
+
+	return r;
+}
+
 void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes,
 		       const char *arch, const char *disassembler_options,
 		       const struct btf *btf,
@ -99,11 +123,13 @@ void disasm_print_insn(unsigned char *image, ssize_t len, int opcodes,
 	assert(bfd_check_format(bfdf, bfd_object));

 	if (json_output)
-		init_disassemble_info(&info, stdout,
-				      (fprintf_ftype) fprintf_json);
+		init_disassemble_info_compat(&info, stdout,
+					     (fprintf_ftype) fprintf_json,
+					     fprintf_json_styled);
 	else
-		init_disassemble_info(&info, stdout,
-				      (fprintf_ftype) fprintf);
+		init_disassemble_info_compat(&info, stdout,
+					     (fprintf_ftype) fprintf,
+					     fprintf_styled);

 	/* Update architecture info for offload. */
 	if (arch) {
--- a/tools/build/Makefile.feature
+++ b/tools/build/Makefile.feature
@ -70,6 +70,7 @@ FEATURE_TESTS_BASIC :=                  \
        libaio				\
        libzstd				\
        disassembler-four-args		\
+        disassembler-init-styled	\
        file-handle

 # FEATURE_TESTS_BASIC + FEATURE_TESTS_EXTRA is the complete list
@ -134,8 +135,7 @@ FEATURE_DISPLAY ?=              \
         get_cpuid              \
         bpf			\
         libaio			\
-         libzstd		\
-         disassembler-four-args
+         libzstd

 # Set FEATURE_CHECK_(C|LD)FLAGS-all for all FEATURE_TESTS features.
 # If in the future we need per-feature checks/flags for features not
--- a/tools/build/feature/Makefile
+++ b/tools/build/feature/Makefile
@ -18,6 +18,7 @@ FILES=                                          \
         test-libbfd.bin                        \
         test-libbfd-buildid.bin		\
         test-disassembler-four-args.bin        \
+         test-disassembler-init-styled.bin	\
         test-reallocarray.bin			\
         test-libbfd-liberty.bin                \
         test-libbfd-liberty-z.bin              \
@ -248,6 +249,9 @@ $(OUTPUT)test-libbfd-buildid.bin:
 $(OUTPUT)test-disassembler-four-args.bin:
 	$(BUILD) -DPACKAGE='"perf"' -lbfd -lopcodes

+$(OUTPUT)test-disassembler-init-styled.bin:
+	$(BUILD) -DPACKAGE='"perf"' -lbfd -lopcodes
+
 $(OUTPUT)test-reallocarray.bin:
 	$(BUILD)

--- a/tools/build/feature/test-all.c
+++ b/tools/build/feature/test-all.c
@ -166,6 +166,10 @@
 # include "test-disassembler-four-args.c"
 #undef main

+#define main main_test_disassembler_init_styled
+# include "test-disassembler-init-styled.c"
+#undef main
+
 #define main main_test_libzstd
 # include "test-libzstd.c"
 #undef main
--- a/tools/build/feature/test-disassembler-init-styled.c
+++ b/tools/build/feature/test-disassembler-init-styled.c
@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <dis-asm.h>
+
+int main(void)
+{
+	struct disassemble_info info;
+
+	init_disassemble_info(&info, stdout,
+			      NULL, NULL);
+
+	return 0;
+}
--- a/tools/build/feature/test-libcrypto.c
+++ b/tools/build/feature/test-libcrypto.c
@ -2,6 +2,12 @@
 #include <openssl/sha.h>
 #include <openssl/md5.h>

+/*
+ * The MD5_* API have been deprecated since OpenSSL 3.0, which causes the
+ * feature test to fail silently. This is a workaround.
+ */
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+
 int main(void)
 {
 	MD5_CTX context;
--- a/tools/include/linux/list.h
+++ b/tools/include/linux/list.h
@ -384,6 +384,17 @@ static inline void list_splice_tail_init(struct list_head *list,
 #define list_first_entry_or_null(ptr, type, member) \
 	(!list_empty(ptr) ? list_first_entry(ptr, type, member) : NULL)

+/**
+ * list_last_entry_or_null - get the last element from a list
+ * @ptr:       the list head to take the element from.
+ * @type:      the type of the struct this is embedded in.
+ * @member:    the name of the list_head within the struct.
+ *
+ * Note that if the list is empty, it returns NULL.
+ */
+#define list_last_entry_or_null(ptr, type, member) \
+	(!list_empty(ptr) ? list_last_entry(ptr, type, member) : NULL)
+
 /**
 * list_next_entry - get the next element in list
 * @pos:	the type * to cursor
--- a/tools/include/tools/dis-asm-compat.h
+++ b/tools/include/tools/dis-asm-compat.h
@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause */
+#ifndef _TOOLS_DIS_ASM_COMPAT_H
+#define _TOOLS_DIS_ASM_COMPAT_H
+
+#include <stdio.h>
+#include <dis-asm.h>
+
+/* define types for older binutils version, to centralize ifdef'ery a bit */
+#ifndef DISASM_INIT_STYLED
+enum disassembler_style {DISASSEMBLER_STYLE_NOT_EMPTY};
+typedef int (*fprintf_styled_ftype) (void *, enum disassembler_style, const char*, ...);
+#endif
+
+/*
+ * Trivial fprintf wrapper to be used as the fprintf_styled_func argument to
+ * init_disassemble_info_compat() when normal fprintf suffices.
+ */
+static inline int fprintf_styled(void *out,
+				 enum disassembler_style style,
+				 const char *fmt, ...)
+{
+	va_list args;
+	int r;
+
+	(void)style;
+
+	va_start(args, fmt);
+	r = vfprintf(out, fmt, args);
+	va_end(args);
+
+	return r;
+}
+
+/*
+ * Wrapper for init_disassemble_info() that hides version
+ * differences. Depending on binutils version and architecture either
+ * fprintf_func or fprintf_styled_func will be called.
+ */
+static inline void init_disassemble_info_compat(struct disassemble_info *info,
+						void *stream,
+						fprintf_ftype unstyled_func,
+						fprintf_styled_ftype styled_func)
+{
+#ifdef DISASM_INIT_STYLED
+	init_disassemble_info(info, stream,
+			      unstyled_func,
+			      styled_func);
+#else
+	(void)styled_func;
+	init_disassemble_info(info, stream,
+			      unstyled_func);
+#endif
+}
+
+#endif /* _TOOLS_DIS_ASM_COMPAT_H */
--- a/tools/lib/perf/include/internal/evsel.h
+++ b/tools/lib/perf/include/internal/evsel.h
@ -30,6 +30,10 @@ struct perf_sample_id {
 	struct perf_cpu		 cpu;
 	pid_t			 tid;

+	/* Guest machine pid and VCPU, valid only if machine_pid is non-zero */
+	pid_t			 machine_pid;
+	struct perf_cpu		 vcpu;
+
 	/* Holds total ID period value for PERF_SAMPLE_READ processing. */
 	u64			 period;
 };
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@ -237,6 +237,11 @@ struct id_index_entry {
 	__u64			 tid;
 };

+struct id_index_entry_2 {
+	__u64			 machine_pid;
+	__u64			 vcpu;
+};
+
 struct perf_record_id_index {
 	struct perf_event_header header;
 	__u64			 nr;
@ -274,6 +279,8 @@ struct perf_record_auxtrace_error {
 	__u64			 ip;
 	__u64			 time;
 	char			 msg[MAX_AUXTRACE_ERROR_MSG];
+	__u32			 machine_pid;
+	__u32			 vcpu;
 };

 struct perf_record_aux {
@ -389,6 +396,7 @@ enum perf_user_event_type { /* above any possible kernel type */
 	PERF_RECORD_TIME_CONV			= 79,
 	PERF_RECORD_HEADER_FEATURE		= 80,
 	PERF_RECORD_COMPRESSED			= 81,
+	PERF_RECORD_FINISHED_INIT		= 82,
 	PERF_RECORD_HEADER_MAX
 };

--- a/tools/perf/Build
+++ b/tools/perf/Build
@ -25,6 +25,7 @@ perf-y += builtin-data.o
 perf-y += builtin-version.o
 perf-y += builtin-c2c.o
 perf-y += builtin-daemon.o
+perf-y += builtin-kwork.o

 perf-$(CONFIG_TRACE) += builtin-trace.o
 perf-$(CONFIG_LIBELF) += builtin-probe.o
--- a/tools/perf/Documentation/perf-buildid-list.txt
+++ b/tools/perf/Documentation/perf-buildid-list.txt
@ -33,6 +33,10 @@ OPTIONS
 -k::
 --kernel::
 	Show running kernel build id.
+-m::
+--kernel-maps::
+	Show buildid, start/end text address, and path of running kernel and
+	its modules.
 -v::
 --verbose::
 	Be more verbose.
--- a/tools/perf/Documentation/perf-dlfilter.txt
+++ b/tools/perf/Documentation/perf-dlfilter.txt
@ -107,9 +107,31 @@ struct perf_dlfilter_sample {
 	__u64 raw_callchain_nr;	/* Number of raw_callchain entries */
 	const __u64 *raw_callchain; /* Refer <linux/perf_event.h> */
 	const char *event;
+	__s32 machine_pid;
+	__s32 vcpu;
 };
 ----

+Note: 'machine_pid' and 'vcpu' are not original members, but were added together later.
+'size' can be used to determine their presence at run time.
+PERF_DLFILTER_HAS_MACHINE_PID will be defined if they are present at compile time.
+For example:
+[source,c]
+----
+#include <perf/perf_dlfilter.h>
+#include <stddef.h>
+#include <stdbool.h>
+
+static inline bool have_machine_pid(const struct perf_dlfilter_sample *sample)
+{
+#ifdef PERF_DLFILTER_HAS_MACHINE_PID
+	return sample->size >= offsetof(struct perf_dlfilter_sample, vcpu) + sizeof(sample->vcpu);
+#else
+	return false;
+#endif
+}
+----
+
 The perf_dlfilter_fns structure
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- a/tools/perf/Documentation/perf-inject.txt
+++ b/tools/perf/Documentation/perf-inject.txt
@ -85,6 +85,23 @@ include::itrace.txt[]
 	without updating it. Currently this option is supported only by
 	Intel PT, refer linkperf:perf-intel-pt[1]

+--guest-data=<path>,<pid>[,<time offset>[,<time scale>]]::
+	Insert events from a perf.data file recorded in a virtual machine at
+	the same time as the input perf.data file was recorded on the host.
+	The Process ID (PID) of the QEMU hypervisor process must be provided,
+	and the time offset and time scale (multiplier) will likely be needed
+	to convert guest time stamps into host time stamps. For example, for
+	x86 the TSC Offset and Multiplier could be provided for a virtual machine
+	using Linux command line option no-kvmclock.
+	Currently only mmap, mmap2, comm, task, context_switch, ksymbol,
+	and text_poke events are inserted, as well as build ID information.
+	The QEMU option -name debug-threads=on is needed so that thread names
+	can be used to determine which thread is running which VCPU. Note
+	libvirt seems to use this by default.
+	When using perf record in the guest, option --sample-identifier
+	should be used, and also --buildid-all and --switch-events may be
+	useful.
+
 SEE ALSO
 --------
 linkperf:perf-record[1], linkperf:perf-report[1], linkperf:perf-archive[1],
--- a/tools/perf/Documentation/perf-intel-pt.txt
+++ b/tools/perf/Documentation/perf-intel-pt.txt
@ -267,7 +267,7 @@ Note that, as with all events, the event is suffixed with event modifiers:
 	H	host
 	p	precise ip

-'h', 'G' and 'H' are for virtualization which is not supported by Intel PT.
+'h', 'G' and 'H' are for virtualization which are not used by Intel PT.
 'p' is also not relevant to Intel PT.  So only options 'u' and 'k' are
 meaningful for Intel PT.

@ -1218,10 +1218,10 @@ XED
 include::build-xed.txt[]


-Tracing Virtual Machines
------------------------
+Tracing Virtual Machines (kernel only)
+--------------------------------------

-Currently, only kernel tracing is supported and only with either "timeless" decoding
+Currently, kernel tracing is supported with either "timeless" decoding
 (i.e. no TSC timestamps) or VM Time Correlation. VM Time Correlation is an extra step
 using 'perf inject' and requires unchanging VMX TSC Offset and no VMX TSC Scaling.

@ -1400,6 +1400,179 @@ There were none.
          :17006 17006 [001] 11500.262869216:  ffffffff8220116e error_entry+0xe ([guest.kernel.kallsyms])               pushq  %rax


+Tracing Virtual Machines (including user space)
+-----------------------------------------------
+
+It is possible to use perf record to record sideband events within a virtual machine, so that an Intel PT trace on the host can be decoded.
+Sideband events from the guest perf.data file can be injected into the host perf.data file using perf inject.
+
+Here is an example of the steps needed:
+
+On the guest machine:
+
+Check that no-kvmclock kernel command line option was used to boot:
+
+Note, this is essential to enable time correlation between host and guest machines.
+
+ $ cat /proc/cmdline
+ BOOT_IMAGE=/boot/vmlinuz-5.10.0-16-amd64 root=UUID=cb49c910-e573-47e0-bce7-79e293df8e1d ro no-kvmclock
+
+There is no BPF support at present so, if possible, disable JIT compiling:
+
+ $ echo 0 | sudo tee /proc/sys/net/core/bpf_jit_enable
+ 0
+
+Start perf record to collect sideband events:
+
+ $ sudo perf record -o guest-sideband-testing-guest-perf.data --sample-identifier --buildid-all --switch-events --kcore -a -e dummy
+
+On the host machine:
+
+Start perf record to collect Intel PT trace:
+
+Note, the host trace will get very big, very fast, so the steps from starting to stopping the host trace really need to be done so that they happen in the shortest time possible.
+
+ $ sudo perf record -o guest-sideband-testing-host-perf.data -m,64M --kcore -a -e intel_pt/cyc/
+
+On the guest machine:
+
+Run a small test case, just 'uname' in this example:
+
+ $ uname
+ Linux
+
+On the host machine:
+
+Stop the Intel PT trace:
+
+ ^C
+ [ perf record: Woken up 1 times to write data ]
+ [ perf record: Captured and wrote 76.122 MB guest-sideband-testing-host-perf.data ]
+
+On the guest machine:
+
+Stop the Intel PT trace:
+
+ ^C
+ [ perf record: Woken up 1 times to write data ]
+ [ perf record: Captured and wrote 1.247 MB guest-sideband-testing-guest-perf.data ]
+
+And then copy guest-sideband-testing-guest-perf.data to the host (not shown here).
+
+On the host machine:
+
+With the 2 perf.data recordings, and with their ownership changed to the user.
+
+Identify the TSC Offset:
+
+ $ perf inject -i guest-sideband-testing-host-perf.data --vm-time-correlation=dry-run
+ VMCS: 0x103fc6  TSC Offset 0xfffffa6ae070cb20
+ VMCS: 0x103ff2  TSC Offset 0xfffffa6ae070cb20
+ VMCS: 0x10fdaa  TSC Offset 0xfffffa6ae070cb20
+ VMCS: 0x24d57c  TSC Offset 0xfffffa6ae070cb20
+
+Correct Intel PT TSC timestamps for the guest machine:
+
+ $ perf inject -i guest-sideband-testing-host-perf.data --vm-time-correlation=0xfffffa6ae070cb20 --force
+
+Identify the guest machine PID:
+
+ $ perf script -i guest-sideband-testing-host-perf.data --no-itrace --show-task-events | grep KVM
+       CPU 0/KVM     0 [000]     0.000000: PERF_RECORD_COMM: CPU 0/KVM:13376/13381
+       CPU 1/KVM     0 [000]     0.000000: PERF_RECORD_COMM: CPU 1/KVM:13376/13382
+       CPU 2/KVM     0 [000]     0.000000: PERF_RECORD_COMM: CPU 2/KVM:13376/13383
+       CPU 3/KVM     0 [000]     0.000000: PERF_RECORD_COMM: CPU 3/KVM:13376/13384
+
+Note, the QEMU option -name debug-threads=on is needed so that thread names
+can be used to determine which thread is running which VCPU as above. libvirt seems to use this by default.
+
+Create a guestmount, assuming the guest machine is 'vm_to_test':
+
+ $ mkdir -p ~/guestmount/13376
+ $ sshfs -o direct_io vm_to_test:/ ~/guestmount/13376
+
+Inject the guest perf.data file into the host perf.data file:
+
+Note, due to the guestmount option, guest object files and debug files will be copied into the build ID cache from the guest machine, with the notable exception of VDSO.
+If needed, VDSO can be copied manually in a fashion similar to that used by the perf-archive script.
+
+ $ perf inject -i guest-sideband-testing-host-perf.data -o inj --guestmount ~/guestmount --guest-data=guest-sideband-testing-guest-perf.data,13376,0xfffffa6ae070cb20
+
+Show an excerpt from the result.  In this case the CPU and time range have been to chosen to show interaction between guest and host when 'uname' is starting to run on the guest machine:
+
+Notes:
+
+	- the CPU displayed, [002] in this case, is always the host CPU
+	- events happening in the virtual machine start with VM:13376 VCPU:003, which shows the hypervisor PID 13376 and the VCPU number
+	- only calls and errors are displayed i.e. --itrace=ce
+	- branches entering and exiting the virtual machine are split, and show as 2 branches to/from "0 [unknown] ([unknown])"
+
+ $ perf script -i inj --itrace=ce -F+machine_pid,+vcpu,+addr,+pid,+tid,-period --ns --time 7919.408803365,7919.408804631 -C 2
+       CPU 3/KVM 13376/13384 [002]  7919.408803365:      branches:  ffffffffc0f8ebe0 vmx_vcpu_enter_exit+0xc0 ([kernel.kallsyms]) => ffffffffc0f8edc0 __vmx_vcpu_run+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803365:      branches:  ffffffffc0f8edd5 __vmx_vcpu_run+0x15 ([kernel.kallsyms]) => ffffffffc0f8eca0 vmx_update_host_rsp+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803365:      branches:  ffffffffc0f8ee1b __vmx_vcpu_run+0x5b ([kernel.kallsyms]) => ffffffffc0f8ed60 vmx_vmenter+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803461:      branches:  ffffffffc0f8ed62 vmx_vmenter+0x2 ([kernel.kallsyms]) =>                0 [unknown] ([unknown])
+ VM:13376 VCPU:003            uname  3404/3404  [002]  7919.408803461:      branches:                 0 [unknown] ([unknown]) =>     7f851c9b5a5c init_cacheinfo+0x3ac (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
+ VM:13376 VCPU:003            uname  3404/3404  [002]  7919.408803567:      branches:      7f851c9b5a5a init_cacheinfo+0x3aa (/usr/lib/x86_64-linux-gnu/libc-2.31.so) =>                0 [unknown] ([unknown])
+       CPU 3/KVM 13376/13384 [002]  7919.408803567:      branches:                 0 [unknown] ([unknown]) => ffffffffc0f8ed80 vmx_vmexit+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803596:      branches:  ffffffffc0f6619a vmx_vcpu_run+0x26a ([kernel.kallsyms]) => ffffffffb2255c60 x86_virt_spec_ctrl+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803801:      branches:  ffffffffc0f66445 vmx_vcpu_run+0x515 ([kernel.kallsyms]) => ffffffffb2290b30 native_write_msr+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803850:      branches:  ffffffffc0f661f8 vmx_vcpu_run+0x2c8 ([kernel.kallsyms]) => ffffffffc1092300 kvm_load_host_xsave_state+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803850:      branches:  ffffffffc1092327 kvm_load_host_xsave_state+0x27 ([kernel.kallsyms]) => ffffffffc1092220 kvm_load_host_xsave_state.part.0+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803862:      branches:  ffffffffc0f662cf vmx_vcpu_run+0x39f ([kernel.kallsyms]) => ffffffffc0f63f90 vmx_recover_nmi_blocking+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803862:      branches:  ffffffffc0f662e9 vmx_vcpu_run+0x3b9 ([kernel.kallsyms]) => ffffffffc0f619a0 __vmx_complete_interrupts+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803872:      branches:  ffffffffc109cfb2 vcpu_enter_guest+0x752 ([kernel.kallsyms]) => ffffffffc0f5f570 vmx_handle_exit_irqoff+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803881:      branches:  ffffffffc109d028 vcpu_enter_guest+0x7c8 ([kernel.kallsyms]) => ffffffffb234f900 __srcu_read_lock+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803897:      branches:  ffffffffc109d06f vcpu_enter_guest+0x80f ([kernel.kallsyms]) => ffffffffc0f72e30 vmx_handle_exit+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803897:      branches:  ffffffffc0f72e3d vmx_handle_exit+0xd ([kernel.kallsyms]) => ffffffffc0f727c0 __vmx_handle_exit+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803897:      branches:  ffffffffc0f72b15 __vmx_handle_exit+0x355 ([kernel.kallsyms]) => ffffffffc0f60ae0 vmx_flush_pml_buffer+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803903:      branches:  ffffffffc0f72994 __vmx_handle_exit+0x1d4 ([kernel.kallsyms]) => ffffffffc10b7090 kvm_emulate_cpuid+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803903:      branches:  ffffffffc10b70f1 kvm_emulate_cpuid+0x61 ([kernel.kallsyms]) => ffffffffc10b6e10 kvm_cpuid+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803941:      branches:  ffffffffc10b7125 kvm_emulate_cpuid+0x95 ([kernel.kallsyms]) => ffffffffc1093110 kvm_skip_emulated_instruction+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803941:      branches:  ffffffffc109311f kvm_skip_emulated_instruction+0xf ([kernel.kallsyms]) => ffffffffc0f5e180 vmx_get_rflags+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803951:      branches:  ffffffffc109312a kvm_skip_emulated_instruction+0x1a ([kernel.kallsyms]) => ffffffffc0f5fd30 vmx_skip_emulated_instruction+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803951:      branches:  ffffffffc0f5fd79 vmx_skip_emulated_instruction+0x49 ([kernel.kallsyms]) => ffffffffc0f5fb50 skip_emulated_instruction+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803956:      branches:  ffffffffc0f5fc68 skip_emulated_instruction+0x118 ([kernel.kallsyms]) => ffffffffc0f6a940 vmx_cache_reg+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803964:      branches:  ffffffffc0f5fc11 skip_emulated_instruction+0xc1 ([kernel.kallsyms]) => ffffffffc0f5f9e0 vmx_set_interrupt_shadow+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803980:      branches:  ffffffffc109f8b1 vcpu_run+0x71 ([kernel.kallsyms]) => ffffffffc10ad2f0 kvm_cpu_has_pending_timer+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803980:      branches:  ffffffffc10ad2fb kvm_cpu_has_pending_timer+0xb ([kernel.kallsyms]) => ffffffffc10b0490 apic_has_pending_timer+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803991:      branches:  ffffffffc109f899 vcpu_run+0x59 ([kernel.kallsyms]) => ffffffffc109c860 vcpu_enter_guest+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803993:      branches:  ffffffffc109cd4c vcpu_enter_guest+0x4ec ([kernel.kallsyms]) => ffffffffc0f69140 vmx_prepare_switch_to_guest+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803996:      branches:  ffffffffc109cd7d vcpu_enter_guest+0x51d ([kernel.kallsyms]) => ffffffffb234f930 __srcu_read_unlock+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803996:      branches:  ffffffffc109cd9c vcpu_enter_guest+0x53c ([kernel.kallsyms]) => ffffffffc0f609b0 vmx_sync_pir_to_irr+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408803996:      branches:  ffffffffc0f60a6d vmx_sync_pir_to_irr+0xbd ([kernel.kallsyms]) => ffffffffc10adc20 kvm_lapic_find_highest_irr+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804010:      branches:  ffffffffc0f60abd vmx_sync_pir_to_irr+0x10d ([kernel.kallsyms]) => ffffffffc0f60820 vmx_set_rvi+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804019:      branches:  ffffffffc109ceca vcpu_enter_guest+0x66a ([kernel.kallsyms]) => ffffffffb2249840 fpregs_assert_state_consistent+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804021:      branches:  ffffffffc109cf10 vcpu_enter_guest+0x6b0 ([kernel.kallsyms]) => ffffffffc0f65f30 vmx_vcpu_run+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804024:      branches:  ffffffffc0f6603b vmx_vcpu_run+0x10b ([kernel.kallsyms]) => ffffffffb229bed0 __get_current_cr3_fast+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804024:      branches:  ffffffffc0f66055 vmx_vcpu_run+0x125 ([kernel.kallsyms]) => ffffffffb2253050 cr4_read_shadow+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804030:      branches:  ffffffffc0f6608d vmx_vcpu_run+0x15d ([kernel.kallsyms]) => ffffffffc10921e0 kvm_load_guest_xsave_state+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804030:      branches:  ffffffffc1092207 kvm_load_guest_xsave_state+0x27 ([kernel.kallsyms]) => ffffffffc1092110 kvm_load_guest_xsave_state.part.0+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804032:      branches:  ffffffffc0f660c6 vmx_vcpu_run+0x196 ([kernel.kallsyms]) => ffffffffb22061a0 perf_guest_get_msrs+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804032:      branches:  ffffffffb22061a9 perf_guest_get_msrs+0x9 ([kernel.kallsyms]) => ffffffffb220cda0 intel_guest_get_msrs+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804039:      branches:  ffffffffc0f66109 vmx_vcpu_run+0x1d9 ([kernel.kallsyms]) => ffffffffc0f652c0 clear_atomic_switch_msr+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804040:      branches:  ffffffffc0f66119 vmx_vcpu_run+0x1e9 ([kernel.kallsyms]) => ffffffffc0f73f60 intel_pmu_lbr_is_enabled+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804042:      branches:  ffffffffc0f73f81 intel_pmu_lbr_is_enabled+0x21 ([kernel.kallsyms]) => ffffffffc10b68e0 kvm_find_cpuid_entry+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804045:      branches:  ffffffffc0f66454 vmx_vcpu_run+0x524 ([kernel.kallsyms]) => ffffffffc0f61ff0 vmx_update_hv_timer+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804057:      branches:  ffffffffc0f66142 vmx_vcpu_run+0x212 ([kernel.kallsyms]) => ffffffffc10af100 kvm_wait_lapic_expire+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804057:      branches:  ffffffffc0f66156 vmx_vcpu_run+0x226 ([kernel.kallsyms]) => ffffffffb2255c60 x86_virt_spec_ctrl+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804057:      branches:  ffffffffc0f66161 vmx_vcpu_run+0x231 ([kernel.kallsyms]) => ffffffffc0f8eb20 vmx_vcpu_enter_exit+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804057:      branches:  ffffffffc0f8eb44 vmx_vcpu_enter_exit+0x24 ([kernel.kallsyms]) => ffffffffb2353e10 rcu_note_context_switch+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804057:      branches:  ffffffffb2353e1c rcu_note_context_switch+0xc ([kernel.kallsyms]) => ffffffffb2353db0 rcu_qs+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804066:      branches:  ffffffffc0f8ebe0 vmx_vcpu_enter_exit+0xc0 ([kernel.kallsyms]) => ffffffffc0f8edc0 __vmx_vcpu_run+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804066:      branches:  ffffffffc0f8edd5 __vmx_vcpu_run+0x15 ([kernel.kallsyms]) => ffffffffc0f8eca0 vmx_update_host_rsp+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804066:      branches:  ffffffffc0f8ee1b __vmx_vcpu_run+0x5b ([kernel.kallsyms]) => ffffffffc0f8ed60 vmx_vmenter+0x0 ([kernel.kallsyms])
+       CPU 3/KVM 13376/13384 [002]  7919.408804162:      branches:  ffffffffc0f8ed62 vmx_vmenter+0x2 ([kernel.kallsyms]) =>                0 [unknown] ([unknown])
+ VM:13376 VCPU:003            uname  3404/3404  [002]  7919.408804162:      branches:                 0 [unknown] ([unknown]) =>     7f851c9b5a5c init_cacheinfo+0x3ac (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
+ VM:13376 VCPU:003            uname  3404/3404  [002]  7919.408804273:      branches:      7f851cb7c0e4 _dl_init+0x74 (/usr/lib/x86_64-linux-gnu/ld-2.31.so) =>     7f851cb7bf50 call_init.part.0+0x0 (/usr/lib/x86_64-linux-gnu/ld-2.31.so)
+ VM:13376 VCPU:003            uname  3404/3404  [002]  7919.408804526:      branches:      55e0c00136f0 _start+0x0 (/usr/bin/uname) => ffffffff83200ac0 asm_exc_page_fault+0x0 ([kernel.kallsyms])
+ VM:13376 VCPU:003            uname  3404/3404  [002]  7919.408804526:      branches:  ffffffff83200ac3 asm_exc_page_fault+0x3 ([kernel.kallsyms]) => ffffffff83201290 error_entry+0x0 ([kernel.kallsyms])
+ VM:13376 VCPU:003            uname  3404/3404  [002]  7919.408804534:      branches:  ffffffff832012fa error_entry+0x6a ([kernel.kallsyms]) => ffffffff830b59a0 sync_regs+0x0 ([kernel.kallsyms])
+ VM:13376 VCPU:003            uname  3404/3404  [002]  7919.408804631:      branches:  ffffffff83200ad9 asm_exc_page_fault+0x19 ([kernel.kallsyms]) => ffffffff830b8210 exc_page_fault+0x0 ([kernel.kallsyms])
+ VM:13376 VCPU:003            uname  3404/3404  [002]  7919.408804631:      branches:  ffffffff830b82a4 exc_page_fault+0x94 ([kernel.kallsyms]) => ffffffff830b80e0 __kvm_handle_async_pf+0x0 ([kernel.kallsyms])
+ VM:13376 VCPU:003            uname  3404/3404  [002]  7919.408804631:      branches:  ffffffff830b80ed __kvm_handle_async_pf+0xd ([kernel.kallsyms]) => ffffffff830b80c0 kvm_read_and_reset_apf_flags+0x0 ([kernel.kallsyms])
+
+
 Tracing Virtual Machines - Guest Code
 -------------------------------------

--- a/tools/perf/Documentation/perf-kwork.txt
+++ b/tools/perf/Documentation/perf-kwork.txt
@ -0,0 +1,180 @@
+perf-kowrk(1)
+=============
+
+NAME
+----
+perf-kwork - Tool to trace/measure kernel work properties (latencies)
+
+SYNOPSIS
+--------
+[verse]
+'perf kwork' {record}
+
+DESCRIPTION
+-----------
+There are several variants of 'perf kwork':
+
+  'perf kwork record <command>' to record the kernel work
+  of an arbitrary workload.
+
+  'perf kwork report' to report the per kwork runtime.
+
+  'perf kwork latency' to report the per kwork latencies.
+
+  'perf kwork timehist' provides an analysis of kernel work events.
+
+    Example usage:
+        perf kwork record -- sleep 1
+        perf kwork report
+        perf kwork report -b
+        perf kwork latency
+        perf kwork latency -b
+        perf kwork timehist
+
+   By default it shows the individual work events such as irq, workqeueu,
+   including the run time and delay (time between raise and actually entry):
+
+      Runtime start      Runtime end        Cpu     Kwork name                 Runtime     Delaytime
+                                                    (TYPE)NAME:NUM             (msec)      (msec)
+   -----------------  -----------------  ------  -------------------------  ----------  ----------
+      1811186.976062     1811186.976327  [0000]  (s)RCU:9                        0.266       0.114
+      1811186.978452     1811186.978547  [0000]  (s)SCHED:7                      0.095       0.171
+      1811186.980327     1811186.980490  [0000]  (s)SCHED:7                      0.162       0.083
+      1811186.981221     1811186.981271  [0000]  (s)SCHED:7                      0.050       0.077
+      1811186.984267     1811186.984318  [0000]  (s)SCHED:7                      0.051       0.075
+      1811186.987252     1811186.987315  [0000]  (s)SCHED:7                      0.063       0.081
+      1811186.987785     1811186.987843  [0006]  (s)RCU:9                        0.058       0.645
+      1811186.988319     1811186.988383  [0000]  (s)SCHED:7                      0.064       0.143
+      1811186.989404     1811186.989607  [0002]  (s)TIMER:1                      0.203       0.111
+      1811186.989660     1811186.989732  [0002]  (s)SCHED:7                      0.072       0.310
+      1811186.991295     1811186.991407  [0002]  eth0:10                         0.112
+      1811186.991639     1811186.991734  [0002]  (s)NET_RX:3                     0.095       0.277
+      1811186.989860     1811186.991826  [0002]  (w)vmstat_shepherd              1.966       0.345
+    ...
+
+   Times are in msec.usec.
+
+OPTIONS
+-------
+-D::
+--dump-raw-trace=::
+	Display verbose dump of the sched data.
+
+-f::
+--force::
+	Don't complain, do it.
+
+-k::
+--kwork::
+	List of kwork to profile (irq, softirq, workqueue, etc)
+
+-v::
+--verbose::
+	Be more verbose. (show symbol address, etc)
+
+OPTIONS for 'perf kwork report'
+----------------------------
+
+-b::
+--use-bpf::
+	Use BPF to measure kwork runtime
+
+-C::
+--cpu::
+	Only show events for the given CPU(s) (comma separated list).
+
+-i::
+--input::
+	Input file name. (default: perf.data unless stdin is a fifo)
+
+-n::
+--name::
+	Only show events for the given name.
+
+-s::
+--sort::
+	Sort by key(s): runtime, max, count
+
+-S::
+--with-summary::
+	Show summary with statistics
+
+--time::
+	Only analyze samples within given time window: <start>,<stop>. Times
+	have the format seconds.microseconds. If start is not given (i.e., time
+	string is ',x.y') then analysis starts at the beginning of the file. If
+	stop time is not given (i.e, time string is 'x.y,') then analysis goes
+	to end of file.
+
+OPTIONS for 'perf kwork latency'
+----------------------------
+
+-b::
+--use-bpf::
+	Use BPF to measure kwork latency
+
+-C::
+--cpu::
+	Only show events for the given CPU(s) (comma separated list).
+
+-i::
+--input::
+	Input file name. (default: perf.data unless stdin is a fifo)
+
+-n::
+--name::
+	Only show events for the given name.
+
+-s::
+--sort::
+	Sort by key(s): avg, max, count
+
+--time::
+	Only analyze samples within given time window: <start>,<stop>. Times
+	have the format seconds.microseconds. If start is not given (i.e., time
+	string is ',x.y') then analysis starts at the beginning of the file. If
+	stop time is not given (i.e, time string is 'x.y,') then analysis goes
+	to end of file.
+
+OPTIONS for 'perf kwork timehist'
+---------------------------------
+
+-C::
+--cpu::
+	Only show events for the given CPU(s) (comma separated list).
+
+-g::
+--call-graph::
+	Display call chains if present (default off).
+
+-i::
+--input::
+	Input file name. (default: perf.data unless stdin is a fifo)
+
+-k::
+--vmlinux=<file>::
+	Vmlinux pathname
+
+-n::
+--name::
+	Only show events for the given name.
+
+--kallsyms=<file>::
+	Kallsyms pathname
+
+--max-stack::
+	Maximum number of functions to display in backtrace, default 5.
+
+--symfs=<directory>::
+    Look for files with symbols relative to this directory.
+
+--time::
+	Only analyze samples within given time window: <start>,<stop>. Times
+	have the format seconds.microseconds. If start is not given (i.e., time
+	string is ',x.y') then analysis starts at the beginning of the file. If
+	stop time is not given (i.e, time string is 'x.y,') then analysis goes
+	to end of file.
+
+SEE ALSO
+--------
+linkperf:perf-record[1]
--- a/tools/perf/Documentation/perf-lock.txt
+++ b/tools/perf/Documentation/perf-lock.txt
@ -8,7 +8,7 @@ perf-lock - Analyze lock events
 SYNOPSIS
 --------
 [verse]
-'perf lock' {record|report|script|info}
+'perf lock' {record|report|script|info|contention}

 DESCRIPTION
 -----------
@ -27,6 +27,8 @@ and statistics with this 'perf lock' command.
  'perf lock info' shows metadata like threads or addresses
  of lock instances.

+  'perf lock contention' shows contention statistics.
+
 COMMON OPTIONS
 --------------

@ -46,6 +48,13 @@ COMMON OPTIONS
 --force::
 	Don't complain, do it.

+--vmlinux=<file>::
+        vmlinux pathname
+
+--kallsyms=<file>::
+        kallsyms pathname
+
+
 REPORT OPTIONS
 --------------

@ -96,6 +105,50 @@ INFO OPTIONS
 --map::
 	dump map of lock instances (address:name table)

+CONTENTION OPTIONS
+--------------
+
+-k::
+--key=<value>::
+	Sorting key. Possible values: contended, wait_total (default),
+	wait_max, wait_min, avg_wait.
+
+-F::
+--field=<value>::
+	Output fields. By default it shows all but the wait_min fields
+	and users can customize that using this.  Possible values:
+	contended, wait_total, wait_max, wait_min, avg_wait.
+
+-t::
+--threads::
+	Show per-thread lock contention stat
+
+-b::
+--use-bpf::
+	Use BPF program to collect lock contention stats instead of
+	using the input data.
+
+-a::
+--all-cpus::
+        System-wide collection from all CPUs.
+
+-C::
+--cpu::
+	Collect samples only on the list of CPUs provided. Multiple CPUs can be
+	provided as a comma-separated list with no space: 0,1. Ranges of CPUs
+	are specified with -: 0-2.  Default is to monitor all CPUs.
+
+-p::
+--pid=::
+	Record events on existing process ID (comma separated list).
+
+--tid=::
+        Record events on existing thread ID (comma separated list).
+
+--map-nr-entries::
+	Maximum number of BPF map entries (default: 10240).
+
+
 SEE ALSO
 --------
 linkperf:perf[1]
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@ -275,6 +275,11 @@ OPTIONS
 	User can change the size by passing the size after comma like
 	"--call-graph dwarf,4096".

+	When "fp" recording is used, perf tries to save stack enties
+	up to the number specified in sysctl.kernel.perf_event_max_stack
+	by default.  User can change the number by passing it after comma
+	like "--call-graph fp,32".
+
 -q::
 --quiet::
 	Don't print any message, useful for scripting.
@ -313,6 +318,11 @@ OPTIONS
 --sample-cpu::
 	Record the sample cpu.

+--sample-identifier::
+	Record the sample identifier i.e. PERF_SAMPLE_IDENTIFIER bit set in
+	the sample_type member of the struct perf_event_attr argument to the
+	perf_event_open system call.
+
 -n::
 --no-samples::
 	Don't sample.
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@ -79,6 +79,9 @@ OPTIONS
 --dump-raw-trace=::
        Display verbose dump of the trace data.

+--dump-unsorted-raw-trace=::
+        Same as --dump-raw-trace but not sorted in time order.
+
 -L::
 --Latency=::
        Show latency attributes (irqs/preemption disabled, etc).
@ -130,7 +133,8 @@ OPTIONS
        comm, tid, pid, time, cpu, event, trace, ip, sym, dso, addr, symoff,
        srcline, period, iregs, uregs, brstack, brstacksym, flags, bpf-output,
        brstackinsn, brstackinsnlen, brstackoff, callindent, insn, insnlen, synth,
-        phys_addr, metric, misc, srccode, ipc, data_page_size, code_page_size, ins_lat.
+        phys_addr, metric, misc, srccode, ipc, data_page_size, code_page_size, ins_lat,
+        machine_pid, vcpu.
        Field list can be prepended with the type, trace, sw or hw,
        to indicate to which event type the field list applies.
        e.g., -F sw:comm,tid,time,ip,sym  and -F trace:time,cpu,trace
@ -223,6 +227,10 @@ OPTIONS
 	The ipc (instructions per cycle) field is synthesized and may have a value when
 	Instruction Trace decoding.

+	The machine_pid and vcpu fields are derived from data resulting from using
+	perf insert to insert a perf.data file recorded inside a virtual machine into
+	a perf.data file recorded on the host at the same time.
+
 	Finally, a user may not set fields to none for all event types.
 	i.e., -F "" is not allowed.

--- a/tools/perf/Documentation/perf.data-file-format.txt
+++ b/tools/perf/Documentation/perf.data-file-format.txt
@ -419,18 +419,20 @@ Example:
  cpu_core cpu list : 0-15
  cpu_atom cpu list : 16-23

-	HEADER_HYBRID_CPU_PMU_CAPS = 31,
+	HEADER_PMU_CAPS = 31,

-	A list of hybrid CPU PMU capabilities.
+	List of pmu capabilities (except cpu pmu which is already
+	covered by HEADER_CPU_PMU_CAPS). Note that hybrid cpu pmu
+	capabilities are also stored here.

 struct {
 	u32 nr_pmu;
 	struct {
-		u32 nr_cpu_pmu_caps;
+		u32 nr_caps;
 		{
 			char	name[];
 			char	value[];
-		} [nr_cpu_pmu_caps];
+		} [nr_caps];
 		char pmu_name[];
 	} [nr_pmu];
 };
@ -607,6 +609,16 @@ struct compressed_event {
 	char				data[];
 };

+	PERF_RECORD_FINISHED_INIT			= 82,
+
+Marks the end of records for the system, pre-existing threads in system wide
+sessions, etc. Those are the ones prefixed PERF_RECORD_USER_*.
+
+This is used, for instance, to 'perf inject' events after init and before
+regular events, those emitted by the kernel, to support combining guest and
+host records.
+
+
 The header is followed by compressed data frame that can be decompressed
 into array of perf trace records. The size of the entire compressed event
 record including the header is limited by the max value of header.size.
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@ -241,15 +241,15 @@ endif
 # Try different combinations to accommodate systems that only have
 # python[2][3]-config in weird combinations in the following order of
 # priority from lowest to highest:
-#   * python3-config
-#   * python-config
 #   * python2-config as per pep-0394.
+#   * python-config
+#   * python3-config
 #   * $(PYTHON)-config (If PYTHON is user supplied but PYTHON_CONFIG isn't)
 #
 PYTHON_AUTO := python-config
-PYTHON_AUTO := $(if $(call get-executable,python3-config),python3-config,$(PYTHON_AUTO))
-PYTHON_AUTO := $(if $(call get-executable,python-config),python-config,$(PYTHON_AUTO))
 PYTHON_AUTO := $(if $(call get-executable,python2-config),python2-config,$(PYTHON_AUTO))
+PYTHON_AUTO := $(if $(call get-executable,python-config),python-config,$(PYTHON_AUTO))
+PYTHON_AUTO := $(if $(call get-executable,python3-config),python3-config,$(PYTHON_AUTO))

 # If PYTHON is defined but PYTHON_CONFIG isn't, then take $(PYTHON)-config as if it was the user
 # supplied value for PYTHON_CONFIG. Because it's "user supplied", error out if it doesn't exist.
@ -298,6 +298,7 @@ FEATURE_CHECK_LDFLAGS-libpython := $(PYTHON_EMBED_LDOPTS)
 FEATURE_CHECK_LDFLAGS-libaio = -lrt

 FEATURE_CHECK_LDFLAGS-disassembler-four-args = -lbfd -lopcodes -ldl
+FEATURE_CHECK_LDFLAGS-disassembler-init-styled = -lbfd -lopcodes -ldl

 CORE_CFLAGS += -fno-omit-frame-pointer
 CORE_CFLAGS += -ggdb3
@ -342,7 +343,7 @@ endif

 ifeq ($(DEBUG),0)
  ifeq ($(feature-fortify-source), 1)
-    CORE_CFLAGS += -D_FORTIFY_SOURCE=2
+    CORE_CFLAGS += -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2
  endif
 endif

@ -889,6 +890,25 @@ else
  endif
 endif

+ifneq ($(NO_JEVENTS),1)
+  ifeq ($(wildcard pmu-events/arch/$(SRCARCH)/mapfile.csv),)
+    NO_JEVENTS := 1
+  endif
+endif
+ifneq ($(NO_JEVENTS),1)
+  NO_JEVENTS := 0
+  ifndef PYTHON
+    $(warning No python interpreter disabling jevent generation)
+    NO_JEVENTS := 1
+  else
+    # jevents.py uses f-strings present in Python 3.6 released in Dec. 2016.
+    JEVENTS_PYTHON_GOOD := $(shell $(PYTHON) -c 'import sys;print("1" if(sys.version_info.major >= 3 and sys.version_info.minor >= 6) else "0")' 2> /dev/null)
+    ifneq ($(JEVENTS_PYTHON_GOOD), 1)
+      $(warning Python interpreter too old (older than 3.6) disabling jevent generation)
+      NO_JEVENTS := 1
+    endif
+  endif
+endif

 ifndef NO_LIBBFD
  ifeq ($(feature-libbfd), 1)
@ -905,13 +925,16 @@ ifndef NO_LIBBFD
    ifeq ($(feature-libbfd-liberty), 1)
      EXTLIBS += -lbfd -lopcodes -liberty
      FEATURE_CHECK_LDFLAGS-disassembler-four-args += -liberty -ldl
+      FEATURE_CHECK_LDFLAGS-disassembler-init-styled += -liberty -ldl
    else
      ifeq ($(feature-libbfd-liberty-z), 1)
        EXTLIBS += -lbfd -lopcodes -liberty -lz
        FEATURE_CHECK_LDFLAGS-disassembler-four-args += -liberty -lz -ldl
+        FEATURE_CHECK_LDFLAGS-disassembler-init-styled += -liberty -lz -ldl
      endif
    endif
    $(call feature_check,disassembler-four-args)
+    $(call feature_check,disassembler-init-styled)
  endif

  ifeq ($(feature-libbfd-buildid), 1)
@ -1025,6 +1048,10 @@ ifeq ($(feature-disassembler-four-args), 1)
    CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
 endif

+ifeq ($(feature-disassembler-init-styled), 1)
+    CFLAGS += -DDISASM_INIT_STYLED
+endif
+
 ifeq (${IS_64_BIT}, 1)
  ifndef NO_PERF_READ_VDSO32
    $(call feature_check,compile-32)
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@ -651,25 +651,15 @@ strip: $(PROGRAMS) $(OUTPUT)perf

 PERF_IN := $(OUTPUT)perf-in.o

-JEVENTS       := $(OUTPUT)pmu-events/jevents
-JEVENTS_IN    := $(OUTPUT)pmu-events/jevents-in.o
-
 PMU_EVENTS_IN := $(OUTPUT)pmu-events/pmu-events-in.o
-
-export JEVENTS
+export NO_JEVENTS

 build := -f $(srctree)/tools/build/Makefile.build dir=. obj

 $(PERF_IN): prepare FORCE
 	$(Q)$(MAKE) $(build)=perf

-$(JEVENTS_IN): FORCE
-	$(Q)$(MAKE) -f $(srctree)/tools/build/Makefile.build dir=pmu-events obj=jevents
-
-$(JEVENTS): $(JEVENTS_IN)
-	$(QUIET_LINK)$(HOSTCC) $(JEVENTS_IN) -o $@
-
-$(PMU_EVENTS_IN): $(JEVENTS) FORCE
+$(PMU_EVENTS_IN): FORCE
 	$(Q)$(MAKE) -f $(srctree)/tools/build/Makefile.build dir=pmu-events obj=pmu-events

 $(OUTPUT)perf: $(PERFLIBS) $(PERF_IN) $(PMU_EVENTS_IN) $(LIBTRACEEVENT_DYNAMIC_LIST)
@ -1038,7 +1028,8 @@ SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
 SKELETONS += $(SKEL_OUT)/bperf_cgroup.skel.h $(SKEL_OUT)/func_latency.skel.h
-SKELETONS += $(SKEL_OUT)/off_cpu.skel.h
+SKELETONS += $(SKEL_OUT)/off_cpu.skel.h $(SKEL_OUT)/lock_contention.skel.h
+SKELETONS += $(SKEL_OUT)/kwork_trace.skel.h

 $(SKEL_TMP_OUT) $(LIBBPF_OUTPUT):
 	$(Q)$(MKDIR) -p $@
@ -1089,7 +1080,7 @@ clean:: $(LIBTRACEEVENT)-clean $(LIBAPI)-clean $(LIBBPF)-clean $(LIBSUBCMD)-clea
 	$(call QUIET_CLEAN, core-objs)  $(RM) $(LIBPERF_A) $(OUTPUT)perf-archive $(OUTPUT)perf-iostat $(LANG_BINDINGS)
 	$(Q)find $(or $(OUTPUT),.) -name '*.o' -delete -o -name '\.*.cmd' -delete -o -name '\.*.d' -delete
 	$(Q)$(RM) $(OUTPUT).config-detected
-	$(call QUIET_CLEAN, core-progs) $(RM) $(ALL_PROGRAMS) perf perf-read-vdso32 perf-read-vdsox32 $(OUTPUT)pmu-events/jevents $(OUTPUT)$(LIBJVMTI).so
+	$(call QUIET_CLEAN, core-progs) $(RM) $(ALL_PROGRAMS) perf perf-read-vdso32 perf-read-vdsox32 $(OUTPUT)$(LIBJVMTI).so
 	$(call QUIET_CLEAN, core-gen)   $(RM)  *.spec *.pyc *.pyo */*.pyc */*.pyo $(OUTPUT)common-cmds.h TAGS tags cscope* $(OUTPUT)PERF-VERSION-FILE $(OUTPUT)FEATURE-DUMP $(OUTPUT)util/*-bison* $(OUTPUT)util/*-flex* \
 		$(OUTPUT)util/intel-pt-decoder/inat-tables.c \
 		$(OUTPUT)tests/llvm-src-{base,kbuild,prologue,relocation}.c \
--- a/tools/perf/arch/x86/tests/Build
+++ b/tools/perf/arch/x86/tests/Build
@ -2,7 +2,6 @@ perf-$(CONFIG_DWARF_UNWIND) += regs_load.o
 perf-$(CONFIG_DWARF_UNWIND) += dwarf-unwind.o

 perf-y += arch-tests.o
-perf-y += rdpmc.o
 perf-y += sample-parsing.o
 perf-$(CONFIG_AUXTRACE) += insn-x86.o intel-pt-pkt-decoder-test.o
 perf-$(CONFIG_X86_64) += bp-modify.o
--- a/tools/perf/arch/x86/tests/arch-tests.c
+++ b/tools/perf/arch/x86/tests/arch-tests.c
@ -3,7 +3,6 @@
 #include "tests/tests.h"
 #include "arch-tests.h"

-DEFINE_SUITE("x86 rdpmc", rdpmc);
 #ifdef HAVE_AUXTRACE_SUPPORT
 DEFINE_SUITE("x86 instruction decoder - new instructions", insn_x86);
 DEFINE_SUITE("Intel PT packet decoder", intel_pt_pkt_decoder);
@ -14,7 +13,6 @@ DEFINE_SUITE("x86 bp modify", bp_modify);
 DEFINE_SUITE("x86 Sample parsing", x86_sample_parsing);

 struct test_suite *arch_tests[] = {
-	&suite__rdpmc,
 #ifdef HAVE_DWARF_UNWIND_SUPPORT
 	&suite__dwarf_unwind,
 #endif
--- a/tools/perf/arch/x86/tests/rdpmc.c
+++ b/tools/perf/arch/x86/tests/rdpmc.c
@ -1,182 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <errno.h>
-#include <unistd.h>
-#include <stdlib.h>
-#include <signal.h>
-#include <sys/mman.h>
-#include <sys/types.h>
-#include <sys/wait.h>
-#include <linux/string.h>
-#include <linux/types.h>
-#include "perf-sys.h"
-#include "debug.h"
-#include "tests/tests.h"
-#include "cloexec.h"
-#include "event.h"
-#include <internal/lib.h> // page_size
-#include "arch-tests.h"
-
-static u64 rdpmc(unsigned int counter)
-{
-	unsigned int low, high;
-
-	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
-
-	return low | ((u64)high) << 32;
-}
-
-static u64 rdtsc(void)
-{
-	unsigned int low, high;
-
-	asm volatile("rdtsc" : "=a" (low), "=d" (high));
-
-	return low | ((u64)high) << 32;
-}
-
-static u64 mmap_read_self(void *addr)
-{
-	struct perf_event_mmap_page *pc = addr;
-	u32 seq, idx, time_mult = 0, time_shift = 0;
-	u64 count, cyc = 0, time_offset = 0, enabled, running, delta;
-
-	do {
-		seq = pc->lock;
-		barrier();
-
-		enabled = pc->time_enabled;
-		running = pc->time_running;
-
-		if (enabled != running) {
-			cyc = rdtsc();
-			time_mult = pc->time_mult;
-			time_shift = pc->time_shift;
-			time_offset = pc->time_offset;
-		}
-
-		idx = pc->index;
-		count = pc->offset;
-		if (idx)
-			count += rdpmc(idx - 1);
-
-		barrier();
-	} while (pc->lock != seq);
-
-	if (enabled != running) {
-		u64 quot, rem;
-
-		quot = (cyc >> time_shift);
-		rem = cyc & (((u64)1 << time_shift) - 1);
-		delta = time_offset + quot * time_mult +
-			((rem * time_mult) >> time_shift);
-
-		enabled += delta;
-		if (idx)
-			running += delta;
-
-		quot = count / running;
-		rem = count % running;
-		count = quot * enabled + (rem * enabled) / running;
-	}
-
-	return count;
-}
-
-/*
- * If the RDPMC instruction faults then signal this back to the test parent task:
- */
-static void segfault_handler(int sig __maybe_unused,
-			     siginfo_t *info __maybe_unused,
-			     void *uc __maybe_unused)
-{
-	exit(-1);
-}
-
-static int __test__rdpmc(void)
-{
-	volatile int tmp = 0;
-	u64 i, loops = 1000;
-	int n;
-	int fd;
-	void *addr;
-	struct perf_event_attr attr = {
-		.type = PERF_TYPE_HARDWARE,
-		.config = PERF_COUNT_HW_INSTRUCTIONS,
-		.exclude_kernel = 1,
-	};
-	u64 delta_sum = 0;
-        struct sigaction sa;
-	char sbuf[STRERR_BUFSIZE];
-
-	sigfillset(&sa.sa_mask);
-	sa.sa_sigaction = segfault_handler;
-	sa.sa_flags = 0;
-	sigaction(SIGSEGV, &sa, NULL);
-
-	fd = sys_perf_event_open(&attr, 0, -1, -1,
-				 perf_event_open_cloexec_flag());
-	if (fd < 0) {
-		pr_err("Error: sys_perf_event_open() syscall returned "
-		       "with %d (%s)\n", fd,
-		       str_error_r(errno, sbuf, sizeof(sbuf)));
-		return -1;
-	}
-
-	addr = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
-	if (addr == (void *)(-1)) {
-		pr_err("Error: mmap() syscall returned with (%s)\n",
-		       str_error_r(errno, sbuf, sizeof(sbuf)));
-		goto out_close;
-	}
-
-	for (n = 0; n < 6; n++) {
-		u64 stamp, now, delta;
-
-		stamp = mmap_read_self(addr);
-
-		for (i = 0; i < loops; i++)
-			tmp++;
-
-		now = mmap_read_self(addr);
-		loops *= 10;
-
-		delta = now - stamp;
-		pr_debug("%14d: %14Lu\n", n, (long long)delta);
-
-		delta_sum += delta;
-	}
-
-	munmap(addr, page_size);
-	pr_debug("   ");
-out_close:
-	close(fd);
-
-	if (!delta_sum)
-		return -1;
-
-	return 0;
-}
-
-int test__rdpmc(struct test_suite *test __maybe_unused, int subtest __maybe_unused)
-{
-	int status = 0;
-	int wret = 0;
-	int ret;
-	int pid;
-
-	pid = fork();
-	if (pid < 0)
-		return -1;
-
-	if (!pid) {
-		ret = __test__rdpmc();
-
-		exit(ret);
-	}
-
-	wret = waitpid(pid, &status, 0);
-	if (wret < 0 || status)
-		return -1;
-
-	return 0;
-}
--- a/tools/perf/arch/x86/util/cpuid.h
+++ b/tools/perf/arch/x86/util/cpuid.h
@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef PERF_CPUID_H
+#define PERF_CPUID_H 1
+
+
+static inline void
+cpuid(unsigned int op, unsigned int op2, unsigned int *a, unsigned int *b,
+	unsigned int *c, unsigned int *d)
+{
+	/*
+	 * Preserve %ebx/%rbx register by either placing it in %rdi or saving it
+	 * on the stack - x86-64 needs to avoid the stack red zone. In PIC
+	 * compilations %ebx contains the address of the global offset
+	 * table. %rbx is occasionally used to address stack variables in
+	 * presence of dynamic allocas.
+	 */
+	asm(
+#if defined(__x86_64__)
+		"mov %%rbx, %%rdi\n"
+		"cpuid\n"
+		"xchg %%rdi, %%rbx\n"
+#else
+		"pushl %%ebx\n"
+		"cpuid\n"
+		"movl %%ebx, %%edi\n"
+		"popl %%ebx\n"
+#endif
+		: "=a"(*a), "=D"(*b), "=c"(*c), "=d"(*d)
+		: "a"(op), "2"(op2));
+}
+
+void get_cpuid_0(char *vendor, unsigned int *lvl);
+
+#endif
--- a/tools/perf/arch/x86/util/evlist.c
+++ b/tools/perf/arch/x86/util/evlist.c
@ -3,20 +3,66 @@
 #include "util/pmu.h"
 #include "util/evlist.h"
 #include "util/parse-events.h"
+#include "util/event.h"
+#include "util/pmu-hybrid.h"
 #include "topdown.h"

-#define TOPDOWN_L1_EVENTS	"{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound}"
-#define TOPDOWN_L2_EVENTS	"{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound,topdown-heavy-ops,topdown-br-mispredict,topdown-fetch-lat,topdown-mem-bound}"
-
-int arch_evlist__add_default_attrs(struct evlist *evlist)
+static int ___evlist__add_default_attrs(struct evlist *evlist,
+					struct perf_event_attr *attrs,
+					size_t nr_attrs)
 {
-	if (!pmu_have_event("cpu", "slots"))
-		return 0;
+	struct perf_cpu_map *cpus;
+	struct evsel *evsel, *n;
+	struct perf_pmu *pmu;
+	LIST_HEAD(head);
+	size_t i = 0;

-	if (pmu_have_event("cpu", "topdown-heavy-ops"))
-		return parse_events(evlist, TOPDOWN_L2_EVENTS, NULL);
-	else
-		return parse_events(evlist, TOPDOWN_L1_EVENTS, NULL);
+	for (i = 0; i < nr_attrs; i++)
+		event_attr_init(attrs + i);
+
+	if (!perf_pmu__has_hybrid())
+		return evlist__add_attrs(evlist, attrs, nr_attrs);
+
+	for (i = 0; i < nr_attrs; i++) {
+		if (attrs[i].type == PERF_TYPE_SOFTWARE) {
+			evsel = evsel__new(attrs + i);
+			if (evsel == NULL)
+				goto out_delete_partial_list;
+			list_add_tail(&evsel->core.node, &head);
+			continue;
+		}
+
+		perf_pmu__for_each_hybrid_pmu(pmu) {
+			evsel = evsel__new(attrs + i);
+			if (evsel == NULL)
+				goto out_delete_partial_list;
+			evsel->core.attr.config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
+			cpus = perf_cpu_map__get(pmu->cpus);
+			evsel->core.cpus = cpus;
+			evsel->core.own_cpus = perf_cpu_map__get(cpus);
+			evsel->pmu_name = strdup(pmu->name);
+			list_add_tail(&evsel->core.node, &head);
+		}
+	}
+
+	evlist__splice_list_tail(evlist, &head);
+
+	return 0;
+
+out_delete_partial_list:
+	__evlist__for_each_entry_safe(&head, n, evsel)
+		evsel__delete(evsel);
+	return -1;
+}
+
+int arch_evlist__add_default_attrs(struct evlist *evlist,
+				   struct perf_event_attr *attrs,
+				   size_t nr_attrs)
+{
+	if (nr_attrs)
+		return ___evlist__add_default_attrs(evlist, attrs, nr_attrs);
+
+	return topdown_parse_events(evlist);
 }

 struct evsel *arch_evlist__leader(struct list_head *list)
--- a/tools/perf/arch/x86/util/evsel.c
+++ b/tools/perf/arch/x86/util/evsel.c
@ -6,6 +6,10 @@
 #include "util/pmu.h"
 #include "linux/string.h"
 #include "evsel.h"
+#include "util/debug.h"
+
+#define IBS_FETCH_L3MISSONLY   (1ULL << 59)
+#define IBS_OP_L3MISSONLY      (1ULL << 16)

 void arch_evsel__set_sample_weight(struct evsel *evsel)
 {
@ -61,3 +65,71 @@ bool arch_evsel__must_be_in_group(const struct evsel *evsel)
 		(strcasestr(evsel->name, "slots") ||
 		 strcasestr(evsel->name, "topdown"));
 }
+
+int arch_evsel__hw_name(struct evsel *evsel, char *bf, size_t size)
+{
+	u64 event = evsel->core.attr.config & PERF_HW_EVENT_MASK;
+	u64 pmu = evsel->core.attr.config >> PERF_PMU_TYPE_SHIFT;
+	const char *event_name;
+
+	if (event < PERF_COUNT_HW_MAX && evsel__hw_names[event])
+		event_name = evsel__hw_names[event];
+	else
+		event_name = "unknown-hardware";
+
+	/* The PMU type is not required for the non-hybrid platform. */
+	if (!pmu)
+		return  scnprintf(bf, size, "%s", event_name);
+
+	return scnprintf(bf, size, "%s/%s/",
+			 evsel->pmu_name ? evsel->pmu_name : "cpu",
+			 event_name);
+}
+
+static void ibs_l3miss_warn(void)
+{
+	pr_warning(
+"WARNING: Hw internally resets sampling period when L3 Miss Filtering is enabled\n"
+"and tagged operation does not cause L3 Miss. This causes sampling period skew.\n");
+}
+
+void arch__post_evsel_config(struct evsel *evsel, struct perf_event_attr *attr)
+{
+	struct perf_pmu *evsel_pmu, *ibs_fetch_pmu, *ibs_op_pmu;
+	static int warned_once;
+	/* 0: Uninitialized, 1: Yes, -1: No */
+	static int is_amd;
+
+	if (warned_once || is_amd == -1)
+		return;
+
+	if (!is_amd) {
+		struct perf_env *env = evsel__env(evsel);
+
+		if (!perf_env__cpuid(env) || !env->cpuid ||
+		    !strstarts(env->cpuid, "AuthenticAMD")) {
+			is_amd = -1;
+			return;
+		}
+		is_amd = 1;
+	}
+
+	evsel_pmu = evsel__find_pmu(evsel);
+	if (!evsel_pmu)
+		return;
+
+	ibs_fetch_pmu = perf_pmu__find("ibs_fetch");
+	ibs_op_pmu = perf_pmu__find("ibs_op");
+
+	if (ibs_fetch_pmu && ibs_fetch_pmu->type == evsel_pmu->type) {
+		if (attr->config & IBS_FETCH_L3MISSONLY) {
+			ibs_l3miss_warn();
+			warned_once = 1;
+		}
+	} else if (ibs_op_pmu && ibs_op_pmu->type == evsel_pmu->type) {
+		if (attr->config & IBS_OP_L3MISSONLY) {
+			ibs_l3miss_warn();
+			warned_once = 1;
+		}
+	}
+}
--- a/tools/perf/arch/x86/util/header.c
+++ b/tools/perf/arch/x86/util/header.c
@ -9,18 +9,17 @@

 #include "../../../util/debug.h"
 #include "../../../util/header.h"
+#include "cpuid.h"

-static inline void
-cpuid(unsigned int op, unsigned int *a, unsigned int *b, unsigned int *c,
-      unsigned int *d)
+void get_cpuid_0(char *vendor, unsigned int *lvl)
 {
-	__asm__ __volatile__ (".byte 0x53\n\tcpuid\n\t"
-			      "movl %%ebx, %%esi\n\t.byte 0x5b"
-			: "=a" (*a),
-			"=S" (*b),
-			"=c" (*c),
-			"=d" (*d)
-			: "a" (op));
+	unsigned int b, c, d;
+
+	cpuid(0, 0, lvl, &b, &c, &d);
+	strncpy(&vendor[0], (char *)(&b), 4);
+	strncpy(&vendor[4], (char *)(&d), 4);
+	strncpy(&vendor[8], (char *)(&c), 4);
+	vendor[12] = '\0';
 }

 static int
@ -31,14 +30,10 @@ __get_cpuid(char *buffer, size_t sz, const char *fmt)
 	int nb;
 	char vendor[16];

-	cpuid(0, &lvl, &b, &c, &d);
-	strncpy(&vendor[0], (char *)(&b), 4);
-	strncpy(&vendor[4], (char *)(&d), 4);
-	strncpy(&vendor[8], (char *)(&c), 4);
-	vendor[12] = '\0';
+	get_cpuid_0(vendor, &lvl);

 	if (lvl >= 1) {
-		cpuid(1, &a, &b, &c, &d);
+		cpuid(1, 0, &a, &b, &c, &d);

 		family = (a >> 8) & 0xf;  /* bits 11 - 8 */
 		model  = (a >> 4) & 0xf;  /* Bits  7 - 4 */
--- a/tools/perf/arch/x86/util/topdown.c
+++ b/tools/perf/arch/x86/util/topdown.c
@ -3,9 +3,17 @@
 #include "api/fs/fs.h"
 #include "util/pmu.h"
 #include "util/topdown.h"
+#include "util/evlist.h"
+#include "util/debug.h"
+#include "util/pmu-hybrid.h"
 #include "topdown.h"
 #include "evsel.h"

+#define TOPDOWN_L1_EVENTS       "{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound}"
+#define TOPDOWN_L1_EVENTS_CORE  "{slots,cpu_core/topdown-retiring/,cpu_core/topdown-bad-spec/,cpu_core/topdown-fe-bound/,cpu_core/topdown-be-bound/}"
+#define TOPDOWN_L2_EVENTS       "{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound,topdown-heavy-ops,topdown-br-mispredict,topdown-fetch-lat,topdown-mem-bound}"
+#define TOPDOWN_L2_EVENTS_CORE  "{slots,cpu_core/topdown-retiring/,cpu_core/topdown-bad-spec/,cpu_core/topdown-fe-bound/,cpu_core/topdown-be-bound/,cpu_core/topdown-heavy-ops/,cpu_core/topdown-br-mispredict/,cpu_core/topdown-fetch-lat/,cpu_core/topdown-mem-bound/}"
+
 /* Check whether there is a PMU which supports the perf metrics. */
 bool topdown_sys_has_perf_metrics(void)
 {
@ -73,3 +81,46 @@ bool arch_topdown_sample_read(struct evsel *leader)

 	return false;
 }
+
+const char *arch_get_topdown_pmu_name(struct evlist *evlist, bool warn)
+{
+	const char *pmu_name;
+
+	if (!perf_pmu__has_hybrid())
+		return "cpu";
+
+	if (!evlist->hybrid_pmu_name) {
+		if (warn)
+			pr_warning("WARNING: default to use cpu_core topdown events\n");
+		evlist->hybrid_pmu_name = perf_pmu__hybrid_type_to_pmu("core");
+	}
+
+	pmu_name = evlist->hybrid_pmu_name;
+
+	return pmu_name;
+}
+
+int topdown_parse_events(struct evlist *evlist)
+{
+	const char *topdown_events;
+	const char *pmu_name;
+
+	if (!topdown_sys_has_perf_metrics())
+		return 0;
+
+	pmu_name = arch_get_topdown_pmu_name(evlist, false);
+
+	if (pmu_have_event(pmu_name, "topdown-heavy-ops")) {
+		if (!strcmp(pmu_name, "cpu_core"))
+			topdown_events = TOPDOWN_L2_EVENTS_CORE;
+		else
+			topdown_events = TOPDOWN_L2_EVENTS;
+	} else {
+		if (!strcmp(pmu_name, "cpu_core"))
+			topdown_events = TOPDOWN_L1_EVENTS_CORE;
+		else
+			topdown_events = TOPDOWN_L1_EVENTS;
+	}
+
+	return parse_events(evlist, topdown_events, NULL);
+}
--- a/tools/perf/arch/x86/util/topdown.h
+++ b/tools/perf/arch/x86/util/topdown.h
@ -3,5 +3,6 @@
 #define _TOPDOWN_H 1

 bool topdown_sys_has_perf_metrics(void);
+int topdown_parse_events(struct evlist *evlist);

 #endif
--- a/tools/perf/arch/x86/util/tsc.c
+++ b/tools/perf/arch/x86/util/tsc.c
@ -1,7 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/types.h>
+#include <math.h>
+#include <string.h>

+#include "../../../util/debug.h"
 #include "../../../util/tsc.h"
+#include "cpuid.h"

 u64 rdtsc(void)
 {
@ -11,3 +15,76 @@ u64 rdtsc(void)

 	return low | ((u64)high) << 32;
 }
+
+/*
+ * Derive the TSC frequency in Hz from the /proc/cpuinfo, for example:
+ * ...
+ * model name      : Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
+ * ...
+ * will return 3000000000.
+ */
+static double cpuinfo_tsc_freq(void)
+{
+	double result = 0;
+	FILE *cpuinfo;
+	char *line = NULL;
+	size_t len = 0;
+
+	cpuinfo = fopen("/proc/cpuinfo", "r");
+	if (!cpuinfo) {
+		pr_err("Failed to read /proc/cpuinfo for TSC frequency");
+		return NAN;
+	}
+	while (getline(&line, &len, cpuinfo) > 0) {
+		if (!strncmp(line, "model name", 10)) {
+			char *pos = strstr(line + 11, " @ ");
+
+			if (pos && sscanf(pos, " @ %lfGHz", &result) == 1) {
+				result *= 1000000000;
+				goto out;
+			}
+		}
+	}
+out:
+	if (fpclassify(result) == FP_ZERO)
+		pr_err("Failed to find TSC frequency in /proc/cpuinfo");
+
+	free(line);
+	fclose(cpuinfo);
+	return result;
+}
+
+double arch_get_tsc_freq(void)
+{
+	unsigned int a, b, c, d, lvl;
+	static bool cached;
+	static double tsc;
+	char vendor[16];
+
+	if (cached)
+		return tsc;
+
+	cached = true;
+	get_cpuid_0(vendor, &lvl);
+	if (!strstr(vendor, "Intel"))
+		return 0;
+
+	/*
+	 * Don't support Time Stamp Counter and
+	 * Nominal Core Crystal Clock Information Leaf.
+	 */
+	if (lvl < 0x15) {
+		tsc = cpuinfo_tsc_freq();
+		return tsc;
+	}
+
+	cpuid(0x15, 0, &a, &b, &c, &d);
+	/* TSC frequency is not enumerated */
+	if (!a || !b || !c) {
+		tsc = cpuinfo_tsc_freq();
+		return tsc;
+	}
+
+	tsc = (double)c * (double)b / (double)a;
+	return tsc;
+}
--- a/tools/perf/builtin-annotate.c
+++ b/tools/perf/builtin-annotate.c
@ -50,7 +50,9 @@ struct perf_annotate {
 	bool	   use_tui;
 #endif
 	bool	   use_stdio, use_stdio2;
+#ifdef HAVE_GTK2_SUPPORT
 	bool	   use_gtk;
+#endif
 	bool	   skip_missing;
 	bool	   has_br_stack;
 	bool	   group_set;
@ -526,7 +528,9 @@ int cmd_annotate(int argc, const char **argv)
 	OPT_BOOLEAN('q', "quiet", &quiet, "do now show any message"),
 	OPT_BOOLEAN('D', "dump-raw-trace", &dump_trace,
 		    "dump raw trace in ASCII"),
+#ifdef HAVE_GTK2_SUPPORT
 	OPT_BOOLEAN(0, "gtk", &annotate.use_gtk, "Use the GTK interface"),
+#endif
 #ifdef HAVE_SLANG_SUPPORT
 	OPT_BOOLEAN(0, "tui", &annotate.use_tui, "Use the TUI interface"),
 #endif
@ -614,10 +618,12 @@ int cmd_annotate(int argc, const char **argv)
 	if (annotate_check_args(&annotate.opts) < 0)
 		return -EINVAL;

+#ifdef HAVE_GTK2_SUPPORT
 	if (symbol_conf.show_nr_samples && annotate.use_gtk) {
 		pr_err("--show-nr-samples is not available in --gtk mode at this time\n");
 		return ret;
 	}
+#endif

 	ret = symbol__validate_sym_arguments();
 	if (ret)
@ -656,8 +662,10 @@ int cmd_annotate(int argc, const char **argv)
 	else if (annotate.use_tui)
 		use_browser = 1;
 #endif
+#ifdef HAVE_GTK2_SUPPORT
 	else if (annotate.use_gtk)
 		use_browser = 2;
+#endif

 	setup_browser(true);

--- a/tools/perf/builtin-buildid-list.c
+++ b/tools/perf/builtin-buildid-list.c
@ -12,14 +12,44 @@
 #include "util/build-id.h"
 #include "util/debug.h"
 #include "util/dso.h"
+#include "util/map.h"
 #include <subcmd/pager.h>
 #include <subcmd/parse-options.h>
 #include "util/session.h"
 #include "util/symbol.h"
 #include "util/data.h"
 #include <errno.h>
+#include <inttypes.h>
 #include <linux/err.h>

+static int buildid__map_cb(struct map *map, void *arg __maybe_unused)
+{
+	const struct dso *dso = map->dso;
+	char bid_buf[SBUILD_ID_SIZE];
+
+	memset(bid_buf, 0, sizeof(bid_buf));
+	if (dso->has_build_id)
+		build_id__sprintf(&dso->bid, bid_buf);
+	printf("%s %16" PRIx64 " %16" PRIx64, bid_buf, map->start, map->end);
+	if (dso->long_name != NULL) {
+		printf(" %s", dso->long_name);
+	} else if (dso->short_name != NULL) {
+		printf(" %s", dso->short_name);
+	}
+	printf("\n");
+
+	return 0;
+}
+
+static void buildid__show_kernel_maps(void)
+{
+	struct machine *machine;
+
+	machine = machine__new_host();
+	machine__for_each_kernel_map(machine, buildid__map_cb, NULL);
+	machine__delete(machine);
+}
+
 static int sysfs__fprintf_build_id(FILE *fp)
 {
 	char sbuild_id[SBUILD_ID_SIZE];
@ -99,6 +129,7 @@ out:
 int cmd_buildid_list(int argc, const char **argv)
 {
 	bool show_kernel = false;
+	bool show_kernel_maps = false;
 	bool with_hits = false;
 	bool force = false;
 	const struct option options[] = {
@ -106,6 +137,8 @@ int cmd_buildid_list(int argc, const char **argv)
 	OPT_STRING('i', "input", &input_name, "file", "input file name"),
 	OPT_BOOLEAN('f', "force", &force, "don't complain, do it"),
 	OPT_BOOLEAN('k', "kernel", &show_kernel, "Show current kernel build id"),
+	OPT_BOOLEAN('m', "kernel-maps", &show_kernel_maps,
+	    "Show build id of current kernel + modules"),
 	OPT_INCR('v', "verbose", &verbose, "be more verbose"),
 	OPT_END()
 	};
@ -117,8 +150,12 @@ int cmd_buildid_list(int argc, const char **argv)
 	argc = parse_options(argc, argv, options, buildid_list_usage, 0);
 	setup_pager();

-	if (show_kernel)
+	if (show_kernel) {
 		return !(sysfs__fprintf_build_id(stdout) > 0);
+	} else if (show_kernel_maps) {
+		buildid__show_kernel_maps();
+		return 0;
+	}

 	return perf_session__list_build_ids(force, with_hits);
 }
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
--- a/tools/perf/builtin-kwork.c
+++ b/tools/perf/builtin-kwork.c
--- a/tools/perf/builtin-list.c
+++ b/tools/perf/builtin-list.c
@ -10,7 +10,7 @@
 */
 #include "builtin.h"

-#include "util/parse-events.h"
+#include "util/print-events.h"
 #include "util/pmu.h"
 #include "util/pmu-hybrid.h"
 #include "util/debug.h"
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@ -1388,6 +1388,11 @@ static struct perf_event_header finished_round_event = {
 	.type = PERF_RECORD_FINISHED_ROUND,
 };

+static struct perf_event_header finished_init_event = {
+	.size = sizeof(struct perf_event_header),
+	.type = PERF_RECORD_FINISHED_INIT,
+};
+
 static void record__adjust_affinity(struct record *rec, struct mmap *map)
 {
 	if (rec->opts.affinity != PERF_AFFINITY_SYS &&
@ -1696,6 +1701,14 @@ static int record__synthesize_workload(struct record *rec, bool tail)
 	return err;
 }

+static int write_finished_init(struct record *rec, bool tail)
+{
+	if (rec->opts.tail_synthesize != tail)
+		return 0;
+
+	return record__write(rec, NULL, &finished_init_event, sizeof(finished_init_event));
+}
+
 static int record__synthesize(struct record *rec, bool tail);

 static int
@ -1710,6 +1723,8 @@ record__switch_output(struct record *rec, bool at_exit)

 	record__aio_mmap_read_sync(rec);

+	write_finished_init(rec, true);
+
 	record__synthesize(rec, true);
 	if (target__none(&rec->opts.target))
 		record__synthesize_workload(rec, true);
@ -1764,6 +1779,7 @@ record__switch_output(struct record *rec, bool at_exit)
 		 */
 		if (target__none(&rec->opts.target))
 			record__synthesize_workload(rec, false);
+		write_finished_init(rec, false);
 	}
 	return fd;
 }
@ -1834,13 +1850,11 @@ static int record__synthesize(struct record *rec, bool tail)
 		goto out;

 	/* Synthesize id_index before auxtrace_info */
-	if (rec->opts.auxtrace_sample_mode || rec->opts.full_auxtrace) {
-		err = perf_event__synthesize_id_index(tool,
-						      process_synthesized_event,
-						      session->evlist, machine);
-		if (err)
-			goto out;
-	}
+	err = perf_event__synthesize_id_index(tool,
+					      process_synthesized_event,
+					      session->evlist, machine);
+	if (err)
+		goto out;

 	if (rec->opts.full_auxtrace) {
 		err = perf_event__synthesize_auxtrace_info(rec->itr, tool,
@ -2421,6 +2435,15 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	trigger_ready(&auxtrace_snapshot_trigger);
 	trigger_ready(&switch_output_trigger);
 	perf_hooks__invoke_record_start();
+
+	/*
+	 * Must write FINISHED_INIT so it will be seen after all other
+	 * synthesized user events, but before any regular events.
+	 */
+	err = write_finished_init(rec, false);
+	if (err < 0)
+		goto out_child;
+
 	for (;;) {
 		unsigned long long hits = thread->samples;

@ -2565,6 +2588,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n",
 			record__waking(rec));

+	write_finished_init(rec, true);
+
 	if (target__none(&rec->opts.target))
 		record__synthesize_workload(rec, true);

@ -3193,6 +3218,8 @@ static struct option __record_options[] = {
 	OPT_BOOLEAN(0, "code-page-size", &record.opts.sample_code_page_size,
 		    "Record the sampled code address (ip) page size"),
 	OPT_BOOLEAN(0, "sample-cpu", &record.opts.sample_cpu, "Record the sample cpu"),
+	OPT_BOOLEAN(0, "sample-identifier", &record.opts.sample_identifier,
+		    "Record the sample identifier"),
 	OPT_BOOLEAN_SET('T', "timestamp", &record.opts.sample_time,
 			&record.opts.sample_time_set,
 			"Record the sample timestamps"),
@ -3805,6 +3832,9 @@ int cmd_record(int argc, const char **argv)
 		goto out_opts;
 	}

+	if (rec->opts.kcore)
+		rec->opts.text_poke = true;
+
 	if (rec->opts.kcore || record__threads_enabled(rec))
 		rec->data.is_dir = true;

--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@ -74,7 +74,9 @@ struct report {
 #ifdef HAVE_SLANG_SUPPORT
 	bool			use_tui;
 #endif
+#ifdef HAVE_GTK2_SUPPORT
 	bool			use_gtk;
+#endif
 	bool			use_stdio;
 	bool			show_full_info;
 	bool			show_threads;
@ -1227,7 +1229,9 @@ int cmd_report(int argc, const char **argv)
 #ifdef HAVE_SLANG_SUPPORT
 	OPT_BOOLEAN(0, "tui", &report.use_tui, "Use the TUI interface"),
 #endif
+#ifdef HAVE_GTK2_SUPPORT
 	OPT_BOOLEAN(0, "gtk", &report.use_gtk, "Use the GTK2 interface"),
+#endif
 	OPT_BOOLEAN(0, "stdio", &report.use_stdio,
 		    "Use the stdio interface"),
 	OPT_BOOLEAN(0, "header", &report.header, "Show data header."),
@ -1516,8 +1520,10 @@ repeat:
 	else if (report.use_tui)
 		use_browser = 1;
 #endif
+#ifdef HAVE_GTK2_SUPPORT
 	else if (report.use_gtk)
 		use_browser = 2;
+#endif

 	/* Force tty output for header output and per-thread stat. */
 	if (report.header || report.header_only || report.show_threads)
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@ -125,6 +125,8 @@ enum perf_output_field {
 	PERF_OUTPUT_CODE_PAGE_SIZE  = 1ULL << 34,
 	PERF_OUTPUT_INS_LAT         = 1ULL << 35,
 	PERF_OUTPUT_BRSTACKINSNLEN  = 1ULL << 36,
+	PERF_OUTPUT_MACHINE_PID     = 1ULL << 37,
+	PERF_OUTPUT_VCPU            = 1ULL << 38,
 };

 struct perf_script {
@ -193,6 +195,8 @@ struct output_option {
 	{.str = "code_page_size", .field = PERF_OUTPUT_CODE_PAGE_SIZE},
 	{.str = "ins_lat", .field = PERF_OUTPUT_INS_LAT},
 	{.str = "brstackinsnlen", .field = PERF_OUTPUT_BRSTACKINSNLEN},
+	{.str = "machine_pid", .field = PERF_OUTPUT_MACHINE_PID},
+	{.str = "vcpu", .field = PERF_OUTPUT_VCPU},
 };

 enum {
@ -746,6 +750,13 @@ static int perf_sample__fprintf_start(struct perf_script *script,
 	int printed = 0;
 	char tstr[128];

+	if (PRINT_FIELD(MACHINE_PID) && sample->machine_pid)
+		printed += fprintf(fp, "VM:%5d ", sample->machine_pid);
+
+	/* Print VCPU only for guest events i.e. with machine_pid */
+	if (PRINT_FIELD(VCPU) && sample->machine_pid)
+		printed += fprintf(fp, "VCPU:%03d ", sample->vcpu);
+
 	if (PRINT_FIELD(COMM)) {
 		const char *comm = thread ? thread__comm_str(thread) : ":-1";

@ -3633,6 +3644,9 @@ int process_thread_map_event(struct perf_session *session,
 	struct perf_tool *tool = session->tool;
 	struct perf_script *script = container_of(tool, struct perf_script, tool);

+	if (dump_trace)
+		perf_event__fprintf_thread_map(event, stdout);
+
 	if (script->threads) {
 		pr_warning("Extra thread map event, ignoring.\n");
 		return 0;
@ -3652,6 +3666,9 @@ int process_cpu_map_event(struct perf_session *session,
 	struct perf_tool *tool = session->tool;
 	struct perf_script *script = container_of(tool, struct perf_script, tool);

+	if (dump_trace)
+		perf_event__fprintf_cpu_map(event, stdout);
+
 	if (script->cpus) {
 		pr_warning("Extra cpu map event, ignoring.\n");
 		return 0;
@ -3740,6 +3757,7 @@ int cmd_script(int argc, const char **argv)
 	bool header = false;
 	bool header_only = false;
 	bool script_started = false;
+	bool unsorted_dump = false;
 	char *rec_script_path = NULL;
 	char *rep_script_path = NULL;
 	struct perf_session *session;
@ -3788,6 +3806,8 @@ int cmd_script(int argc, const char **argv)
 	const struct option options[] = {
 	OPT_BOOLEAN('D', "dump-raw-trace", &dump_trace,
 		    "dump raw trace in ASCII"),
+	OPT_BOOLEAN(0, "dump-unsorted-raw-trace", &unsorted_dump,
+		    "dump unsorted raw trace in ASCII"),
 	OPT_INCR('v', "verbose", &verbose,
 		 "be more verbose (show symbol address, etc)"),
 	OPT_BOOLEAN('L', "Latency", &latency_format,
@ -3950,6 +3970,11 @@ int cmd_script(int argc, const char **argv)
 	data.path  = input_name;
 	data.force = symbol_conf.force;

+	if (unsorted_dump) {
+		dump_trace = true;
+		script.tool.ordered_events = false;
+	}
+
 	if (symbol__validate_sym_arguments())
 		return -1;

--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@ -71,6 +71,7 @@
 #include "util/bpf_counter.h"
 #include "util/iostat.h"
 #include "util/pmu-hybrid.h"
+ #include "util/topdown.h"
 #include "asm/bug.h"

 #include <linux/time64.h>
@ -966,18 +967,18 @@ try_again_reset:
 			return err;
 	}

-	/*
-	 * Enable counters and exec the command:
-	 */
-	if (forks) {
-		err = enable_counters();
-		if (err)
-			return -1;
+	err = enable_counters();
+	if (err)
+		return -1;
+
+	/* Exec the command, if any */
+	if (forks)
 		evlist__start_workload(evsel_list);

-		t0 = rdclock();
-		clock_gettime(CLOCK_MONOTONIC, &ref_time);
+	t0 = rdclock();
+	clock_gettime(CLOCK_MONOTONIC, &ref_time);

+	if (forks) {
 		if (interval || timeout || evlist__ctlfd_initialized(evsel_list))
 			status = dispatch_events(forks, timeout, interval, &times);
 		if (child_pid != -1) {
@ -995,13 +996,6 @@ try_again_reset:
 		if (WIFSIGNALED(status))
 			psignal(WTERMSIG(status), argv[0]);
 	} else {
-		err = enable_counters();
-		if (err)
-			return -1;
-
-		t0 = rdclock();
-		clock_gettime(CLOCK_MONOTONIC, &ref_time);
-
 		status = dispatch_events(forks, timeout, interval, &times);
 	}

@ -1685,12 +1679,6 @@ static int add_default_attributes(void)
  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS	},
  { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_BRANCH_MISSES		},

-};
-	struct perf_event_attr default_sw_attrs[] = {
-  { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_TASK_CLOCK		},
-  { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_CONTEXT_SWITCHES	},
-  { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_CPU_MIGRATIONS		},
-  { .type = PERF_TYPE_SOFTWARE, .config = PERF_COUNT_SW_PAGE_FAULTS		},
 };

 /*
@ -1783,6 +1771,9 @@ static int add_default_attributes(void)
 	(PERF_COUNT_HW_CACHE_OP_PREFETCH	<<  8) |
 	(PERF_COUNT_HW_CACHE_RESULT_MISS	<< 16)				},
 };
+
+	struct perf_event_attr default_null_attrs[] = {};
+
 	/* Set attrs if no event is selected and !null_run: */
 	if (stat_config.null_run)
 		return 0;
@ -1861,22 +1852,11 @@ static int add_default_attributes(void)
 		unsigned int max_level = 1;
 		char *str = NULL;
 		bool warn = false;
-		const char *pmu_name = "cpu";
+		const char *pmu_name = arch_get_topdown_pmu_name(evsel_list, true);

 		if (!force_metric_only)
 			stat_config.metric_only = true;

-		if (perf_pmu__has_hybrid()) {
-			if (!evsel_list->hybrid_pmu_name) {
-				pr_warning("WARNING: default to use cpu_core topdown events\n");
-				evsel_list->hybrid_pmu_name = perf_pmu__hybrid_type_to_pmu("core");
-			}
-
-			pmu_name = evsel_list->hybrid_pmu_name;
-			if (!pmu_name)
-				return -1;
-		}
-
 		if (pmu_have_event(pmu_name, topdown_metric_L2_attrs[5])) {
 			metric_attrs = topdown_metric_L2_attrs;
 			max_level = 2;
@ -1947,30 +1927,6 @@ setup_metrics:
 	}

 	if (!evsel_list->core.nr_entries) {
-		if (perf_pmu__has_hybrid()) {
-			struct parse_events_error errinfo;
-			const char *hybrid_str = "cycles,instructions,branches,branch-misses";
-
-			if (target__has_cpu(&target))
-				default_sw_attrs[0].config = PERF_COUNT_SW_CPU_CLOCK;
-
-			if (evlist__add_default_attrs(evsel_list,
-						      default_sw_attrs) < 0) {
-				return -1;
-			}
-
-			parse_events_error__init(&errinfo);
-			err = parse_events(evsel_list, hybrid_str, &errinfo);
-			if (err) {
-				fprintf(stderr,
-					"Cannot set up hybrid events %s: %d\n",
-					hybrid_str, err);
-				parse_events_error__print(&errinfo, hybrid_str);
-			}
-			parse_events_error__exit(&errinfo);
-			return err ? -1 : 0;
-		}
-
 		if (target__has_cpu(&target))
 			default_attrs0[0].config = PERF_COUNT_SW_CPU_CLOCK;

@ -1988,7 +1944,8 @@ setup_metrics:
 			return -1;

 		stat_config.topdown_level = TOPDOWN_MAX_LEVEL;
-		if (arch_evlist__add_default_attrs(evsel_list) < 0)
+		/* Platform specific attrs */
+		if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
 			return -1;
 	}

--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@ -36,6 +36,7 @@
 #include "util/data.h"
 #include "util/debug.h"
 #include "util/string2.h"
+#include "util/tracepoint.h"
 #include <linux/err.h>

 #ifdef LACKS_OPEN_MEMSTREAM_PROTOTYPE
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@ -53,6 +53,7 @@
 #include "trace-event.h"
 #include "util/parse-events.h"
 #include "util/bpf-loader.h"
+#include "util/tracepoint.h"
 #include "callchain.h"
 #include "print_binary.h"
 #include "string2.h"
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@ -38,6 +38,7 @@ int cmd_mem(int argc, const char **argv);
 int cmd_data(int argc, const char **argv);
 int cmd_ftrace(int argc, const char **argv);
 int cmd_daemon(int argc, const char **argv);
+int cmd_kwork(int argc, const char **argv);

 int find_scripts(char **scripts_array, char **scripts_path_array, int num,
 		 int pathlen);
--- a/tools/perf/command-list.txt
+++ b/tools/perf/command-list.txt
@ -18,6 +18,7 @@ perf-iostat			mainporcelain common
 perf-kallsyms			mainporcelain common
 perf-kmem			mainporcelain common
 perf-kvm			mainporcelain common
+perf-kwork			mainporcelain common
 perf-list			mainporcelain common
 perf-lock			mainporcelain common
 perf-mem			mainporcelain common
--- a/tools/perf/include/perf/perf_dlfilter.h
+++ b/tools/perf/include/perf/perf_dlfilter.h
@ -9,6 +9,12 @@
 #include <linux/perf_event.h>
 #include <linux/types.h>

+/*
+ * The following macro can be used to determine if this header defines
+ * perf_dlfilter_sample machine_pid and vcpu.
+ */
+#define PERF_DLFILTER_HAS_MACHINE_PID
+
 /* Definitions for perf_dlfilter_sample flags */
 enum {
 	PERF_DLFILTER_FLAG_BRANCH	= 1ULL << 0,
@ -62,6 +68,8 @@ struct perf_dlfilter_sample {
 	__u64 raw_callchain_nr;	/* Number of raw_callchain entries */
 	const __u64 *raw_callchain; /* Refer <linux/perf_event.h> */
 	const char *event;
+	__s32 machine_pid;
+	__s32 vcpu;
 };

 /*
--- a/tools/perf/perf.c
+++ b/tools/perf/perf.c
@ -91,6 +91,7 @@ static struct cmd_struct commands[] = {
 	{ "data",	cmd_data,	0 },
 	{ "ftrace",	cmd_ftrace,	0 },
 	{ "daemon",	cmd_daemon,	0 },
+	{ "kwork",	cmd_kwork,	0 },
 };

 struct pager_config {
--- a/tools/perf/pmu-events/Build
+++ b/tools/perf/pmu-events/Build
@ -1,7 +1,3 @@
-hostprogs := jevents
-
-jevents-y	+= json.o jsmn.o jevents.o
-HOSTCFLAGS_jevents.o	= -I$(srctree)/tools/include
 pmu-events-y	+= pmu-events.o
 JDIR		=  pmu-events/arch/$(SRCARCH)
 JSON		=  $(shell [ -d $(JDIR) ] &&				\
@ -9,10 +5,19 @@ JSON		=  $(shell [ -d $(JDIR) ] &&				\
 JDIR_TEST	=  pmu-events/arch/test
 JSON_TEST	=  $(shell [ -d $(JDIR_TEST) ] &&			\
 			find $(JDIR_TEST) -name '*.json')
+JEVENTS_PY	=  pmu-events/jevents.py

 #
 # Locate/process JSON files in pmu-events/arch/
 # directory and create tables in pmu-events.c.
 #
-$(OUTPUT)pmu-events/pmu-events.c: $(JSON) $(JSON_TEST) $(JEVENTS)
-	$(Q)$(call echo-cmd,gen)$(JEVENTS) $(SRCARCH) pmu-events/arch $(OUTPUT)pmu-events/pmu-events.c $(V)
+
+ifeq ($(NO_JEVENTS),1)
+$(OUTPUT)pmu-events/pmu-events.c: pmu-events/empty-pmu-events.c
+	$(call rule_mkdir)
+	$(Q)$(call echo-cmd,gen)cp $< $@
+else
+$(OUTPUT)pmu-events/pmu-events.c: $(JSON) $(JSON_TEST) $(JEVENTS_PY)
+	$(call rule_mkdir)
+	$(Q)$(call echo-cmd,gen)$(PYTHON) $(JEVENTS_PY) $(SRCARCH) pmu-events/arch $@
+endif
--- a/tools/perf/pmu-events/arch/arm64/mapfile.csv
+++ b/tools/perf/pmu-events/arch/arm64/mapfile.csv
@ -27,7 +27,9 @@
 0x00000000410fd0d0,v1,arm/cortex-a77,core
 0x00000000410fd400,v1,arm/neoverse-v1,core
 0x00000000410fd410,v1,arm/cortex-a78,core
+0x00000000410fd4b0,v1,arm/cortex-a78,core
 0x00000000410fd440,v1,arm/cortex-x1,core
+0x00000000410fd4c0,v1,arm/cortex-x1,core
 0x00000000410fd460,v1,arm/cortex-a510,core
 0x00000000410fd470,v1,arm/cortex-a710,core
 0x00000000410fd480,v1,arm/cortex-x2,core
--- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
@ -592,13 +592,13 @@
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
+        "BriefDescription": "Instructions per Branch (lower number means higher occurance rate)",
        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES",
        "MetricName": "IpBranch",
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Instruction per (near) call (lower number means higher occurrence rate)",
+        "BriefDescription": "Instruction per (near) call (lower number means higher occurance rate)",
        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.CALL",
        "MetricName": "IpCall",
        "Unit": "cpu_atom"
--- a/tools/perf/pmu-events/arch/x86/alderlake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/cache.json
@ -1,45 +1,49 @@
 [
    {
-        "BriefDescription": "Counts the number of cycles the core is stalled due to an instruction cache or tlb miss which hit in the L2, LLC, DRAM or MMIO (Non-DRAM).",
+        "BriefDescription": "Counts the number of cycles the core is stalled due to an instruction cache or TLB miss which hit in the L2, LLC, DRAM or MMIO (Non-DRAM).",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "EventCode": "0x34",
        "EventName": "MEM_BOUND_STALLS.IFETCH",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x38",
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts the number of cycles the core is stalled due to an instruction cache or tlb miss which hit in DRAM or MMIO (Non-DRAM).",
+        "BriefDescription": "Counts the number of cycles the core is stalled due to an instruction cache or TLB miss which hit in DRAM or MMIO (Non-DRAM).",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "EventCode": "0x34",
        "EventName": "MEM_BOUND_STALLS.IFETCH_DRAM_HIT",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x20",
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts the number of cycles the core is stalled due to an instruction cache or tlb miss which hit in the L2 cache.",
+        "BriefDescription": "Counts the number of cycles the core is stalled due to an instruction cache or TLB miss which hit in the L2 cache.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "EventCode": "0x34",
        "EventName": "MEM_BOUND_STALLS.IFETCH_L2_HIT",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts the number of cycles the core is stalled due to an instruction cache or tlb miss which hit in the last level cache or other core with HITE/F/M.",
+        "BriefDescription": "Counts the number of cycles the core is stalled due to an instruction cache or TLB miss which hit in the LLC or other core with HITE/F/M.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "EventCode": "0x34",
        "EventName": "MEM_BOUND_STALLS.IFETCH_LLC_HIT",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x10",
        "Unit": "cpu_atom"
    },
@ -51,6 +55,7 @@
        "EventName": "MEM_BOUND_STALLS.LOAD",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x7",
        "Unit": "cpu_atom"
    },
@ -62,6 +67,7 @@
        "EventName": "MEM_BOUND_STALLS.LOAD_DRAM_HIT",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_atom"
    },
@ -73,6 +79,7 @@
        "EventName": "MEM_BOUND_STALLS.LOAD_L2_HIT",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_atom"
    },
@ -84,11 +91,12 @@
        "EventName": "MEM_BOUND_STALLS.LOAD_LLC_HIT",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts the number of load ops retired that hit in DRAM.",
+        "BriefDescription": "Counts the number of load uops retired that hit in DRAM.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "Data_LA": "1",
@ -101,7 +109,7 @@
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts the number of load ops retired that hit in the L2 cache.",
+        "BriefDescription": "Counts the number of load uops retired that hit in the L2 cache.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "Data_LA": "1",
@ -114,9 +122,10 @@
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts the number of load ops retired that hit in the L3 cache.",
+        "BriefDescription": "Counts the number of load uops retired that hit in the L3 cache.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
        "EventCode": "0xd1",
        "EventName": "MEM_LOAD_UOPS_RETIRED.L3_HIT",
        "PEBS": "1",
@ -133,6 +142,7 @@
        "EventName": "MEM_SCHEDULER_BLOCK.ALL",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "20003",
+        "Speculative": "1",
        "UMask": "0x7",
        "Unit": "cpu_atom"
    },
@ -144,6 +154,7 @@
        "EventName": "MEM_SCHEDULER_BLOCK.LD_BUF",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "20003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_atom"
    },
@ -155,6 +166,7 @@
        "EventName": "MEM_SCHEDULER_BLOCK.RSV",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "20003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_atom"
    },
@ -166,6 +178,7 @@
        "EventName": "MEM_SCHEDULER_BLOCK.ST_BUF",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "20003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_atom"
    },
@ -202,6 +215,7 @@
        "Data_LA": "1",
        "EventCode": "0xd0",
        "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128",
+        "L1_Hit_Indication": "1",
        "MSRIndex": "0x3F6",
        "MSRValue": "0x80",
        "PEBS": "2",
@ -218,6 +232,7 @@
        "Data_LA": "1",
        "EventCode": "0xd0",
        "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16",
+        "L1_Hit_Indication": "1",
        "MSRIndex": "0x3F6",
        "MSRValue": "0x10",
        "PEBS": "2",
@ -234,6 +249,7 @@
        "Data_LA": "1",
        "EventCode": "0xd0",
        "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256",
+        "L1_Hit_Indication": "1",
        "MSRIndex": "0x3F6",
        "MSRValue": "0x100",
        "PEBS": "2",
@ -250,6 +266,7 @@
        "Data_LA": "1",
        "EventCode": "0xd0",
        "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32",
+        "L1_Hit_Indication": "1",
        "MSRIndex": "0x3F6",
        "MSRValue": "0x20",
        "PEBS": "2",
@ -266,6 +283,7 @@
        "Data_LA": "1",
        "EventCode": "0xd0",
        "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4",
+        "L1_Hit_Indication": "1",
        "MSRIndex": "0x3F6",
        "MSRValue": "0x4",
        "PEBS": "2",
@ -282,6 +300,7 @@
        "Data_LA": "1",
        "EventCode": "0xd0",
        "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512",
+        "L1_Hit_Indication": "1",
        "MSRIndex": "0x3F6",
        "MSRValue": "0x200",
        "PEBS": "2",
@ -298,6 +317,7 @@
        "Data_LA": "1",
        "EventCode": "0xd0",
        "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64",
+        "L1_Hit_Indication": "1",
        "MSRIndex": "0x3F6",
        "MSRValue": "0x40",
        "PEBS": "2",
@ -314,6 +334,7 @@
        "Data_LA": "1",
        "EventCode": "0xd0",
        "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8",
+        "L1_Hit_Indication": "1",
        "MSRIndex": "0x3F6",
        "MSRValue": "0x8",
        "PEBS": "2",
@ -324,7 +345,7 @@
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts all the retired split loads.",
+        "BriefDescription": "Counts the number of retired split load uops.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "Data_LA": "1",
@ -338,11 +359,13 @@
    },
    {
        "BriefDescription": "Counts the number of stores uops retired. Counts with or without PEBS enabled.",
-        "CollectPEBSRecord": "2",
+        "CollectPEBSRecord": "3",
        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
        "EventCode": "0xd0",
        "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY",
-        "PEBS": "1",
+        "L1_Hit_Indication": "1",
+        "PEBS": "2",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "1000003",
        "UMask": "0x6",
@ -350,7 +373,7 @@
    },
    {
        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, and modified data was forwarded.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5",
        "EventCode": "0xB7",
        "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM",
        "MSRIndex": "0x1a6,0x1a7",
@ -367,9 +390,22 @@
        "EventName": "TOPDOWN_FE_BOUND.ICACHE",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x20",
        "Unit": "cpu_atom"
    },
+    {
+        "BriefDescription": "L1D.HWPF_MISS",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x51",
+        "EventName": "L1D.HWPF_MISS",
+        "PEBScounters": "0,1,2,3",
+        "SampleAfterValue": "1000003",
+        "Speculative": "1",
+        "UMask": "0x20",
+        "Unit": "cpu_core"
+    },
    {
        "BriefDescription": "Counts the number of cache lines replaced in L1 data cache.",
        "CollectPEBSRecord": "2",
@ -378,6 +414,7 @@
        "EventName": "L1D.REPLACEMENT",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
@ -389,6 +426,7 @@
        "EventName": "L1D_PEND_MISS.FB_FULL",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
@ -402,6 +440,7 @@
        "EventName": "L1D_PEND_MISS.FB_FULL_PERIODS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
@ -413,6 +452,7 @@
        "EventName": "L1D_PEND_MISS.L2_STALL",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -424,6 +464,7 @@
        "EventName": "L1D_PEND_MISS.L2_STALLS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -435,6 +476,7 @@
        "EventName": "L1D_PEND_MISS.PENDING",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
@ -447,6 +489,7 @@
        "EventName": "L1D_PEND_MISS.PENDING_CYCLES",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
@ -458,17 +501,31 @@
        "EventName": "L2_LINES_IN.ALL",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x1f",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "All L2 requests.[This event is alias to L2_RQSTS.REFERENCES]",
+        "BriefDescription": "Cache lines that have been L2 hardware prefetched but not used by demand accesses",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x26",
+        "EventName": "L2_LINES_OUT.USELESS_HWPF",
+        "PEBScounters": "0,1,2,3",
+        "SampleAfterValue": "200003",
+        "Speculative": "1",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "All accesses to L2 cache[This event is alias to L2_RQSTS.REFERENCES]",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "EventCode": "0x24",
        "EventName": "L2_REQUEST.ALL",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0xff",
        "Unit": "cpu_core"
    },
@ -480,6 +537,7 @@
        "EventName": "L2_REQUEST.MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x3f",
        "Unit": "cpu_core"
    },
@ -491,17 +549,19 @@
        "EventName": "L2_RQSTS.ALL_CODE_RD",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0xe4",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "Demand Data Read requests",
+        "BriefDescription": "Demand Data Read access L2 cache",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "EventCode": "0x24",
        "EventName": "L2_RQSTS.ALL_DEMAND_DATA_RD",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0xe1",
        "Unit": "cpu_core"
    },
@ -513,9 +573,22 @@
        "EventName": "L2_RQSTS.ALL_DEMAND_MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x27",
        "Unit": "cpu_core"
    },
+    {
+        "BriefDescription": "L2_RQSTS.ALL_HWPF",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x24",
+        "EventName": "L2_RQSTS.ALL_HWPF",
+        "PEBScounters": "0,1,2,3",
+        "SampleAfterValue": "200003",
+        "Speculative": "1",
+        "UMask": "0xf0",
+        "Unit": "cpu_core"
+    },
    {
        "BriefDescription": "RFO requests to L2 cache.",
        "CollectPEBSRecord": "2",
@ -524,6 +597,7 @@
        "EventName": "L2_RQSTS.ALL_RFO",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0xe2",
        "Unit": "cpu_core"
    },
@ -535,6 +609,7 @@
        "EventName": "L2_RQSTS.CODE_RD_HIT",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0xc4",
        "Unit": "cpu_core"
    },
@ -546,6 +621,7 @@
        "EventName": "L2_RQSTS.CODE_RD_MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x24",
        "Unit": "cpu_core"
    },
@ -557,20 +633,34 @@
        "EventName": "L2_RQSTS.DEMAND_DATA_RD_HIT",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0xc1",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "Demand Data Read miss L2, no rejects",
+        "BriefDescription": "Demand Data Read miss L2 cache",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "EventCode": "0x24",
        "EventName": "L2_RQSTS.DEMAND_DATA_RD_MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x21",
        "Unit": "cpu_core"
    },
+    {
+        "BriefDescription": "L2_RQSTS.HWPF_MISS",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x24",
+        "EventName": "L2_RQSTS.HWPF_MISS",
+        "PEBScounters": "0,1,2,3",
+        "SampleAfterValue": "200003",
+        "Speculative": "1",
+        "UMask": "0x30",
+        "Unit": "cpu_core"
+    },
    {
        "BriefDescription": "Read requests with true-miss in L2 cache.[This event is alias to L2_REQUEST.MISS]",
        "CollectPEBSRecord": "2",
@ -579,17 +669,19 @@
        "EventName": "L2_RQSTS.MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x3f",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "All L2 requests.[This event is alias to L2_REQUEST.ALL]",
+        "BriefDescription": "All accesses to L2 cache[This event is alias to L2_REQUEST.ALL]",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "EventCode": "0x24",
        "EventName": "L2_RQSTS.REFERENCES",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0xff",
        "Unit": "cpu_core"
    },
@ -601,6 +693,7 @@
        "EventName": "L2_RQSTS.RFO_HIT",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0xc2",
        "Unit": "cpu_core"
    },
@ -612,6 +705,7 @@
        "EventName": "L2_RQSTS.RFO_MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x22",
        "Unit": "cpu_core"
    },
@ -623,6 +717,7 @@
        "EventName": "L2_RQSTS.SWPF_HIT",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0xc8",
        "Unit": "cpu_core"
    },
@ -634,22 +729,36 @@
        "EventName": "L2_RQSTS.SWPF_MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x28",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "Core-originated cacheable requests that missed L3  (Except hardware prefetches to the L3)",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0x2e",
        "EventName": "LONGEST_LAT_CACHE.MISS",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x41",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "All retired load instructions.",
+        "BriefDescription": "Core-originated cacheable requests that refer to L3 (Except hardware prefetches to the L3)",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x2e",
+        "EventName": "LONGEST_LAT_CACHE.REFERENCE",
+        "PEBScounters": "0,1,2,3,4,5,6,7",
+        "SampleAfterValue": "100003",
+        "Speculative": "1",
+        "UMask": "0x4f",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Retired load instructions.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "Data_LA": "1",
@ -662,7 +771,7 @@
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "All retired store instructions.",
+        "BriefDescription": "Retired store instructions.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "Data_LA": "1",
@ -764,6 +873,7 @@
        "EventName": "MEM_LOAD_COMPLETED.L1_MISS_ANY",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0xfd",
        "Unit": "cpu_core"
    },
@ -961,7 +1071,7 @@
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "MEM_STORE_RETIRED.L2_HIT",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "EventCode": "0x44",
@ -983,7 +1093,7 @@
    },
    {
        "BriefDescription": "Counts demand data reads that resulted in a snoop hit in another cores caches, data forwarding is required as the data is modified.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0x2A,0x2B",
        "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM",
        "MSRIndex": "0x1a6,0x1a7",
@ -993,8 +1103,8 @@
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "DEMAND_DATA_RD & L3_HIT & SNOOP_HIT_WITH_FWD",
-        "Counter": "0,1,2,3",
+        "BriefDescription": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0x2A,0x2B",
        "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD",
        "MSRIndex": "0x1a6,0x1a7",
@ -1005,7 +1115,7 @@
    },
    {
        "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that resulted in a snoop hit in another cores caches, data forwarding is required as the data is modified.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0x2A,0x2B",
        "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM",
        "MSRIndex": "0x1a6,0x1a7",
@ -1015,13 +1125,14 @@
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "OFFCORE_REQUESTS.ALL_REQUESTS",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "EventCode": "0x21",
        "EventName": "OFFCORE_REQUESTS.ALL_REQUESTS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x80",
        "Unit": "cpu_core"
    },
@ -1033,6 +1144,7 @@
        "EventName": "OFFCORE_REQUESTS.DATA_RD",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
@ -1044,6 +1156,7 @@
        "EventName": "OFFCORE_REQUESTS.DEMAND_DATA_RD",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
@ -1051,22 +1164,26 @@
        "BriefDescription": "This event is deprecated. Refer to new event OFFCORE_REQUESTS_OUTSTANDING.DATA_RD",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
+        "Errata": "ADL038",
        "EventCode": "0x20",
        "EventName": "OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "CounterMask": "1",
+        "Errata": "ADL038",
        "EventCode": "0x20",
        "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
@ -1079,17 +1196,20 @@
        "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
+        "Errata": "ADL038",
        "EventCode": "0x20",
        "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
@ -1101,6 +1221,7 @@
        "EventName": "SW_PREFETCH_ACCESS.NTA",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
@ -1112,6 +1233,7 @@
        "EventName": "SW_PREFETCH_ACCESS.PREFETCHW",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
@ -1123,6 +1245,7 @@
        "EventName": "SW_PREFETCH_ACCESS.T0",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
@ -1134,7 +1257,8 @@
        "EventName": "SW_PREFETCH_ACCESS.T1_T2",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
@ -7,6 +7,7 @@
        "EventName": "MACHINE_CLEARS.FP_ASSIST",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "20003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_atom"
    },
@ -23,7 +24,7 @@
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "ARITH.FPDIV_ACTIVE",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5,6,7",
        "CounterMask": "1",
@ -31,6 +32,7 @@
        "EventName": "ARITH.FPDIV_ACTIVE",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
@ -42,50 +44,55 @@
        "EventName": "ASSISTS.FP",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "ASSISTS.SSE_AVX_MIX",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0xc1",
        "EventName": "ASSISTS.SSE_AVX_MIX",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x10",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "FP_ARITH_DISPATCHED.PORT_0",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0xb3",
        "EventName": "FP_ARITH_DISPATCHED.PORT_0",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "FP_ARITH_DISPATCHED.PORT_1",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0xb3",
        "EventName": "FP_ARITH_DISPATCHED.PORT_1",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "FP_ARITH_DISPATCHED.PORT_5",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0xb3",
        "EventName": "FP_ARITH_DISPATCHED.PORT_5",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -155,4 +162,4 @@
        "UMask": "0x2",
        "Unit": "cpu_core"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/alderlake/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/frontend.json
@ -7,6 +7,7 @@
        "EventName": "BACLEARS.ANY",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_atom"
    },
@ -18,6 +19,7 @@
        "EventName": "ICACHE.ACCESSES",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x3",
        "Unit": "cpu_atom"
    },
@ -29,6 +31,7 @@
        "EventName": "ICACHE.MISSES",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_atom"
    },
@ -40,6 +43,7 @@
        "EventName": "DECODE.LCP",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "500009",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
@ -51,6 +55,7 @@
        "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
@ -294,6 +299,21 @@
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
+    {
+        "BriefDescription": "FRONTEND_RETIRED.MS_FLOWS",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc6",
+        "EventName": "FRONTEND_RETIRED.MS_FLOWS",
+        "MSRIndex": "0x3F7",
+        "MSRValue": "0x8",
+        "PEBS": "1",
+        "PEBScounters": "0,1,2,3,4,5,6,7",
+        "SampleAfterValue": "100007",
+        "TakenAlone": "1",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
    {
        "BriefDescription": "Retired Instructions who experienced STLB (2nd level TLB) true miss.",
        "CollectPEBSRecord": "2",
@ -310,7 +330,7 @@
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "FRONTEND_RETIRED.UNKNOWN_BRANCH",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0xc6",
@ -332,6 +352,7 @@
        "EventName": "ICACHE_DATA.STALLS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "500009",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -343,6 +364,7 @@
        "EventName": "ICACHE_TAG.STALLS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -355,6 +377,7 @@
        "EventName": "IDQ.DSB_CYCLES_ANY",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
@ -367,6 +390,7 @@
        "EventName": "IDQ.DSB_CYCLES_OK",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
@ -378,6 +402,7 @@
        "EventName": "IDQ.DSB_UOPS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
@ -390,6 +415,7 @@
        "EventName": "IDQ.MITE_CYCLES_ANY",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -402,6 +428,7 @@
        "EventName": "IDQ.MITE_CYCLES_OK",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -413,6 +440,7 @@
        "EventName": "IDQ.MITE_UOPS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -425,6 +453,7 @@
        "EventName": "IDQ.MS_CYCLES_ANY",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0x20",
        "Unit": "cpu_core"
    },
@ -438,6 +467,7 @@
        "EventName": "IDQ.MS_SWITCHES",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x20",
        "Unit": "cpu_core"
    },
@ -449,6 +479,7 @@
        "EventName": "IDQ.MS_UOPS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x20",
        "Unit": "cpu_core"
    },
@ -460,6 +491,7 @@
        "EventName": "IDQ_UOPS_NOT_DELIVERED.CORE",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
@ -472,6 +504,7 @@
        "EventName": "IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    },
@ -485,7 +518,8 @@
        "Invert": "1",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/alderlake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/memory.json
@ -1,52 +1,61 @@
 [
    {
        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to any number of reasons, including an L1 miss, WCB full, pagewalk, store address block or store data block, on a load that retires.",
+        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "EventCode": "0x05",
        "EventName": "LD_HEAD.ANY_AT_RET",
+        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0xff",
        "Unit": "cpu_atom"
    },
    {
        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to a core bound stall including a store address match, a DTLB miss or a page walk that detains the load from retiring.",
-        "Counter": "0,1,2,3",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3,4,5",
        "EventCode": "0x05",
        "EventName": "LD_HEAD.L1_BOUND_AT_RET",
+        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0xf4",
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to other block cases when load subsequently retires when load subsequently retires.",
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to other block cases.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "EventCode": "0x05",
        "EventName": "LD_HEAD.OTHER_AT_RET",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0xc0",
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to a pagewalk when load subsequently retires.",
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to a pagewalk.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "EventCode": "0x05",
        "EventName": "LD_HEAD.PGWALK_AT_RET",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0xa0",
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to a store address match when load subsequently retires.",
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to a store address match.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "EventCode": "0x05",
        "EventName": "LD_HEAD.ST_ADDR_AT_RET",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x84",
        "Unit": "cpu_atom"
    },
@ -58,12 +67,13 @@
        "EventName": "MACHINE_CLEARS.MEMORY_ORDERING",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "20003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_atom"
    },
    {
        "BriefDescription": "Counts demand data reads that were not supplied by the L3 cache.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5",
        "EventCode": "0xB7",
        "EventName": "OCR.DEMAND_DATA_RD.L3_MISS",
        "MSRIndex": "0x1a6,0x1a7",
@ -74,7 +84,7 @@
    },
    {
        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were not supplied by the L3 cache.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5",
        "EventCode": "0xB7",
        "EventName": "OCR.DEMAND_RFO.L3_MISS",
        "MSRIndex": "0x1a6,0x1a7",
@ -92,6 +102,7 @@
        "EventName": "CYCLE_ACTIVITY.STALLS_L3_MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x6",
        "Unit": "cpu_core"
    },
@ -103,6 +114,7 @@
        "EventName": "MACHINE_CLEARS.MEMORY_ORDERING",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
@ -115,6 +127,7 @@
        "EventName": "MEMORY_ACTIVITY.CYCLES_L1D_MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
@ -127,11 +140,12 @@
        "EventName": "MEMORY_ACTIVITY.STALLS_L1D_MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x3",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "MEMORY_ACTIVITY.STALLS_L2_MISS",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "CounterMask": "5",
@ -139,11 +153,12 @@
        "EventName": "MEMORY_ACTIVITY.STALLS_L2_MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x5",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "MEMORY_ACTIVITY.STALLS_L3_MISS",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "CounterMask": "9",
@ -151,6 +166,7 @@
        "EventName": "MEMORY_ACTIVITY.STALLS_L3_MISS",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x9",
        "Unit": "cpu_core"
    },
@ -283,7 +299,7 @@
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "Retired instructions with at least 1 store uop. This PEBS event is the trigger for stores sampled by the PEBS Store Facility.",
+        "BriefDescription": "Retired memory store access operations. A PDist event for PEBS Store Latency Facility.",
        "CollectPEBSRecord": "2",
        "Data_LA": "1",
        "EventCode": "0xcd",
@ -295,7 +311,7 @@
    },
    {
        "BriefDescription": "Counts demand data reads that were not supplied by the L3 cache.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0x2A,0x2B",
        "EventName": "OCR.DEMAND_DATA_RD.L3_MISS",
        "MSRIndex": "0x1a6,0x1a7",
@ -306,7 +322,7 @@
    },
    {
        "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that were not supplied by the L3 cache.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0x2A,0x2B",
        "EventName": "OCR.DEMAND_RFO.L3_MISS",
        "MSRIndex": "0x1a6,0x1a7",
@ -315,4 +331,4 @@
        "UMask": "0x1",
        "Unit": "cpu_core"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/alderlake/other.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/other.json
@ -1,7 +1,7 @@
 [
    {
        "BriefDescription": "Counts demand data reads that have any type of response.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5",
        "EventCode": "0xB7",
        "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE",
        "MSRIndex": "0x1a6,0x1a7",
@ -12,7 +12,7 @@
    },
    {
        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5",
        "EventCode": "0xB7",
        "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE",
        "MSRIndex": "0x1a6,0x1a7",
@ -23,7 +23,7 @@
    },
    {
        "BriefDescription": "Counts streaming stores that have any type of response.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5",
        "EventCode": "0xB7",
        "EventName": "OCR.STREAMING_WR.ANY_RESPONSE",
        "MSRIndex": "0x1a6,0x1a7",
@ -33,74 +33,68 @@
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Number of occurrences where a microcode assist is invoked by hardware.",
-        "CollectPEBSRecord": "2",
-        "Counter": "0,1,2,3,4,5,6,7",
-        "EventCode": "0xc1",
-        "EventName": "ASSISTS.ANY",
-        "PEBScounters": "0,1,2,3,4,5,6,7",
-        "SampleAfterValue": "100003",
-        "UMask": "0x1f",
-        "Unit": "cpu_core"
-    },
-    {
-        "BriefDescription": "Count all other microcode assist beyond FP, AVX_TILE_MIX and A/D assists (counted by their own sub-events). This includes assists at uop writeback like AVX* load/store (non-FP) assists, Null Assist in SNC (due to lack of FP precision format convert with FMA3x3 uarch) or assists generated by ROB (like assists to due to Missprediction for FSW register - fixed in SNC)",
+        "BriefDescription": "ASSISTS.HARDWARE",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0xc1",
        "EventName": "ASSISTS.HARDWARE",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "ASSISTS.PAGE_FAULT",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0xc1",
        "EventName": "ASSISTS.PAGE_FAULT",
        "PEBScounters": "0,1,2,3,4,5,6,7",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "CORE_POWER.LICENSE_1",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "EventCode": "0x28",
        "EventName": "CORE_POWER.LICENSE_1",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "CORE_POWER.LICENSE_2",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "EventCode": "0x28",
        "EventName": "CORE_POWER.LICENSE_2",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "CORE_POWER.LICENSE_3",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "EventCode": "0x28",
        "EventName": "CORE_POWER.LICENSE_3",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
    {
        "BriefDescription": "Counts demand data reads that have any type of response.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0x2A,0x2B",
        "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE",
        "MSRIndex": "0x1a6,0x1a7",
@ -111,7 +105,7 @@
    },
    {
        "BriefDescription": "Counts demand read for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0x2A,0x2B",
        "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE",
        "MSRIndex": "0x1a6,0x1a7",
@ -122,7 +116,7 @@
    },
    {
        "BriefDescription": "Counts streaming stores that have any type of response.",
-        "Counter": "0,1,2,3",
+        "Counter": "0,1,2,3,4,5,6,7",
        "EventCode": "0x2A,0x2B",
        "EventName": "OCR.STREAMING_WR.ANY_RESPONSE",
        "MSRIndex": "0x1a6,0x1a7",
@ -132,7 +126,61 @@
        "Unit": "cpu_core"
    },
    {
-        "BriefDescription": "TBD",
+        "BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread.",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xa5",
+        "EventName": "RS.EMPTY",
+        "PEBScounters": "0,1,2,3,4,5,6,7",
+        "SampleAfterValue": "1000003",
+        "Speculative": "1",
+        "UMask": "0x7",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts end of periods where the Reservation Station (RS) was empty.",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "CounterMask": "1",
+        "EdgeDetect": "1",
+        "EventCode": "0xa5",
+        "EventName": "RS.EMPTY_COUNT",
+        "Invert": "1",
+        "PEBScounters": "0,1,2,3,4,5,6,7",
+        "SampleAfterValue": "100003",
+        "Speculative": "1",
+        "UMask": "0x7",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event RS.EMPTY_COUNT",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "CounterMask": "1",
+        "EdgeDetect": "1",
+        "EventCode": "0xa5",
+        "EventName": "RS_EMPTY.COUNT",
+        "Invert": "1",
+        "PEBScounters": "0,1,2,3,4,5,6,7",
+        "SampleAfterValue": "100003",
+        "Speculative": "1",
+        "UMask": "0x7",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event RS.EMPTY",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xa5",
+        "EventName": "RS_EMPTY.CYCLES",
+        "PEBScounters": "0,1,2,3,4,5,6,7",
+        "SampleAfterValue": "1000003",
+        "Speculative": "1",
+        "UMask": "0x7",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "XQ.FULL_CYCLES",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3",
        "CounterMask": "1",
@ -140,7 +188,8 @@
        "EventName": "XQ.FULL_CYCLES",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x1",
        "Unit": "cpu_core"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json
--- a/tools/perf/pmu-events/arch/x86/alderlake/uncore-other.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/uncore-other.json
@ -3,7 +3,7 @@
        "BriefDescription": "This 48-bit fixed counter counts the UCLK cycles",
        "Counter": "Fixed",
        "CounterType": "PGMABLE",
-	"EventCode": "0xff",
+        "EventCode": "0xff",
        "EventName": "UNC_CLOCK.SOCKET",
        "PerPkg": "1",
        "Unit": "CLOCK"
--- a/tools/perf/pmu-events/arch/x86/alderlake/virtual-memory.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/virtual-memory.json
@ -7,6 +7,7 @@
        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "200003",
+        "Speculative": "1",
        "UMask": "0xe",
        "Unit": "cpu_atom"
    },
@ -18,17 +19,55 @@
        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "2000003",
+        "Speculative": "1",
        "UMask": "0xe",
        "Unit": "cpu_atom"
    },
    {
-        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to a DTLB miss when load subsequently retires.",
+        "BriefDescription": "Counts the number of page walks initiated by a instruction fetch that missed the first and second level TLBs.",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.MISS_CAUSED_WALK",
+        "PEBScounters": "0,1,2,3,4,5",
+        "SampleAfterValue": "1000003",
+        "Speculative": "1",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks due to an instruction fetch that miss the PDE (Page Directory Entry) cache.",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.PDE_CACHE_MISS",
+        "PEBScounters": "0,1,2,3,4,5",
+        "SampleAfterValue": "2000003",
+        "Speculative": "1",
+        "UMask": "0x80",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to instruction fetch misses to any page size.",
+        "CollectPEBSRecord": "2",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.WALK_COMPLETED",
+        "PEBScounters": "0,1,2,3,4,5",
+        "SampleAfterValue": "200003",
+        "Speculative": "1",
+        "UMask": "0xe",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to a DTLB miss.",
        "CollectPEBSRecord": "2",
        "Counter": "0,1,2,3,4,5",
        "EventCode": "0x05",
        "EventName": "LD_HEAD.DTLB_MISS_AT_RET",
        "PEBScounters": "0,1,2,3,4,5",
        "SampleAfterValue": "1000003",
+        "Speculative": "1",
        "UMask": "0x90",
        "Unit": "cpu_atom"
    },
@ -40,6 +79,7 @@
        "EventName": "DTLB_LOAD_MISSES.STLB_HIT",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x20",
        "Unit": "cpu_core"
    },
@ -52,6 +92,7 @@
        "EventName": "DTLB_LOAD_MISSES.WALK_ACTIVE",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x10",
        "Unit": "cpu_core"
    },
@ -63,6 +104,7 @@
        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0xe",
        "Unit": "cpu_core"
    },
@ -74,6 +116,7 @@
        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
@ -85,6 +128,7 @@
        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -96,6 +140,7 @@
        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
@ -107,6 +152,7 @@
        "EventName": "DTLB_LOAD_MISSES.WALK_PENDING",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x10",
        "Unit": "cpu_core"
    },
@ -118,6 +164,7 @@
        "EventName": "DTLB_STORE_MISSES.STLB_HIT",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x20",
        "Unit": "cpu_core"
    },
@ -130,6 +177,7 @@
        "EventName": "DTLB_STORE_MISSES.WALK_ACTIVE",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x10",
        "Unit": "cpu_core"
    },
@ -141,6 +189,7 @@
        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0xe",
        "Unit": "cpu_core"
    },
@ -152,6 +201,7 @@
        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_1G",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x8",
        "Unit": "cpu_core"
    },
@ -163,6 +213,7 @@
        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -174,6 +225,7 @@
        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
@ -185,6 +237,7 @@
        "EventName": "DTLB_STORE_MISSES.WALK_PENDING",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x10",
        "Unit": "cpu_core"
    },
@ -196,6 +249,7 @@
        "EventName": "ITLB_MISSES.STLB_HIT",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x20",
        "Unit": "cpu_core"
    },
@ -208,6 +262,7 @@
        "EventName": "ITLB_MISSES.WALK_ACTIVE",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x10",
        "Unit": "cpu_core"
    },
@ -219,6 +274,7 @@
        "EventName": "ITLB_MISSES.WALK_COMPLETED",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0xe",
        "Unit": "cpu_core"
    },
@ -230,6 +286,7 @@
        "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x4",
        "Unit": "cpu_core"
    },
@ -241,6 +298,7 @@
        "EventName": "ITLB_MISSES.WALK_COMPLETED_4K",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x2",
        "Unit": "cpu_core"
    },
@ -252,7 +310,8 @@
        "EventName": "ITLB_MISSES.WALK_PENDING",
        "PEBScounters": "0,1,2,3",
        "SampleAfterValue": "100003",
+        "Speculative": "1",
        "UMask": "0x10",
        "Unit": "cpu_core"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/bonnell/cache.json
+++ b/tools/perf/pmu-events/arch/x86/bonnell/cache.json
@ -743,4 +743,4 @@
        "SampleAfterValue": "10000",
        "UMask": "0x2"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/bonnell/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/bonnell/floating-point.json
@ -258,4 +258,4 @@
        "SampleAfterValue": "2000000",
        "UMask": "0x2"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/bonnell/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/bonnell/frontend.json
@ -88,4 +88,4 @@
        "SampleAfterValue": "2000000",
        "UMask": "0x1"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/bonnell/memory.json
+++ b/tools/perf/pmu-events/arch/x86/bonnell/memory.json
@ -151,4 +151,4 @@
        "SampleAfterValue": "200000",
        "UMask": "0x86"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/bonnell/other.json
+++ b/tools/perf/pmu-events/arch/x86/bonnell/other.json
@ -447,4 +447,4 @@
        "SampleAfterValue": "200000",
        "UMask": "0xc0"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/bonnell/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/bonnell/pipeline.json
@ -353,4 +353,4 @@
        "SampleAfterValue": "2000000",
        "UMask": "0x10"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/bonnell/virtual-memory.json
+++ b/tools/perf/pmu-events/arch/x86/bonnell/virtual-memory.json
@ -121,4 +121,4 @@
        "SampleAfterValue": "200000",
        "UMask": "0x3"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
@ -130,43 +130,25 @@
        "MetricName": "FLOPc_SMT"
    },
    {
-        "BriefDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width)",
+        "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
        "MetricExpr": "( (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) ) / ( 2 * CPU_CLK_UNHALTED.THREAD )",
        "MetricGroup": "Cor;Flops;HPC",
        "MetricName": "FP_Arith_Utilization",
-        "PublicDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width). Values > 1 are possible due to Fused-Multiply Add (FMA) counting."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
    },
    {
-        "BriefDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width). SMT version; use when SMT is enabled and measuring per logical CPU.",
+        "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). SMT version; use when SMT is enabled and measuring per logical CPU.",
        "MetricExpr": "( (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) ) / ( 2 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ) )",
        "MetricGroup": "Cor;Flops;HPC_SMT",
        "MetricName": "FP_Arith_Utilization_SMT",
-        "PublicDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width). Values > 1 are possible due to Fused-Multiply Add (FMA) counting. SMT version; use when SMT is enabled and measuring per logical CPU."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common). SMT version; use when SMT is enabled and measuring per logical CPU."
    },
    {
-        "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is at least 1 uop executed)",
+        "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core",
        "MetricExpr": "UOPS_EXECUTED.THREAD / (( cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 ) if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)",
        "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
        "MetricName": "ILP"
    },
-    {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": " ( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / (4 * CPU_CLK_UNHALTED.THREAD))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) ) * (4 * CPU_CLK_UNHALTED.THREAD) / BR_MISP_RETIRED.ALL_BRANCHES",
-        "MetricGroup": "Bad;BrMispredicts",
-        "MetricName": "Branch_Misprediction_Cost"
-    },
-    {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": " ( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) ) / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) ) * (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )) / BR_MISP_RETIRED.ALL_BRANCHES",
-        "MetricGroup": "Bad;BrMispredicts_SMT",
-        "MetricName": "Branch_Misprediction_Cost_SMT"
-    },
-    {
-        "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear)",
-        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
-        "MetricGroup": "Bad;BadSpec;BrMispredicts",
-        "MetricName": "IpMispredict"
-    },
    {
        "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
        "MetricExpr": "( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )",
@ -256,6 +238,18 @@
        "MetricGroup": "Summary;TmaL1",
        "MetricName": "Instructions"
    },
+    {
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
+        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@",
+        "MetricGroup": "Pipeline;Ret",
+        "MetricName": "Retire"
+    },
+    {
+        "BriefDescription": "",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
+        "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
+        "MetricName": "Execute"
+    },
    {
        "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)",
        "MetricExpr": "IDQ.DSB_UOPS / (( IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS ) )",
@ -263,11 +257,28 @@
        "MetricName": "DSB_Coverage"
    },
    {
-        "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load instructions (in core cycles)",
+        "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts",
+        "MetricName": "IpMispredict"
+    },
+    {
+        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": " ( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / (4 * CPU_CLK_UNHALTED.THREAD))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) ) * (4 * CPU_CLK_UNHALTED.THREAD) / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "Branch_Misprediction_Cost"
+    },
+    {
+        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": " ( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) ) / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) ) * (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )) / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BrMispredicts_SMT",
+        "MetricName": "Branch_Misprediction_Cost_SMT"
+    },
+    {
+        "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)",
        "MetricExpr": "L1D_PEND_MISS.PENDING / ( MEM_LOAD_UOPS_RETIRED.L1_MISS + mem_load_uops_retired.hit_lfb )",
        "MetricGroup": "Mem;MemoryBound;MemoryLat",
-        "MetricName": "Load_Miss_Real_Latency",
-        "PublicDescription": "Actual Average Latency for L1 data-cache miss demand load instructions (in core cycles). Latency may be overestimated for multi-load instructions - e.g. repeat strings."
+        "MetricName": "Load_Miss_Real_Latency"
    },
    {
        "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)",
@ -275,24 +286,6 @@
        "MetricGroup": "Mem;MemoryBound;MemoryBW",
        "MetricName": "MLP"
    },
-    {
-        "BriefDescription": "Average data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1000000000 / duration_time",
-        "MetricGroup": "Mem;MemoryBW",
-        "MetricName": "L1D_Cache_Fill_BW"
-    },
-    {
-        "BriefDescription": "Average data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1000000000 / duration_time",
-        "MetricGroup": "Mem;MemoryBW",
-        "MetricName": "L2_Cache_Fill_BW"
-    },
-    {
-        "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1000000000 / duration_time",
-        "MetricGroup": "Mem;MemoryBW",
-        "MetricName": "L3_Cache_Fill_BW"
-    },
    {
        "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads",
        "MetricExpr": "1000 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY",
@ -306,13 +299,13 @@
        "MetricName": "L2MPKI"
    },
    {
-        "BriefDescription": "L2 cache misses per kilo instruction for all request types (including speculative)",
+        "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)",
        "MetricExpr": "1000 * L2_RQSTS.MISS / INST_RETIRED.ANY",
        "MetricGroup": "Mem;CacheMisses;Offcore",
        "MetricName": "L2MPKI_All"
    },
    {
-        "BriefDescription": "L2 cache misses per kilo instruction for all demand loads  (including speculative)",
+        "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads  (including speculative)",
        "MetricExpr": "1000 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY",
        "MetricGroup": "Mem;CacheMisses",
        "MetricName": "L2MPKI_Load"
@ -348,6 +341,48 @@
        "MetricGroup": "Mem;MemoryTLB_SMT",
        "MetricName": "Page_Walks_Utilization_SMT"
    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1000000000 / duration_time",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L1D_Cache_Fill_BW"
+    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1000000000 / duration_time",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L2_Cache_Fill_BW"
+    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1000000000 / duration_time",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L3_Cache_Fill_BW"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
+        "MetricExpr": "(64 * L1D.REPLACEMENT / 1000000000 / duration_time)",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L1D_Cache_Fill_BW_1T"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
+        "MetricExpr": "(64 * L2_LINES_IN.ALL / 1000000000 / duration_time)",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L2_Cache_Fill_BW_1T"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "(64 * LONGEST_LAT_CACHE.MISS / 1000000000 / duration_time)",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L3_Cache_Fill_BW_1T"
+    },
+    {
+        "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "0",
+        "MetricGroup": "Mem;MemoryBW;Offcore",
+        "MetricName": "L3_Cache_Access_BW_1T"
+    },
    {
        "BriefDescription": "Average CPU Utilization",
        "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / msr@tsc@",
@ -364,7 +399,8 @@
        "BriefDescription": "Giga Floating Point Operations Per Second",
        "MetricExpr": "( ( 1 * ( FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE ) + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * ( FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE ) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE ) / 1000000000 ) / duration_time",
        "MetricGroup": "Cor;Flops;HPC",
-        "MetricName": "GFLOPs"
+        "MetricName": "GFLOPs",
+        "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine."
    },
    {
        "BriefDescription": "Average Frequency Utilization relative nominal frequency",
--- a/tools/perf/pmu-events/arch/x86/broadwell/cache.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/cache.json
@ -3407,4 +3407,4 @@
        "SampleAfterValue": "100003",
        "UMask": "0x10"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwell/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/floating-point.json
@ -190,4 +190,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x3"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwell/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/frontend.json
@ -292,4 +292,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x1"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwell/memory.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/memory.json
@ -3050,4 +3050,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x40"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwell/other.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/other.json
@ -41,4 +41,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x1"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwell/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/pipeline.json
@ -1377,4 +1377,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x1"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwell/uncore-cache.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/uncore-cache.json
@ -0,0 +1,152 @@
+[
+    {
+        "BriefDescription": "L3 Lookup any request that access cache and found line in E or S-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.ANY_ES",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup any request that access cache and found line in E or S-state.",
+        "UMask": "0x86",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "L3 Lookup any request that access cache and found line in I-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.ANY_I",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup any request that access cache and found line in I-state.",
+        "UMask": "0x88",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "L3 Lookup any request that access cache and found line in M-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.ANY_M",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup any request that access cache and found line in M-state.",
+        "UMask": "0x81",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "L3 Lookup any request that access cache and found line in MESI-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.ANY_MESI",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup any request that access cache and found line in MESI-state.",
+        "UMask": "0x8f",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "L3 Lookup read request that access cache and found line in E or S-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.READ_ES",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup read request that access cache and found line in E or S-state.",
+        "UMask": "0x16",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "L3 Lookup read request that access cache and found line in I-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.READ_I",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup read request that access cache and found line in I-state.",
+        "UMask": "0x18",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "L3 Lookup read request that access cache and found line in M-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.READ_M",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup read request that access cache and found line in M-state.",
+        "UMask": "0x11",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "L3 Lookup read request that access cache and found line in any MESI-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.READ_MESI",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup read request that access cache and found line in any MESI-state.",
+        "UMask": "0x1f",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "L3 Lookup write request that access cache and found line in E or S-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.WRITE_ES",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup write request that access cache and found line in E or S-state.",
+        "UMask": "0x26",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "L3 Lookup write request that access cache and found line in M-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.WRITE_M",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup write request that access cache and found line in M-state.",
+        "UMask": "0x21",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "L3 Lookup write request that access cache and found line in MESI-state",
+        "Counter": "0,1",
+        "EventCode": "0x34",
+        "EventName": "UNC_CBO_CACHE_LOOKUP.WRITE_MESI",
+        "PerPkg": "1",
+        "PublicDescription": "L3 Lookup write request that access cache and found line in MESI-state.",
+        "UMask": "0x2f",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which hits a modified line in some processor core.",
+        "Counter": "0,1",
+        "EventCode": "0x22",
+        "EventName": "UNC_CBO_XSNP_RESPONSE.HITM_XCORE",
+        "PerPkg": "1",
+        "PublicDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which hits a modified line in some processor core.",
+        "UMask": "0x48",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which hits a non-modified line in some processor core.",
+        "Counter": "0,1",
+        "EventCode": "0x22",
+        "EventName": "UNC_CBO_XSNP_RESPONSE.HIT_XCORE",
+        "PerPkg": "1",
+        "PublicDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which hits a non-modified line in some processor core.",
+        "UMask": "0x44",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "A cross-core snoop resulted from L3 Eviction which misses in some processor core.",
+        "Counter": "0,1",
+        "EventCode": "0x22",
+        "EventName": "UNC_CBO_XSNP_RESPONSE.MISS_EVICTION",
+        "PerPkg": "1",
+        "PublicDescription": "A cross-core snoop resulted from L3 Eviction which misses in some processor core.",
+        "UMask": "0x81",
+        "Unit": "CBO"
+    },
+    {
+        "BriefDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which misses in some processor core.",
+        "Counter": "0,1",
+        "EventCode": "0x22",
+        "EventName": "UNC_CBO_XSNP_RESPONSE.MISS_XCORE",
+        "PerPkg": "1",
+        "PublicDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which misses in some processor core.",
+        "UMask": "0x41",
+        "Unit": "CBO"
+    }
+]
--- a/tools/perf/pmu-events/arch/x86/broadwell/uncore-other.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/uncore-other.json
@ -0,0 +1,82 @@
+[
+    {
+        "BriefDescription": "Number of entries allocated. Account for Any type: e.g. Snoop, Core aperture, etc.",
+        "Counter": "0,1",
+        "EventCode": "0x84",
+        "EventName": "UNC_ARB_COH_TRK_REQUESTS.ALL",
+        "PerPkg": "1",
+        "PublicDescription": "Number of entries allocated. Account for Any type: e.g. Snoop, Core aperture, etc.",
+        "UMask": "0x01",
+        "Unit": "ARB"
+    },
+    {
+        "BriefDescription": "Each cycle count number of all Core outgoing valid entries. Such entry is defined as valid from it's allocation till first of IDI0 or DRS0 messages is sent out. Accounts for Coherent and non-coherent traffic.",
+        "Counter": "0,",
+        "EventCode": "0x80",
+        "EventName": "UNC_ARB_TRK_OCCUPANCY.ALL",
+        "PerPkg": "1",
+        "PublicDescription": "Each cycle count number of all Core outgoing valid entries. Such entry is defined as valid from it's allocation till first of IDI0 or DRS0 messages is sent out. Accounts for Coherent and non-coherent traffic.",
+        "UMask": "0x01",
+        "Unit": "ARB"
+    },
+    {
+        "BriefDescription": "Cycles with at least one request outstanding is waiting for data return from memory controller. Account for coherent and non-coherent requests initiated by IA Cores, Processor Graphics Unit, or LLC.;",
+        "Counter": "0,",
+        "CounterMask": "1",
+        "EventCode": "0x80",
+        "EventName": "UNC_ARB_TRK_OCCUPANCY.CYCLES_WITH_ANY_REQUEST",
+        "PerPkg": "1",
+        "PublicDescription": "Cycles with at least one request outstanding is waiting for data return from memory controller. Account for coherent and non-coherent requests initiated by IA Cores, Processor Graphics Unit, or LLC.",
+        "UMask": "0x01",
+        "Unit": "ARB"
+    },
+    {
+        "BriefDescription": "Each cycle count number of 'valid' coherent Data Read entries that are in DirectData mode. Such entry is defined as valid when it is allocated till data sent to Core (first chunk, IDI0). Applicable for IA Cores' requests in normal case.",
+        "Counter": "0,",
+        "EventCode": "0x80",
+        "EventName": "UNC_ARB_TRK_OCCUPANCY.DRD_DIRECT",
+        "PerPkg": "1",
+        "PublicDescription": "Each cycle count number of valid coherent Data Read entries that are in DirectData mode. Such entry is defined as valid when it is allocated till data sent to Core (first chunk, IDI0). Applicable for IA Cores' requests in normal case.",
+        "UMask": "0x02",
+        "Unit": "ARB"
+    },
+    {
+        "BriefDescription": "Total number of Core outgoing entries allocated. Accounts for Coherent and non-coherent traffic.",
+        "Counter": "0,1",
+        "EventCode": "0x81",
+        "EventName": "UNC_ARB_TRK_REQUESTS.ALL",
+        "PerPkg": "1",
+        "PublicDescription": "Total number of Core outgoing entries allocated. Accounts for Coherent and non-coherent traffic.",
+        "UMask": "0x01",
+        "Unit": "ARB"
+    },
+    {
+        "BriefDescription": "Number of Core coherent Data Read entries allocated in DirectData mode",
+        "Counter": "0,1",
+        "EventCode": "0x81",
+        "EventName": "UNC_ARB_TRK_REQUESTS.DRD_DIRECT",
+        "PerPkg": "1",
+        "PublicDescription": "Number of Core coherent Data Read entries allocated in DirectData mode.",
+        "UMask": "0x02",
+        "Unit": "ARB"
+    },
+    {
+        "BriefDescription": "Number of Writes allocated - any write transactions: full/partials writes and evictions.",
+        "Counter": "0,1",
+        "EventCode": "0x81",
+        "EventName": "UNC_ARB_TRK_REQUESTS.WRITES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of Writes allocated - any write transactions: full/partials writes and evictions.",
+        "UMask": "0x20",
+        "Unit": "ARB"
+    },
+    {
+        "BriefDescription": "This 48-bit fixed counter counts the UCLK cycles",
+        "Counter": "FIXED",
+        "EventCode": "0xff",
+        "EventName": "UNC_CLOCK.SOCKET",
+        "PerPkg": "1",
+        "PublicDescription": "This 48-bit fixed counter counts the UCLK cycles.",
+        "Unit": "CLOCK"
+    }
+]
--- a/tools/perf/pmu-events/arch/x86/broadwell/uncore.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/uncore.json
@ -1,278 +0,0 @@
-[
-  {
-    "Unit": "CBO",
-    "EventCode": "0x22",
-    "UMask": "0x41",
-    "EventName": "UNC_CBO_XSNP_RESPONSE.MISS_XCORE",
-    "BriefDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which misses in some processor core.",
-    "PublicDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which misses in some processor core.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x22",
-    "UMask": "0x81",
-    "EventName": "UNC_CBO_XSNP_RESPONSE.MISS_EVICTION",
-    "BriefDescription": "A cross-core snoop resulted from L3 Eviction which misses in some processor core.",
-    "PublicDescription": "A cross-core snoop resulted from L3 Eviction which misses in some processor core.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x22",
-    "UMask": "0x44",
-    "EventName": "UNC_CBO_XSNP_RESPONSE.HIT_XCORE",
-    "BriefDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which hits a non-modified line in some processor core.",
-    "PublicDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which hits a non-modified line in some processor core.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x22",
-    "UMask": "0x48",
-    "EventName": "UNC_CBO_XSNP_RESPONSE.HITM_XCORE",
-    "BriefDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which hits a modified line in some processor core.",
-    "PublicDescription": "A cross-core snoop initiated by this Cbox due to processor core memory request which hits a modified line in some processor core.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x11",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.READ_M",
-    "BriefDescription": "L3 Lookup read request that access cache and found line in M-state",
-    "PublicDescription": "L3 Lookup read request that access cache and found line in M-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x21",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.WRITE_M",
-    "BriefDescription": "L3 Lookup write request that access cache and found line in M-state",
-    "PublicDescription": "L3 Lookup write request that access cache and found line in M-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x81",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.ANY_M",
-    "BriefDescription": "L3 Lookup any request that access cache and found line in M-state",
-    "PublicDescription": "L3 Lookup any request that access cache and found line in M-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x18",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.READ_I",
-    "BriefDescription": "L3 Lookup read request that access cache and found line in I-state",
-    "PublicDescription": "L3 Lookup read request that access cache and found line in I-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x88",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.ANY_I",
-    "BriefDescription": "L3 Lookup any request that access cache and found line in I-state",
-    "PublicDescription": "L3 Lookup any request that access cache and found line in I-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x1f",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.READ_MESI",
-    "BriefDescription": "L3 Lookup read request that access cache and found line in any MESI-state",
-    "PublicDescription": "L3 Lookup read request that access cache and found line in any MESI-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x2f",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.WRITE_MESI",
-    "BriefDescription": "L3 Lookup write request that access cache and found line in MESI-state",
-    "PublicDescription": "L3 Lookup write request that access cache and found line in MESI-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x8f",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.ANY_MESI",
-    "BriefDescription": "L3 Lookup any request that access cache and found line in MESI-state",
-    "PublicDescription": "L3 Lookup any request that access cache and found line in MESI-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x86",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.ANY_ES",
-    "BriefDescription": "L3 Lookup any request that access cache and found line in E or S-state",
-    "PublicDescription": "L3 Lookup any request that access cache and found line in E or S-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x16",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.READ_ES",
-    "BriefDescription": "L3 Lookup read request that access cache and found line in E or S-state",
-    "PublicDescription": "L3 Lookup read request that access cache and found line in E or S-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "CBO",
-    "EventCode": "0x34",
-    "UMask": "0x26",
-    "EventName": "UNC_CBO_CACHE_LOOKUP.WRITE_ES",
-    "BriefDescription": "L3 Lookup write request that access cache and found line in E or S-state",
-    "PublicDescription": "L3 Lookup write request that access cache and found line in E or S-state.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "iMPH-U",
-    "EventCode": "0x80",
-    "UMask": "0x01",
-    "EventName": "UNC_ARB_TRK_OCCUPANCY.ALL",
-    "BriefDescription": "Each cycle count number of all Core outgoing valid entries. Such entry is defined as valid from it's allocation till first of IDI0 or DRS0 messages is sent out. Accounts for Coherent and non-coherent traffic.",
-    "PublicDescription": "Each cycle count number of all Core outgoing valid entries. Such entry is defined as valid from it's allocation till first of IDI0 or DRS0 messages is sent out. Accounts for Coherent and non-coherent traffic.",
-    "Counter": "0,",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "iMPH-U",
-    "EventCode": "0x80",
-    "UMask": "0x02",
-    "EventName": "UNC_ARB_TRK_OCCUPANCY.DRD_DIRECT",
-    "BriefDescription": "Each cycle count number of 'valid' coherent Data Read entries that are in DirectData mode. Such entry is defined as valid when it is allocated till data sent to Core (first chunk, IDI0). Applicable for IA Cores' requests in normal case.",
-    "PublicDescription": "Each cycle count number of 'valid' coherent Data Read entries that are in DirectData mode. Such entry is defined as valid when it is allocated till data sent to Core (first chunk, IDI0). Applicable for IA Cores' requests in normal case.",
-    "Counter": "0,",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "iMPH-U",
-    "EventCode": "0x81",
-    "UMask": "0x01",
-    "EventName": "UNC_ARB_TRK_REQUESTS.ALL",
-    "BriefDescription": "Total number of Core outgoing entries allocated. Accounts for Coherent and non-coherent traffic.",
-    "PublicDescription": "Total number of Core outgoing entries allocated. Accounts for Coherent and non-coherent traffic.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "iMPH-U",
-    "EventCode": "0x81",
-    "UMask": "0x02",
-    "EventName": "UNC_ARB_TRK_REQUESTS.DRD_DIRECT",
-    "BriefDescription": "Number of Core coherent Data Read entries allocated in DirectData mode",
-    "PublicDescription": "Number of Core coherent Data Read entries allocated in DirectData mode.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "iMPH-U",
-    "EventCode": "0x81",
-    "UMask": "0x20",
-    "EventName": "UNC_ARB_TRK_REQUESTS.WRITES",
-    "BriefDescription": "Number of Writes allocated - any write transactions: full/partials writes and evictions.",
-    "PublicDescription": "Number of Writes allocated - any write transactions: full/partials writes and evictions.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "iMPH-U",
-    "EventCode": "0x84",
-    "UMask": "0x01",
-    "EventName": "UNC_ARB_COH_TRK_REQUESTS.ALL",
-    "BriefDescription": "Number of entries allocated. Account for Any type: e.g. Snoop, Core aperture, etc.",
-    "PublicDescription": "Number of entries allocated. Account for Any type: e.g. Snoop, Core aperture, etc.",
-    "Counter": "0,1",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "iMPH-U",
-    "EventCode": "0x80",
-    "UMask": "0x01",
-    "EventName": "UNC_ARB_TRK_OCCUPANCY.CYCLES_WITH_ANY_REQUEST",
-    "BriefDescription": "Cycles with at least one request outstanding is waiting for data return from memory controller. Account for coherent and non-coherent requests initiated by IA Cores, Processor Graphics Unit, or LLC.;",
-    "PublicDescription": "Cycles with at least one request outstanding is waiting for data return from memory controller. Account for coherent and non-coherent requests initiated by IA Cores, Processor Graphics Unit, or LLC.",
-    "Counter": "0,",
-    "CounterMask": "1",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  },
-  {
-    "Unit": "NCU",
-    "EventCode": "0x0",
-    "UMask": "0x01",
-    "EventName": "UNC_CLOCK.SOCKET",
-    "BriefDescription": "This 48-bit fixed counter counts the UCLK cycles",
-    "PublicDescription": "This 48-bit fixed counter counts the UCLK cycles.",
-    "Counter": "FIXED",
-    "CounterMask": "0",
-    "Invert": "0",
-    "EdgeDetect": "0"
-  }
-]
--- a/tools/perf/pmu-events/arch/x86/broadwell/virtual-memory.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/virtual-memory.json
@ -385,4 +385,4 @@
        "SampleAfterValue": "100007",
        "UMask": "0x20"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json
@ -47,7 +47,7 @@
        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / (4 * CPU_CLK_UNHALTED.THREAD)",
        "MetricGroup": "TopdownL1",
        "MetricName": "Retiring",
-        "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided."
+        "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. "
    },
    {
        "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. SMT version; use when SMT is enabled and measuring per logical CPU.",
@ -130,43 +130,25 @@
        "MetricName": "FLOPc_SMT"
    },
    {
-        "BriefDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width)",
+        "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
        "MetricExpr": "( (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) ) / ( 2 * CPU_CLK_UNHALTED.THREAD )",
        "MetricGroup": "Cor;Flops;HPC",
        "MetricName": "FP_Arith_Utilization",
-        "PublicDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width). Values > 1 are possible due to Fused-Multiply Add (FMA) counting."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
    },
    {
-        "BriefDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width). SMT version; use when SMT is enabled and measuring per logical CPU.",
+        "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). SMT version; use when SMT is enabled and measuring per logical CPU.",
        "MetricExpr": "( (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) ) / ( 2 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ) )",
        "MetricGroup": "Cor;Flops;HPC_SMT",
        "MetricName": "FP_Arith_Utilization_SMT",
-        "PublicDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width). Values > 1 are possible due to Fused-Multiply Add (FMA) counting. SMT version; use when SMT is enabled and measuring per logical CPU."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common). SMT version; use when SMT is enabled and measuring per logical CPU."
    },
    {
-        "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is at least 1 uop executed)",
+        "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core",
        "MetricExpr": "UOPS_EXECUTED.THREAD / (( cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 ) if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)",
        "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
        "MetricName": "ILP"
    },
-    {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": "( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / (4 * CPU_CLK_UNHALTED.THREAD))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) ) * (4 * CPU_CLK_UNHALTED.THREAD) / BR_MISP_RETIRED.ALL_BRANCHES",
-        "MetricGroup": "Bad;BrMispredicts",
-        "MetricName": "Branch_Misprediction_Cost"
-    },
-    {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": "( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) ) / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) ) * (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )) / BR_MISP_RETIRED.ALL_BRANCHES",
-        "MetricGroup": "Bad;BrMispredicts_SMT",
-        "MetricName": "Branch_Misprediction_Cost_SMT"
-    },
-    {
-        "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear)",
-        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
-        "MetricGroup": "Bad;BadSpec;BrMispredicts",
-        "MetricName": "IpMispredict"
-    },
    {
        "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
        "MetricExpr": "( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )",
@ -204,7 +186,7 @@
        "MetricName": "IpTB"
    },
    {
-        "BriefDescription": "Branch instructions per taken branch.",
+        "BriefDescription": "Branch instructions per taken branch. ",
        "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
        "MetricGroup": "Branches;Fed;PGO",
        "MetricName": "BpTkBranch"
@ -256,18 +238,47 @@
        "MetricGroup": "Summary;TmaL1",
        "MetricName": "Instructions"
    },
+    {
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
+        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@",
+        "MetricGroup": "Pipeline;Ret",
+        "MetricName": "Retire"
+    },
+    {
+        "BriefDescription": "",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
+        "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
+        "MetricName": "Execute"
+    },
    {
        "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)",
-        "MetricExpr": "IDQ.DSB_UOPS / ( IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS )",
+        "MetricExpr": "IDQ.DSB_UOPS / (( IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS ) )",
        "MetricGroup": "DSB;Fed;FetchBW",
        "MetricName": "DSB_Coverage"
    },
    {
-        "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load instructions (in core cycles)",
+        "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts",
+        "MetricName": "IpMispredict"
+    },
+    {
+        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": " ( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / (4 * CPU_CLK_UNHALTED.THREAD))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) ) * (4 * CPU_CLK_UNHALTED.THREAD) / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "Branch_Misprediction_Cost"
+    },
+    {
+        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": " ( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) ) / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) ) * (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )) / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BrMispredicts_SMT",
+        "MetricName": "Branch_Misprediction_Cost_SMT"
+    },
+    {
+        "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)",
        "MetricExpr": "L1D_PEND_MISS.PENDING / ( MEM_LOAD_UOPS_RETIRED.L1_MISS + mem_load_uops_retired.hit_lfb )",
        "MetricGroup": "Mem;MemoryBound;MemoryLat",
-        "MetricName": "Load_Miss_Real_Latency",
-        "PublicDescription": "Actual Average Latency for L1 data-cache miss demand load instructions (in core cycles). Latency may be overestimated for multi-load instructions - e.g. repeat strings."
+        "MetricName": "Load_Miss_Real_Latency"
    },
    {
        "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)",
@ -275,24 +286,6 @@
        "MetricGroup": "Mem;MemoryBound;MemoryBW",
        "MetricName": "MLP"
    },
-    {
-        "BriefDescription": "Average data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1000000000 / duration_time",
-        "MetricGroup": "Mem;MemoryBW",
-        "MetricName": "L1D_Cache_Fill_BW"
-    },
-    {
-        "BriefDescription": "Average data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1000000000 / duration_time",
-        "MetricGroup": "Mem;MemoryBW",
-        "MetricName": "L2_Cache_Fill_BW"
-    },
-    {
-        "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1000000000 / duration_time",
-        "MetricGroup": "Mem;MemoryBW",
-        "MetricName": "L3_Cache_Fill_BW"
-    },
    {
        "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads",
        "MetricExpr": "1000 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY",
@ -306,13 +299,13 @@
        "MetricName": "L2MPKI"
    },
    {
-        "BriefDescription": "L2 cache misses per kilo instruction for all request types (including speculative)",
+        "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)",
        "MetricExpr": "1000 * L2_RQSTS.MISS / INST_RETIRED.ANY",
        "MetricGroup": "Mem;CacheMisses;Offcore",
        "MetricName": "L2MPKI_All"
    },
    {
-        "BriefDescription": "L2 cache misses per kilo instruction for all demand loads  (including speculative)",
+        "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads  (including speculative)",
        "MetricExpr": "1000 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY",
        "MetricGroup": "Mem;CacheMisses",
        "MetricName": "L2MPKI_Load"
@ -348,6 +341,48 @@
        "MetricGroup": "Mem;MemoryTLB_SMT",
        "MetricName": "Page_Walks_Utilization_SMT"
    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1000000000 / duration_time",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L1D_Cache_Fill_BW"
+    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1000000000 / duration_time",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L2_Cache_Fill_BW"
+    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1000000000 / duration_time",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L3_Cache_Fill_BW"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
+        "MetricExpr": "(64 * L1D.REPLACEMENT / 1000000000 / duration_time)",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L1D_Cache_Fill_BW_1T"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
+        "MetricExpr": "(64 * L2_LINES_IN.ALL / 1000000000 / duration_time)",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L2_Cache_Fill_BW_1T"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "(64 * LONGEST_LAT_CACHE.MISS / 1000000000 / duration_time)",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L3_Cache_Fill_BW_1T"
+    },
+    {
+        "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "0",
+        "MetricGroup": "Mem;MemoryBW;Offcore",
+        "MetricName": "L3_Cache_Access_BW_1T"
+    },
    {
        "BriefDescription": "Average CPU Utilization",
        "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / msr@tsc@",
@ -364,7 +399,8 @@
        "BriefDescription": "Giga Floating Point Operations Per Second",
        "MetricExpr": "( ( 1 * ( FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE ) + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * ( FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE ) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE ) / 1000000000 ) / duration_time",
        "MetricGroup": "Cor;Flops;HPC",
-        "MetricName": "GFLOPs"
+        "MetricName": "GFLOPs",
+        "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine."
    },
    {
        "BriefDescription": "Average Frequency Utilization relative nominal frequency",
--- a/tools/perf/pmu-events/arch/x86/broadwellde/cache.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/cache.json
@ -806,4 +806,4 @@
        "SampleAfterValue": "100003",
        "UMask": "0x10"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellde/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/floating-point.json
@ -190,4 +190,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x3"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellde/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/frontend.json
@ -292,4 +292,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x1"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellde/memory.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/memory.json
@ -429,4 +429,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x40"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellde/other.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/other.json
@ -41,4 +41,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x1"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellde/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/pipeline.json
@ -1378,4 +1378,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x1"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-cache.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-cache.json
--- a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-memory.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-memory.json
--- a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-other.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-other.json
--- a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-power.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-power.json
@ -1,91 +1,511 @@
 [
    {
-        "BriefDescription": "PCU clock ticks. Use to get percentages of PCU cycles events",
+        "BriefDescription": "pclk Cycles",
        "Counter": "0,1,2,3",
        "EventName": "UNC_P_CLOCKTICKS",
        "PerPkg": "1",
+        "PublicDescription": "The PCU runs off a fixed 1 GHz clock.  This event counts the number of pclk cycles measured while the counter was enabled.  The pclk, like the Memory Controller's dclk, counts at a constant rate making it a good measure of actual wall time.",
        "Unit": "PCU"
    },
    {
-        "BriefDescription": "This is an occupancy event that tracks the number of cores that are in C0.  It can be used by itself to get the average number of cores in C0, with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details",
+        "BriefDescription": "Core C State Transition Cycles",
        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C0",
-        "Filter": "occ_sel=1",
-        "MetricExpr": "(UNC_P_POWER_STATE_OCCUPANCY.CORES_C0 / UNC_P_CLOCKTICKS) * 100.",
-        "MetricName": "power_state_occupancy.cores_c0 %",
+        "EventCode": "0x60",
+        "EventName": "UNC_P_CORE0_TRANSITION_CYCLES",
        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
        "Unit": "PCU"
    },
    {
-        "BriefDescription": "This is an occupancy event that tracks the number of cores that are in C3.  It can be used by itself to get the average number of cores in C0, with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details",
+        "BriefDescription": "Core C State Transition Cycles",
        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C3",
-        "Filter": "occ_sel=2",
-        "MetricExpr": "(UNC_P_POWER_STATE_OCCUPANCY.CORES_C3 / UNC_P_CLOCKTICKS) * 100.",
-        "MetricName": "power_state_occupancy.cores_c3 %",
+        "EventCode": "0x6A",
+        "EventName": "UNC_P_CORE10_TRANSITION_CYCLES",
        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
        "Unit": "PCU"
    },
    {
-        "BriefDescription": "This is an occupancy event that tracks the number of cores that are in C6.  It can be used by itself to get the average number of cores in C0, with threshholding to generate histograms, or with other PCU events ",
+        "BriefDescription": "Core C State Transition Cycles",
        "Counter": "0,1,2,3",
-        "EventCode": "0x80",
-        "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C6",
-        "Filter": "occ_sel=3",
-        "MetricExpr": "(UNC_P_POWER_STATE_OCCUPANCY.CORES_C6 / UNC_P_CLOCKTICKS) * 100.",
-        "MetricName": "power_state_occupancy.cores_c6 %",
+        "EventCode": "0x6B",
+        "EventName": "UNC_P_CORE11_TRANSITION_CYCLES",
        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
        "Unit": "PCU"
    },
    {
-        "BriefDescription": "Counts the number of cycles that we are in external PROCHOT mode.  This mode is triggered when a sensor off the die determines that something off-die (like DRAM) is too hot and must throttle to avoid damaging the chip",
+        "BriefDescription": "Core C State Transition Cycles",
        "Counter": "0,1,2,3",
-        "EventCode": "0xA",
-        "EventName": "UNC_P_PROCHOT_EXTERNAL_CYCLES",
-        "MetricExpr": "(UNC_P_PROCHOT_EXTERNAL_CYCLES / UNC_P_CLOCKTICKS) * 100.",
-        "MetricName": "prochot_external_cycles %",
+        "EventCode": "0x6C",
+        "EventName": "UNC_P_CORE12_TRANSITION_CYCLES",
        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
        "Unit": "PCU"
    },
    {
-        "BriefDescription": "Counts the number of cycles when temperature is the upper limit on frequency",
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x6D",
+        "EventName": "UNC_P_CORE13_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x6E",
+        "EventName": "UNC_P_CORE14_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x6F",
+        "EventName": "UNC_P_CORE15_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x70",
+        "EventName": "UNC_P_CORE16_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x71",
+        "EventName": "UNC_P_CORE17_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x61",
+        "EventName": "UNC_P_CORE1_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x62",
+        "EventName": "UNC_P_CORE2_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x63",
+        "EventName": "UNC_P_CORE3_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x64",
+        "EventName": "UNC_P_CORE4_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x65",
+        "EventName": "UNC_P_CORE5_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x66",
+        "EventName": "UNC_P_CORE6_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x67",
+        "EventName": "UNC_P_CORE7_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x68",
+        "EventName": "UNC_P_CORE8_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x69",
+        "EventName": "UNC_P_CORE9_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions.  There is one event per core.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x30",
+        "EventName": "UNC_P_DEMOTIONS_CORE0",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x31",
+        "EventName": "UNC_P_DEMOTIONS_CORE1",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x3A",
+        "EventName": "UNC_P_DEMOTIONS_CORE10",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x3B",
+        "EventName": "UNC_P_DEMOTIONS_CORE11",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x3C",
+        "EventName": "UNC_P_DEMOTIONS_CORE12",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x3D",
+        "EventName": "UNC_P_DEMOTIONS_CORE13",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x3E",
+        "EventName": "UNC_P_DEMOTIONS_CORE14",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x3F",
+        "EventName": "UNC_P_DEMOTIONS_CORE15",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x40",
+        "EventName": "UNC_P_DEMOTIONS_CORE16",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x41",
+        "EventName": "UNC_P_DEMOTIONS_CORE17",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x32",
+        "EventName": "UNC_P_DEMOTIONS_CORE2",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x33",
+        "EventName": "UNC_P_DEMOTIONS_CORE3",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x34",
+        "EventName": "UNC_P_DEMOTIONS_CORE4",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x35",
+        "EventName": "UNC_P_DEMOTIONS_CORE5",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x36",
+        "EventName": "UNC_P_DEMOTIONS_CORE6",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x37",
+        "EventName": "UNC_P_DEMOTIONS_CORE7",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x38",
+        "EventName": "UNC_P_DEMOTIONS_CORE8",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Core C State Demotions",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x39",
+        "EventName": "UNC_P_DEMOTIONS_CORE9",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of times when a configurable cores had a C-state demotion",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Thermal Strongest Upper Limit Cycles",
        "Counter": "0,1,2,3",
        "EventCode": "0x4",
        "EventName": "UNC_P_FREQ_MAX_LIMIT_THERMAL_CYCLES",
-        "MetricExpr": "(UNC_P_FREQ_MAX_LIMIT_THERMAL_CYCLES / UNC_P_CLOCKTICKS) * 100.",
-        "MetricName": "freq_max_limit_thermal_cycles %",
        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when thermal conditions are the upper limit on frequency.  This is related to the THERMAL_THROTTLE CYCLES_ABOVE_TEMP event, which always counts cycles when we are above the thermal temperature.  This event (STRONGEST_UPPER_LIMIT) is sampled at the output of the algorithm that determines the actual frequency, while THERMAL_THROTTLE looks at the input.",
        "Unit": "PCU"
    },
    {
-        "BriefDescription": "Counts the number of cycles when the OS is the upper limit on frequency",
+        "BriefDescription": "OS Strongest Upper Limit Cycles",
        "Counter": "0,1,2,3",
        "EventCode": "0x6",
        "EventName": "UNC_P_FREQ_MAX_OS_CYCLES",
-        "MetricExpr": "(UNC_P_FREQ_MAX_OS_CYCLES / UNC_P_CLOCKTICKS) * 100.",
-        "MetricName": "freq_max_os_cycles %",
        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when the OS is the upper limit on frequency.",
        "Unit": "PCU"
    },
    {
-        "BriefDescription": "Counts the number of cycles when power is the upper limit on frequency",
+        "BriefDescription": "Power Strongest Upper Limit Cycles",
        "Counter": "0,1,2,3",
        "EventCode": "0x5",
        "EventName": "UNC_P_FREQ_MAX_POWER_CYCLES",
-        "MetricExpr": "(UNC_P_FREQ_MAX_POWER_CYCLES / UNC_P_CLOCKTICKS) * 100.",
-        "MetricName": "freq_max_power_cycles %",
        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when power is the upper limit on frequency.",
        "Unit": "PCU"
    },
    {
-        "BriefDescription": "Counts the number of cycles when current is the upper limit on frequency",
+        "BriefDescription": "IO P Limit Strongest Lower Limit Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x73",
+        "EventName": "UNC_P_FREQ_MIN_IO_P_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when IO P Limit is preventing us from dropping the frequency lower.  This algorithm monitors the needs to the IO subsystem on both local and remote sockets and will maintain a frequency high enough to maintain good IO BW.  This is necessary for when all the IA cores on a socket are idle but a user still would like to maintain high IO Bandwidth.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Cycles spent changing Frequency",
        "Counter": "0,1,2,3",
        "EventCode": "0x74",
        "EventName": "UNC_P_FREQ_TRANS_CYCLES",
-        "MetricExpr": "(UNC_P_FREQ_TRANS_CYCLES / UNC_P_CLOCKTICKS) * 100.",
-        "MetricName": "freq_trans_cycles %",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when the system is changing frequency.  This can not be filtered by thread ID.  One can also use it with the occupancy counter that monitors number of threads in C0 to estimate the performance impact that frequency transitions had on the system.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Memory Phase Shedding Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2F",
+        "EventName": "UNC_P_MEMORY_PHASE_SHEDDING_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles that the PCU has triggered memory phase shedding.  This is a mode that can be run in the iMC physicals that saves power at the expense of additional latency.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Package C State Residency - C0",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2A",
+        "EventName": "UNC_P_PKG_RESIDENCY_C0_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when the package was in C0.  This event can be used in conjunction with edge detect to count C0 entrances (or exits using invert).  Residency events do not include transition times.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Package C State Residency - C1E",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x4E",
+        "EventName": "UNC_P_PKG_RESIDENCY_C1E_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when the package was in C1E.  This event can be used in conjunction with edge detect to count C1E entrances (or exits using invert).  Residency events do not include transition times.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Package C State Residency - C2E",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2B",
+        "EventName": "UNC_P_PKG_RESIDENCY_C2E_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when the package was in C2E.  This event can be used in conjunction with edge detect to count C2E entrances (or exits using invert).  Residency events do not include transition times.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Package C State Residency - C3",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2C",
+        "EventName": "UNC_P_PKG_RESIDENCY_C3_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when the package was in C3.  This event can be used in conjunction with edge detect to count C3 entrances (or exits using invert).  Residency events do not include transition times.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Package C State Residency - C6",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2D",
+        "EventName": "UNC_P_PKG_RESIDENCY_C6_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when the package was in C6.  This event can be used in conjunction with edge detect to count C6 entrances (or exits using invert).  Residency events do not include transition times.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Package C7 State Residency",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x2E",
+        "EventName": "UNC_P_PKG_RESIDENCY_C7_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles when the package was in C7.  This event can be used in conjunction with edge detect to count C7 entrances (or exits using invert).  Residency events do not include transition times.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Number of cores in C-State; C0 and C1",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C0",
+        "PerPkg": "1",
+        "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State.  It can be used by itself to get the average number of cores in that C-state with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Number of cores in C-State; C3",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C3",
+        "PerPkg": "1",
+        "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State.  It can be used by itself to get the average number of cores in that C-state with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Number of cores in C-State; C6 and C7",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x80",
+        "EventName": "UNC_P_POWER_STATE_OCCUPANCY.CORES_C6",
+        "PerPkg": "1",
+        "PublicDescription": "This is an occupancy event that tracks the number of cores that are in the chosen C-State.  It can be used by itself to get the average number of cores in that C-state with threshholding to generate histograms, or with other PCU events and occupancy triggering to capture other details.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "External Prochot",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xA",
+        "EventName": "UNC_P_PROCHOT_EXTERNAL_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles that we are in external PROCHOT mode.  This mode is triggered when a sensor off the die determines that something off-die (like DRAM) is too hot and must throttle to avoid damaging the chip.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Internal Prochot",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x9",
+        "EventName": "UNC_P_PROCHOT_INTERNAL_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Counts the number of cycles that we are in Interal PROCHOT mode.  This mode is triggered when a sensor on the die determines that we are too hot and must throttle to avoid damaging the chip.",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "Total Core C State Transition Cycles",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x72",
+        "EventName": "UNC_P_TOTAL_TRANSITION_CYCLES",
+        "PerPkg": "1",
+        "PublicDescription": "Number of cycles spent performing core C state transitions across all cores.",
+        "Unit": "PCU"
+    },
+    {
+        "Counter": "0,1,2,3",
+        "EventCode": "0x79",
+        "EventName": "UNC_P_UFS_TRANSITIONS_RING_GV",
+        "PerPkg": "1",
+        "PublicDescription": "Ring GV with same final and initial frequency",
+        "Unit": "PCU"
+    },
+    {
+        "BriefDescription": "VR Hot",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x42",
+        "EventName": "UNC_P_VR_HOT_CYCLES",
        "PerPkg": "1",
        "Unit": "PCU"
    }
--- a/tools/perf/pmu-events/arch/x86/broadwellde/virtual-memory.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/virtual-memory.json
@ -385,4 +385,4 @@
        "SampleAfterValue": "100007",
        "UMask": "0x20"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json
@ -74,12 +74,6 @@
        "MetricGroup": "Branches;Fed;FetchBW",
        "MetricName": "UpTB"
    },
-    {
-        "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
-        "MetricExpr": "1 / (INST_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD)",
-        "MetricGroup": "Pipeline;Mem",
-        "MetricName": "CPI"
-    },
    {
        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
        "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
@ -130,43 +124,25 @@
        "MetricName": "FLOPc_SMT"
    },
    {
-        "BriefDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width)",
+        "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
        "MetricExpr": "( (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) ) / ( 2 * CPU_CLK_UNHALTED.THREAD )",
        "MetricGroup": "Cor;Flops;HPC",
        "MetricName": "FP_Arith_Utilization",
-        "PublicDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width). Values > 1 are possible due to Fused-Multiply Add (FMA) counting."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
    },
    {
-        "BriefDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width). SMT version; use when SMT is enabled and measuring per logical CPU.",
+        "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). SMT version; use when SMT is enabled and measuring per logical CPU.",
        "MetricExpr": "( (FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE) + (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) ) / ( 2 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ) )",
        "MetricGroup": "Cor;Flops;HPC_SMT",
        "MetricName": "FP_Arith_Utilization_SMT",
-        "PublicDescription": "Actual per-core usage of the Floating Point execution units (regardless of the vector width). Values > 1 are possible due to Fused-Multiply Add (FMA) counting. SMT version; use when SMT is enabled and measuring per logical CPU."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common). SMT version; use when SMT is enabled and measuring per logical CPU."
    },
    {
-        "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is at least 1 uop executed)",
+        "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per-core",
        "MetricExpr": "UOPS_EXECUTED.THREAD / (( cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 ) if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)",
        "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
        "MetricName": "ILP"
    },
-    {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": " ( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / (4 * CPU_CLK_UNHALTED.THREAD))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) ) * (4 * CPU_CLK_UNHALTED.THREAD) / BR_MISP_RETIRED.ALL_BRANCHES",
-        "MetricGroup": "Bad;BrMispredicts",
-        "MetricName": "Branch_Misprediction_Cost"
-    },
-    {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": " ( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) ) / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) ) * (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )) / BR_MISP_RETIRED.ALL_BRANCHES",
-        "MetricGroup": "Bad;BrMispredicts_SMT",
-        "MetricName": "Branch_Misprediction_Cost_SMT"
-    },
-    {
-        "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear)",
-        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
-        "MetricGroup": "Bad;BadSpec;BrMispredicts",
-        "MetricName": "IpMispredict"
-    },
    {
        "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
        "MetricExpr": "( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )",
@ -256,6 +232,18 @@
        "MetricGroup": "Summary;TmaL1",
        "MetricName": "Instructions"
    },
+    {
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
+        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@",
+        "MetricGroup": "Pipeline;Ret",
+        "MetricName": "Retire"
+    },
+    {
+        "BriefDescription": "",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
+        "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
+        "MetricName": "Execute"
+    },
    {
        "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)",
        "MetricExpr": "IDQ.DSB_UOPS / (( IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS ) )",
@ -263,11 +251,28 @@
        "MetricName": "DSB_Coverage"
    },
    {
-        "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load instructions (in core cycles)",
+        "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
+        "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts",
+        "MetricName": "IpMispredict"
+    },
+    {
+        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": " ( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / (4 * CPU_CLK_UNHALTED.THREAD))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * CPU_CLK_UNHALTED.THREAD)) ) * (4 * CPU_CLK_UNHALTED.THREAD) / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BrMispredicts",
+        "MetricName": "Branch_Misprediction_Cost"
+    },
+    {
+        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": " ( ((BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT )) * (( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) ) / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )))) + (4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) * (BR_MISP_RETIRED.ALL_BRANCHES * (12 * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / CPU_CLK_UNHALTED.THREAD) / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY )) / #(4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) ))) ) * (4 * ( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )) / BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;BrMispredicts_SMT",
+        "MetricName": "Branch_Misprediction_Cost_SMT"
+    },
+    {
+        "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)",
        "MetricExpr": "L1D_PEND_MISS.PENDING / ( MEM_LOAD_UOPS_RETIRED.L1_MISS + mem_load_uops_retired.hit_lfb )",
        "MetricGroup": "Mem;MemoryBound;MemoryLat",
-        "MetricName": "Load_Miss_Real_Latency",
-        "PublicDescription": "Actual Average Latency for L1 data-cache miss demand load instructions (in core cycles). Latency may be overestimated for multi-load instructions - e.g. repeat strings."
+        "MetricName": "Load_Miss_Real_Latency"
    },
    {
        "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)",
@ -275,24 +280,6 @@
        "MetricGroup": "Mem;MemoryBound;MemoryBW",
        "MetricName": "MLP"
    },
-    {
-        "BriefDescription": "Average data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1000000000 / duration_time",
-        "MetricGroup": "Mem;MemoryBW",
-        "MetricName": "L1D_Cache_Fill_BW"
-    },
-    {
-        "BriefDescription": "Average data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1000000000 / duration_time",
-        "MetricGroup": "Mem;MemoryBW",
-        "MetricName": "L2_Cache_Fill_BW"
-    },
-    {
-        "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1000000000 / duration_time",
-        "MetricGroup": "Mem;MemoryBW",
-        "MetricName": "L3_Cache_Fill_BW"
-    },
    {
        "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads",
        "MetricExpr": "1000 * MEM_LOAD_UOPS_RETIRED.L1_MISS / INST_RETIRED.ANY",
@ -306,13 +293,13 @@
        "MetricName": "L2MPKI"
    },
    {
-        "BriefDescription": "L2 cache misses per kilo instruction for all request types (including speculative)",
+        "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)",
        "MetricExpr": "1000 * L2_RQSTS.MISS / INST_RETIRED.ANY",
        "MetricGroup": "Mem;CacheMisses;Offcore",
        "MetricName": "L2MPKI_All"
    },
    {
-        "BriefDescription": "L2 cache misses per kilo instruction for all demand loads  (including speculative)",
+        "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads  (including speculative)",
        "MetricExpr": "1000 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY",
        "MetricGroup": "Mem;CacheMisses",
        "MetricName": "L2MPKI_Load"
@ -348,6 +335,48 @@
        "MetricGroup": "Mem;MemoryTLB_SMT",
        "MetricName": "Page_Walks_Utilization_SMT"
    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1000000000 / duration_time",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L1D_Cache_Fill_BW"
+    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1000000000 / duration_time",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L2_Cache_Fill_BW"
+    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1000000000 / duration_time",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L3_Cache_Fill_BW"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
+        "MetricExpr": "(64 * L1D.REPLACEMENT / 1000000000 / duration_time)",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L1D_Cache_Fill_BW_1T"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
+        "MetricExpr": "(64 * L2_LINES_IN.ALL / 1000000000 / duration_time)",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L2_Cache_Fill_BW_1T"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "(64 * LONGEST_LAT_CACHE.MISS / 1000000000 / duration_time)",
+        "MetricGroup": "Mem;MemoryBW",
+        "MetricName": "L3_Cache_Fill_BW_1T"
+    },
+    {
+        "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "0",
+        "MetricGroup": "Mem;MemoryBW;Offcore",
+        "MetricName": "L3_Cache_Access_BW_1T"
+    },
    {
        "BriefDescription": "Average CPU Utilization",
        "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / msr@tsc@",
@ -364,7 +393,8 @@
        "BriefDescription": "Giga Floating Point Operations Per Second",
        "MetricExpr": "( ( 1 * ( FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE ) + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * ( FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE ) + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE ) / 1000000000 ) / duration_time",
        "MetricGroup": "Cor;Flops;HPC",
-        "MetricName": "GFLOPs"
+        "MetricName": "GFLOPs",
+        "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width and AMX engine."
    },
    {
        "BriefDescription": "Average Frequency Utilization relative nominal frequency",
@ -461,5 +491,439 @@
        "MetricExpr": "(cstate_pkg@c7\\-residency@ / msr@tsc@) * 100",
        "MetricGroup": "Power",
        "MetricName": "C7_Pkg_Residency"
+    },
+    {
+        "BriefDescription": "CPU operating frequency (in GHz)",
+        "MetricExpr": "( CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC * #SYSTEM_TSC_FREQ ) / 1000000000",
+        "MetricGroup": "",
+        "MetricName": "cpu_operating_frequency",
+        "ScaleUnit": "1GHz"
+    },
+    {
+        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "cpi",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "The ratio of number of completed memory load instructions to the total number completed instructions",
+        "MetricExpr": "MEM_UOPS_RETIRED.ALL_LOADS / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "loads_per_instr",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "The ratio of number of completed memory store instructions to the total number completed instructions",
+        "MetricExpr": "MEM_UOPS_RETIRED.ALL_STORES / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "stores_per_instr",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of requests missing L1 data cache (includes data+rfo w/ prefetches) to the total number of completed instructions",
+        "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "l1d_mpi_includes_data_plus_rfo_with_prefetches",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of demand load requests hitting in L1 data cache to the total number of completed instructions",
+        "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L1_HIT / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "l1d_demand_data_read_hits_per_instr",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of code read requests missing in L1 instruction cache (includes prefetches) to the total number of completed instructions",
+        "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of completed demand load requests hitting in L2 cache to the total number of completed instructions",
+        "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L2_HIT / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "l2_demand_data_read_hits_per_instr",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of requests missing L2 cache (includes code+data+rfo w/ prefetches) to the total number of completed instructions",
+        "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "l2_mpi_includes_code_plus_data_plus_rfo_with_prefetches",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of completed data read request missing L2 cache to the total number of completed instructions",
+        "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L2_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "l2_demand_data_read_mpi",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of code read request missing L2 cache to the total number of completed instructions",
+        "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "l2_demand_code_mpi",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of data read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions",
+        "MetricExpr": "( cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x192@ ) / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "llc_data_read_mpi_demand_plus_prefetch",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of code read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions",
+        "MetricExpr": "( cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x181@ + cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x191@ ) / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "llc_code_read_mpi_demand_plus_prefetch",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) in nano seconds",
+        "MetricExpr": "( 1000000000 * ( cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ ) / ( UNC_C_CLOCKTICKS / ( source_count(UNC_C_CLOCKTICKS) * #num_packages ) ) ) * duration_time",
+        "MetricGroup": "",
+        "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency",
+        "ScaleUnit": "1ns"
+    },
+    {
+        "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) addressed to local memory in nano seconds",
+        "MetricExpr": "( 1000000000 * ( cbox@UNC_C_TOR_OCCUPANCY.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ ) / ( UNC_C_CLOCKTICKS / ( source_count(UNC_C_CLOCKTICKS) * #num_packages ) ) ) * duration_time",
+        "MetricGroup": "",
+        "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for_local_requests",
+        "ScaleUnit": "1ns"
+    },
+    {
+        "BriefDescription": "Average latency of a last level cache (LLC) demand and prefetch data read miss (read memory access) addressed to remote memory in nano seconds",
+        "MetricExpr": "( 1000000000 * ( cbox@UNC_C_TOR_OCCUPANCY.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ ) / ( UNC_C_CLOCKTICKS / ( source_count(UNC_C_CLOCKTICKS) * #num_packages ) ) ) * duration_time",
+        "MetricGroup": "",
+        "MetricName": "llc_data_read_demand_plus_prefetch_miss_latency_for_remote_requests",
+        "ScaleUnit": "1ns"
+    },
+    {
+        "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.",
+        "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "itlb_mpi",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.",
+        "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "itlb_large_page_mpi",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.",
+        "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "dtlb_load_mpi",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "",
+        "MetricName": "dtlb_store_mpi",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "MetricExpr": "100 * cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ / ( cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ )",
+        "MetricGroup": "",
+        "MetricName": "numa_percent_reads_addressed_to_local_dram",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "MetricExpr": "100 * cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ / ( cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@ )",
+        "MetricGroup": "",
+        "MetricName": "numa_percent_reads_addressed_to_remote_dram",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "Uncore operating frequency in GHz",
+        "MetricExpr": "UNC_C_CLOCKTICKS / ( source_count(UNC_C_CLOCKTICKS) * #num_packages ) / 1000000000",
+        "MetricGroup": "",
+        "MetricName": "uncore_frequency",
+        "ScaleUnit": "1GHz"
+    },
+    {
+        "BriefDescription": "Intel(R) Quick Path Interconnect (QPI) data transmit bandwidth (MB/sec)",
+        "MetricExpr": "( UNC_Q_TxL_FLITS_G0.DATA * 8 / 1000000) / duration_time",
+        "MetricGroup": "",
+        "MetricName": "qpi_data_transmit_bw_only_data",
+        "ScaleUnit": "1MB/s"
+    },
+    {
+        "BriefDescription": "DDR memory read bandwidth (MB/sec)",
+        "MetricExpr": "( UNC_M_CAS_COUNT.RD * 64 / 1000000) / duration_time",
+        "MetricGroup": "",
+        "MetricName": "memory_bandwidth_read",
+        "ScaleUnit": "1MB/s"
+    },
+    {
+        "BriefDescription": "DDR memory write bandwidth (MB/sec)",
+        "MetricExpr": "( UNC_M_CAS_COUNT.WR * 64 / 1000000) / duration_time",
+        "MetricGroup": "",
+        "MetricName": "memory_bandwidth_write",
+        "ScaleUnit": "1MB/s"
+    },
+    {
+        "BriefDescription": "DDR memory bandwidth (MB/sec)",
+        "MetricExpr": "(( UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR ) * 64 / 1000000) / duration_time",
+        "MetricGroup": "",
+        "MetricName": "memory_bandwidth_total",
+        "ScaleUnit": "1MB/s"
+    },
+    {
+        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.",
+        "MetricExpr": "( cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x19e@ * 64 / 1000000) / duration_time",
+        "MetricGroup": "",
+        "MetricName": "io_bandwidth_read",
+        "ScaleUnit": "1MB/s"
+    },
+    {
+        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.",
+        "MetricExpr": "(( cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x1c8\\,filter_tid\\=0x3e@ + cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x180\\,filter_tid\\=0x3e@ ) * 64 / 1000000) / duration_time",
+        "MetricGroup": "",
+        "MetricName": "io_bandwidth_write",
+        "ScaleUnit": "1MB/s"
+    },
+    {
+        "BriefDescription": "Uops delivered from decoded instruction cache (decoded stream buffer or DSB) as a percent of total uops delivered to Instruction Decode Queue",
+        "MetricExpr": "100 * ( IDQ.DSB_UOPS / UOPS_ISSUED.ANY )",
+        "MetricGroup": "",
+        "MetricName": "percent_uops_delivered_from_decoded_icache_dsb",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "Uops delivered from legacy decode pipeline (Micro-instruction Translation Engine or MITE) as a percent of total uops delivered to Instruction Decode Queue",
+        "MetricExpr": "100 * ( IDQ.MITE_UOPS / UOPS_ISSUED.ANY )",
+        "MetricGroup": "",
+        "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline_mite",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "Uops delivered from microcode sequencer (MS) as a percent of total uops delivered to Instruction Decode Queue",
+        "MetricExpr": "100 * ( IDQ.MS_UOPS / UOPS_ISSUED.ANY )",
+        "MetricGroup": "",
+        "MetricName": "percent_uops_delivered_from_microcode_sequencer_ms",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "Uops delivered from loop stream detector(LSD) as a percent of total uops delivered to Instruction Decode Queue",
+        "MetricExpr": "100 * ( LSD.UOPS / UOPS_ISSUED.ANY )",
+        "MetricGroup": "",
+        "MetricName": "percent_uops_delivered_from_loop_stream_detector_lsd",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Machine_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.",
+        "MetricExpr": "100 * ( IDQ_UOPS_NOT_DELIVERED.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) )",
+        "MetricGroup": "TmaL1;PGO",
+        "MetricName": "tma_frontend_bound_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period.",
+        "MetricExpr": "100 * ( ( 4 ) * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) )",
+        "MetricGroup": "Frontend;TmaL2;m_tma_frontend_bound_percent",
+        "MetricName": "tma_fetch_latency_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.",
+        "MetricExpr": "100 * ( ICACHE.IFDATA_STALL / ( CPU_CLK_UNHALTED.THREAD ) )",
+        "MetricGroup": "BigFoot;FetchLat;IcMiss;TmaL3;m_tma_fetch_latency_percent",
+        "MetricName": "tma_icache_misses_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses.",
+        "MetricExpr": "100 * ( ( 14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=0x1@ + 7 * ITLB_MISSES.WALK_COMPLETED ) / ( CPU_CLK_UNHALTED.THREAD ) )",
+        "MetricGroup": "BigFoot;FetchLat;MemoryTLB;TmaL3;m_tma_fetch_latency_percent",
+        "MetricName": "tma_itlb_misses_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings.",
+        "MetricExpr": "100 * ( ( 12 ) * ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY ) / ( CPU_CLK_UNHALTED.THREAD ) )",
+        "MetricGroup": "FetchLat;TmaL3;m_tma_fetch_latency_percent",
+        "MetricName": "tma_branch_resteers_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty.",
+        "MetricExpr": "100 * ( DSB2MITE_SWITCHES.PENALTY_CYCLES / ( CPU_CLK_UNHALTED.THREAD ) )",
+        "MetricGroup": "DSBmiss;FetchLat;TmaL3;m_tma_fetch_latency_percent",
+        "MetricName": "tma_dsb_switches_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs.",
+        "MetricExpr": "100 * ( ILD_STALL.LCP / ( CPU_CLK_UNHALTED.THREAD ) )",
+        "MetricGroup": "FetchLat;TmaL3;m_tma_fetch_latency_percent",
+        "MetricName": "tma_lcp_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals.",
+        "MetricExpr": "100 * ( ( 2 ) * IDQ.MS_SWITCHES / ( CPU_CLK_UNHALTED.THREAD ) )",
+        "MetricGroup": "FetchLat;MicroSeq;TmaL3;m_tma_fetch_latency_percent",
+        "MetricName": "tma_ms_switches_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend.",
+        "MetricExpr": "100 * ( ( IDQ_UOPS_NOT_DELIVERED.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) - ( ( 4 ) * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) )",
+        "MetricGroup": "FetchBW;Frontend;TmaL2;m_tma_frontend_bound_percent",
+        "MetricName": "tma_fetch_bandwidth_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.",
+        "MetricExpr": "100 * ( ( IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS ) / ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) / 2 )",
+        "MetricGroup": "DSBmiss;FetchBW;TmaL3;m_tma_fetch_bandwidth_percent",
+        "MetricName": "tma_mite_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline.  For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "MetricExpr": "100 * ( ( IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS ) / ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) / 2 )",
+        "MetricGroup": "DSB;FetchBW;TmaL3;m_tma_fetch_bandwidth_percent",
+        "MetricName": "tma_dsb_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "MetricExpr": "100 * ( ( UOPS_ISSUED.ANY - ( UOPS_RETIRED.RETIRE_SLOTS ) + ( 4 ) * ( ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) if #SMT_on else INT_MISC.RECOVERY_CYCLES ) ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) )",
+        "MetricGroup": "TmaL1",
+        "MetricName": "tma_bad_speculation_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path.",
+        "MetricExpr": "100 * ( ( BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT ) ) * ( ( UOPS_ISSUED.ANY - ( UOPS_RETIRED.RETIRE_SLOTS ) + ( 4 ) * ( ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) if #SMT_on else INT_MISC.RECOVERY_CYCLES ) ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) )",
+        "MetricGroup": "BadSpec;BrMispredicts;TmaL2;m_tma_bad_speculation_percent",
+        "MetricName": "tma_branch_mispredicts_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears.  These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes.",
+        "MetricExpr": "100 * ( ( ( UOPS_ISSUED.ANY - ( UOPS_RETIRED.RETIRE_SLOTS ) + ( 4 ) * ( ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) if #SMT_on else INT_MISC.RECOVERY_CYCLES ) ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) - ( ( BR_MISP_RETIRED.ALL_BRANCHES / ( BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT ) ) * ( ( UOPS_ISSUED.ANY - ( UOPS_RETIRED.RETIRE_SLOTS ) + ( 4 ) * ( ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) if #SMT_on else INT_MISC.RECOVERY_CYCLES ) ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) ) )",
+        "MetricGroup": "BadSpec;MachineClears;TmaL2;m_tma_bad_speculation_percent",
+        "MetricName": "tma_machine_clears_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.",
+        "MetricExpr": "100 * ( 1 - ( ( IDQ_UOPS_NOT_DELIVERED.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) + ( ( UOPS_ISSUED.ANY - ( UOPS_RETIRED.RETIRE_SLOTS ) + ( 4 ) * ( ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) if #SMT_on else INT_MISC.RECOVERY_CYCLES ) ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) + ( ( UOPS_RETIRED.RETIRE_SLOTS ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) ) )",
+        "MetricGroup": "TmaL1",
+        "MetricName": "tma_backend_bound_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck.  Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
+        "MetricExpr": "100 * ( ( ( CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB ) / ( ( CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - ( UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if ( ( INST_RETIRED.ANY / ( CPU_CLK_UNHALTED.THREAD ) ) > 1.8 ) else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC ) - ( RS_EVENTS.EMPTY_CYCLES if ( ( ( 4 ) * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) > 0.1 ) else 0 ) + RESOURCE_STALLS.SB ) ) ) * ( 1 - ( ( IDQ_UOPS_NOT_DELIVERED.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) + ( ( UOPS_ISSUED.ANY - ( UOPS_RETIRED.RETIRE_SLOTS ) + ( 4 ) * ( ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) if #SMT_on else INT_MISC.RECOVERY_CYCLES ) ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) + ( ( UOPS_RETIRED.RETIRE_SLOTS ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) ) ) )",
+        "MetricGroup": "Backend;TmaL2;m_tma_backend_bound_percent",
+        "MetricName": "tma_memory_bound_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache.  The L1 data cache typically has the shortest latency.  However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache.",
+        "MetricExpr": "100 * ( max( ( CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS ) / ( CPU_CLK_UNHALTED.THREAD ) , 0 ) )",
+        "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TmaL3;m_tma_memory_bound_percent",
+        "MetricName": "tma_l1_bound_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads.  Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance.",
+        "MetricExpr": "100 * ( ( CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS ) / ( CPU_CLK_UNHALTED.THREAD ) )",
+        "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TmaL3;m_tma_memory_bound_percent",
+        "MetricName": "tma_l2_bound_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core.  Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance.",
+        "MetricExpr": "100 * ( ( MEM_LOAD_UOPS_RETIRED.L3_HIT / ( MEM_LOAD_UOPS_RETIRED.L3_HIT + ( 7 ) * MEM_LOAD_UOPS_RETIRED.L3_MISS ) ) * CYCLE_ACTIVITY.STALLS_L2_MISS / ( CPU_CLK_UNHALTED.THREAD ) )",
+        "MetricGroup": "CacheMisses;MemoryBound;TmaL3mem;TmaL3;m_tma_memory_bound_percent",
+        "MetricName": "tma_l3_bound_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance.",
+        "MetricExpr": "100 * ( min( ( ( 1 - ( MEM_LOAD_UOPS_RETIRED.L3_HIT / ( MEM_LOAD_UOPS_RETIRED.L3_HIT + ( 7 ) * MEM_LOAD_UOPS_RETIRED.L3_MISS ) ) ) * CYCLE_ACTIVITY.STALLS_L2_MISS / ( CPU_CLK_UNHALTED.THREAD ) ) , ( 1 ) ) )",
+        "MetricGroup": "MemoryBound;TmaL3mem;TmaL3;m_tma_memory_bound_percent",
+        "MetricName": "tma_dram_bound_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled  due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck.",
+        "MetricExpr": "100 * ( RESOURCE_STALLS.SB / ( CPU_CLK_UNHALTED.THREAD ) )",
+        "MetricGroup": "MemoryBound;TmaL3mem;TmaL3;m_tma_memory_bound_percent",
+        "MetricName": "tma_store_bound_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "MetricExpr": "100 * ( ( 1 - ( ( IDQ_UOPS_NOT_DELIVERED.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) + ( ( UOPS_ISSUED.ANY - ( UOPS_RETIRED.RETIRE_SLOTS ) + ( 4 ) * ( ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) if #SMT_on else INT_MISC.RECOVERY_CYCLES ) ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) + ( ( UOPS_RETIRED.RETIRE_SLOTS ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) ) ) - ( ( ( CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB ) / ( ( CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - ( UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if ( ( INST_RETIRED.ANY / ( CPU_CLK_UNHALTED.THREAD ) ) > 1.8 ) else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC ) - ( RS_EVENTS.EMPTY_CYCLES if ( ( ( 4 ) * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) > 0.1 ) else 0 ) + RESOURCE_STALLS.SB ) ) ) * ( 1 - ( ( IDQ_UOPS_NOT_DELIVERED.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) + ( ( UOPS_ISSUED.ANY - ( UOPS_RETIRED.RETIRE_SLOTS ) + ( 4 ) * ( ( INT_MISC.RECOVERY_CYCLES_ANY / 2 ) if #SMT_on else INT_MISC.RECOVERY_CYCLES ) ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) + ( ( UOPS_RETIRED.RETIRE_SLOTS ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) ) ) ) )",
+        "MetricGroup": "Backend;TmaL2;Compute;m_tma_backend_bound_percent",
+        "MetricName": "tma_core_bound_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication.",
+        "MetricExpr": "100 * ( ARITH.FPU_DIV_ACTIVE / ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) )",
+        "MetricGroup": "TmaL3;m_tma_core_bound_percent",
+        "MetricName": "tma_divider_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related).  Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.",
+        "MetricExpr": "100 * ( ( ( ( CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - ( UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if ( ( INST_RETIRED.ANY / ( CPU_CLK_UNHALTED.THREAD ) ) > 1.8 ) else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC ) - ( RS_EVENTS.EMPTY_CYCLES if ( ( ( 4 ) * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) > 0.1 ) else 0 ) + RESOURCE_STALLS.SB ) ) - RESOURCE_STALLS.SB - CYCLE_ACTIVITY.STALLS_MEM_ANY ) / ( CPU_CLK_UNHALTED.THREAD ) )",
+        "MetricGroup": "PortsUtil;TmaL3;m_tma_core_bound_percent",
+        "MetricName": "tma_ports_utilization_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. ",
+        "MetricExpr": "100 * ( ( UOPS_RETIRED.RETIRE_SLOTS ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) )",
+        "MetricGroup": "TmaL1",
+        "MetricName": "tma_retiring_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved.",
+        "MetricExpr": "100 * ( ( ( UOPS_RETIRED.RETIRE_SLOTS ) / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) - ( ( ( ( UOPS_RETIRED.RETIRE_SLOTS ) / UOPS_ISSUED.ANY ) * IDQ.MS_UOPS / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) ) )",
+        "MetricGroup": "Retire;TmaL2;m_tma_retiring_percent",
+        "MetricName": "tma_light_operations_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "MetricExpr": "100 * ( ( INST_RETIRED.X87 * ( ( UOPS_RETIRED.RETIRE_SLOTS ) / INST_RETIRED.ANY ) / ( UOPS_RETIRED.RETIRE_SLOTS ) ) + ( ( FP_ARITH_INST_RETIRED.SCALAR_SINGLE + FP_ARITH_INST_RETIRED.SCALAR_DOUBLE ) / ( UOPS_RETIRED.RETIRE_SLOTS ) ) + ( min( ( ( FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE ) / ( UOPS_RETIRED.RETIRE_SLOTS ) ) , ( 1 ) ) ) )",
+        "MetricGroup": "HPC;TmaL3;m_tma_light_operations_percent",
+        "MetricName": "tma_fp_arith_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or microcoded sequences. This highly-correlates with the uop length of these instructions/sequences.",
+        "MetricExpr": "100 * ( ( ( ( UOPS_RETIRED.RETIRE_SLOTS ) / UOPS_ISSUED.ANY ) * IDQ.MS_UOPS / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) ) )",
+        "MetricGroup": "Retire;TmaL2;m_tma_retiring_percent",
+        "MetricName": "tma_heavy_operations_percent",
+        "ScaleUnit": "1%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit.  The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided.",
+        "MetricExpr": "100 * ( ( ( UOPS_RETIRED.RETIRE_SLOTS ) / UOPS_ISSUED.ANY ) * IDQ.MS_UOPS / ( ( 4 ) * ( ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else ( CPU_CLK_UNHALTED.THREAD ) ) ) )",
+        "MetricGroup": "MicroSeq;TmaL3;m_tma_heavy_operations_percent",
+        "MetricName": "tma_microcode_sequencer_percent",
+        "ScaleUnit": "1%"
    }
 ]
--- a/tools/perf/pmu-events/arch/x86/broadwellx/cache.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellx/cache.json
@ -814,9 +814,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_CODE_RD.LLC_HIT.HIT_OTHER_CORE_NO_FWD",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x04003C0244",
+        "MSRValue": "0x4003C0244",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch code reads hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -829,7 +828,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x10003C0091",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch data reads hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -840,9 +838,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_DATA_RD.LLC_HIT.HIT_OTHER_CORE_NO_FWD",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x04003C0091",
+        "MSRValue": "0x4003C0091",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch data reads hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -855,7 +852,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x10003C07F7",
        "Offcore": "1",
-        "PublicDescription": "Counts all data/code/rfo reads (demand & prefetch) hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -866,9 +862,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_READS.LLC_HIT.HIT_OTHER_CORE_NO_FWD",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x04003C07F7",
+        "MSRValue": "0x4003C07F7",
        "Offcore": "1",
-        "PublicDescription": "Counts all data/code/rfo reads (demand & prefetch) hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -881,7 +876,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3F803C8FFF",
        "Offcore": "1",
-        "PublicDescription": "Counts all requests hit in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -894,7 +888,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x10003C0122",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch RFOs hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -905,9 +898,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_RFO.LLC_HIT.HIT_OTHER_CORE_NO_FWD",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x04003C0122",
+        "MSRValue": "0x4003C0122",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch RFOs hit in the L3 and the snoops to sibling cores hit in either E/S state and the line is not forwarded",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -920,7 +912,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3F803C0002",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand data writes (RFOs) hit in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -933,7 +924,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x10003C0002",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand data writes (RFOs) hit in the L3 and the snoop to one of the sibling cores hits the line in M state and the line is forwarded",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -946,7 +936,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3F803C0200",
        "Offcore": "1",
-        "PublicDescription": "Counts prefetch (that bring data to LLC only) code reads hit in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -959,7 +948,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3F803C0100",
        "Offcore": "1",
-        "PublicDescription": "Counts all prefetch (that bring data to LLC only) RFOs hit in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -973,4 +961,4 @@
        "SampleAfterValue": "100003",
        "UMask": "0x10"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellx/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellx/floating-point.json
@ -5,6 +5,7 @@
        "CounterHTOff": "0,1,2,3",
        "EventCode": "0xc7",
        "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE",
+        "PublicDescription": "Number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below.  Each count represents 2 computation operations, one for each element.  Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB.  DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
        "SampleAfterValue": "2000003",
        "UMask": "0x4"
    },
@ -14,6 +15,7 @@
        "CounterHTOff": "0,1,2,3",
        "EventCode": "0xc7",
        "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE",
+        "PublicDescription": "Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below.  Each count represents 4 computation operations, one for each element.  Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB.  DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
        "SampleAfterValue": "2000003",
        "UMask": "0x8"
    },
@ -23,6 +25,7 @@
        "CounterHTOff": "0,1,2,3",
        "EventCode": "0xc7",
        "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE",
+        "PublicDescription": "Number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below.  Each count represents 4 computation operations, one for each element.  Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT FM(N)ADD/SUB.  FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
        "SampleAfterValue": "2000003",
        "UMask": "0x10"
    },
@ -32,6 +35,7 @@
        "CounterHTOff": "0,1,2,3",
        "EventCode": "0xc7",
        "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE",
+        "PublicDescription": "Number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below.  Each count represents 8 computation operations, one for each element.  Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB.  DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
        "SampleAfterValue": "2000003",
        "UMask": "0x20"
    },
@ -59,6 +63,7 @@
        "CounterHTOff": "0,1,2,3",
        "EventCode": "0xc7",
        "EventName": "FP_ARITH_INST_RETIRED.SCALAR",
+        "PublicDescription": "Number of SSE/AVX computational scalar single precision and double precision floating-point instructions retired; some instructions will count twice as noted below.  Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB.  FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
        "SampleAfterValue": "2000003",
        "UMask": "0x3"
    },
@ -68,6 +73,7 @@
        "CounterHTOff": "0,1,2,3",
        "EventCode": "0xc7",
        "EventName": "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
+        "PublicDescription": "Number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below.  Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB.  FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
        "SampleAfterValue": "2000003",
        "UMask": "0x1"
    },
@ -77,6 +83,7 @@
        "CounterHTOff": "0,1,2,3",
        "EventCode": "0xc7",
        "EventName": "FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
+        "PublicDescription": "Number of SSE/AVX computational scalar single precision floating-point instructions retired; some instructions will count twice as noted below.  Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB.  FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
        "SampleAfterValue": "2000003",
        "UMask": "0x2"
    },
@ -190,4 +197,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x3"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json
@ -292,4 +292,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x1"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellx/memory.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellx/memory.json
@ -247,7 +247,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3FBFC00244",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch code reads miss in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -258,9 +257,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_CODE_RD.LLC_MISS.LOCAL_DRAM",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x0604000244",
+        "MSRValue": "0x604000244",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch code reads miss the L3 and the data is returned from local dram",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -273,7 +271,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3FBFC00091",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch data reads miss in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -284,9 +281,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_DATA_RD.LLC_MISS.LOCAL_DRAM",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x0604000091",
+        "MSRValue": "0x604000091",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch data reads miss the L3 and the data is returned from local dram",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -297,9 +293,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_DATA_RD.LLC_MISS.REMOTE_DRAM",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x063BC00091",
+        "MSRValue": "0x63BC00091",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch data reads miss the L3 and the data is returned from remote dram",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -312,7 +307,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x103FC00091",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch data reads miss the L3 and the modified data is transferred from remote cache",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -323,9 +317,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_DATA_RD.LLC_MISS.REMOTE_HIT_FORWARD",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x087FC00091",
+        "MSRValue": "0x87FC00091",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch data reads miss the L3 and clean or shared data is transferred from remote cache",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -338,20 +331,18 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3FBFC007F7",
        "Offcore": "1",
-        "PublicDescription": "Counts all data/code/rfo reads (demand & prefetch) miss in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
    {
-        "BriefDescription": "Counts all data/code/rfo reads (demand & prefetch)miss the L3 and the data is returned from local dram",
+        "BriefDescription": "Counts all data/code/rfo reads (demand & prefetch) miss the L3 and the data is returned from local dram",
        "Counter": "0,1,2,3",
        "CounterHTOff": "0,1,2,3",
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_READS.LLC_MISS.LOCAL_DRAM",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x06040007F7",
+        "MSRValue": "0x6040007F7",
        "Offcore": "1",
-        "PublicDescription": "Counts all data/code/rfo reads (demand & prefetch)miss the L3 and the data is returned from local dram",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -362,9 +353,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_READS.LLC_MISS.REMOTE_DRAM",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x063BC007F7",
+        "MSRValue": "0x63BC007F7",
        "Offcore": "1",
-        "PublicDescription": "Counts all data/code/rfo reads (demand & prefetch) miss the L3 and the data is returned from remote dram",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -377,7 +367,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x103FC007F7",
        "Offcore": "1",
-        "PublicDescription": "Counts all data/code/rfo reads (demand & prefetch) miss the L3 and the modified data is transferred from remote cache",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -388,9 +377,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_READS.LLC_MISS.REMOTE_HIT_FORWARD",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x087FC007F7",
+        "MSRValue": "0x87FC007F7",
        "Offcore": "1",
-        "PublicDescription": "Counts all data/code/rfo reads (demand & prefetch) miss the L3 and clean or shared data is transferred from remote cache",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -403,7 +391,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3FBFC08FFF",
        "Offcore": "1",
-        "PublicDescription": "Counts all requests miss in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -416,7 +403,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3FBFC00122",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch RFOs miss in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -427,9 +413,8 @@
        "EventCode": "0xB7, 0xBB",
        "EventName": "OFFCORE_RESPONSE.ALL_RFO.LLC_MISS.LOCAL_DRAM",
        "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x0604000122",
+        "MSRValue": "0x604000122",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand & prefetch RFOs miss the L3 and the data is returned from local dram",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -442,7 +427,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3FBFC00002",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand data writes (RFOs) miss in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -455,7 +439,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x103FC00002",
        "Offcore": "1",
-        "PublicDescription": "Counts all demand data writes (RFOs) miss the L3 and the modified data is transferred from remote cache",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -468,7 +451,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3FBFC00200",
        "Offcore": "1",
-        "PublicDescription": "Counts prefetch (that bring data to LLC only) code reads miss in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -481,7 +463,6 @@
        "MSRIndex": "0x1a6,0x1a7",
        "MSRValue": "0x3FBFC00100",
        "Offcore": "1",
-        "PublicDescription": "Counts all prefetch (that bring data to LLC only) RFOs miss in the L3",
        "SampleAfterValue": "100003",
        "UMask": "0x1"
    },
@ -684,4 +665,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x40"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellx/other.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellx/other.json
@ -41,4 +41,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x1"
    }
-]
+]
--- a/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json
@ -1369,7 +1369,7 @@
        "BriefDescription": "Cycles with less than 10 actually retired uops.",
        "Counter": "0,1,2,3",
        "CounterHTOff": "0,1,2,3",
-        "CounterMask": "10",
+        "CounterMask": "16",
        "EventCode": "0xC2",
        "EventName": "UOPS_RETIRED.TOTAL_CYCLES",
        "Invert": "1",
@ -1377,4 +1377,4 @@
        "SampleAfterValue": "2000003",
        "UMask": "0x1"
    }
-]
+]
--- a/Show More
+++ b/Show More