Commit Graph

1221843 Commits

Author SHA1 Message Date
Yosry Ahmed 1a9570e74d mm: workingset: move the stats flush into workingset_test_recent()
Upstream: commit b006847222623ac3cda8589d15379eac86a2bcb7
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

The workingset code flushes the stats in workingset_refault() to get
accurate stats of the eviction memcg.  In preparation for more scoped
flushing and passing the eviction memcg to the flush call, move the call
to workingset_test_recent() where we have a pointer to the eviction memcg.

The flush call is sleepable, and cannot be made in an rcu read section.
Hence, minimize the rcu read section by also moving it into
workingset_test_recent().  Furthermore, instead of holding the rcu read
lock throughout workingset_test_recent(), only hold it briefly to get a
ref on the eviction memcg.  This allows us to make the flush call after we
get the eviction memcg.

As for workingset_refault(), nothing else there appears to be protected by
rcu.  The memcg of the faulted folio (which is not necessarily the same as
the eviction memcg) is protected by the folio lock, which is held from all
callsites.  Add a VM_BUG_ON() to make sure this doesn't change from under
us.

No functional change intended.
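The locking pattern described above can be sketched as a small userspace model. The helper names here are illustrative stand-ins (the real code uses mem_cgroup_tryget(), mem_cgroup_put() and the memcg flush call), and the RCU machinery is modeled by a simple depth counter:

```c
#include <assert.h>
#include <stdbool.h>

static int rcu_depth;   /* models being inside an RCU read-side section */
static bool flushed;

static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

struct memcg { int refcnt; };

static bool memcg_tryget(struct memcg *m)
{
    if (m->refcnt == 0)
        return false;   /* memcg already dead */
    m->refcnt++;
    return true;
}

static void memcg_put(struct memcg *m) { m->refcnt--; }

/* Sleepable call: would trip might_sleep() inside an RCU read section. */
static void flush_stats(struct memcg *m)
{
    (void)m;
    assert(rcu_depth == 0);
    flushed = true;
}

/* The pattern from the patch: hold the RCU lock only long enough to pin
 * the eviction memcg with a reference, then flush with just the refcount
 * held, outside the read section. */
static bool test_recent(struct memcg *eviction_memcg)
{
    rcu_read_lock();
    if (!memcg_tryget(eviction_memcg)) {
        rcu_read_unlock();
        return false;
    }
    rcu_read_unlock();

    flush_stats(eviction_memcg);   /* sleepable, now legal */
    /* ... recency check would go here ... */
    memcg_put(eviction_memcg);
    return true;
}
```

The key point is that the sleepable flush happens with only a reference held, never under the RCU read lock.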

Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:48 +08:00
Yosry Ahmed 0f76dc379d mm: memcg: make stats flushing threshold per-memcg
Upstream: commit 8d59d2214c2362e7a9d185d80b613e632581af7b
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

A global counter for the magnitude of memcg stats update is maintained on
the memcg side to avoid invoking rstat flushes when the pending updates
are not significant.  This avoids unnecessary flushes, which are not very
cheap even if there isn't a lot of stats to flush.  It also avoids
unnecessary lock contention on the underlying global rstat lock.

Make this threshold per-memcg.  The same scheme is followed where percpu
(now also per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold.

This provides two benefits: (a) On large machines with a lot of memcgs,
the global threshold can be reached relatively fast, so guarding the
underlying lock becomes less effective.  Making the threshold per-memcg
avoids this.

(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush.  Per-memcg
counters remove this blocker from doing subtree flushes, which helps
avoid unnecessary work when the stats of a small subtree are needed.
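The update-side scheme can be sketched in plain C. This is a userspace model under stated assumptions: the names, the ancestor walk, and the THRESHOLD value are illustrative, not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS   4
#define THRESHOLD 64   /* stand-in for MEMCG_CHARGE_BATCH */

struct memcg {
    struct memcg *parent;
    int percpu_pending[NR_CPUS]; /* cheap per-cpu deltas, no shared-cacheline traffic */
    atomic_long stats_updates;   /* per-memcg magnitude of pending updates */
};

/* Update path: walk the ancestors, batching into per-cpu counters and only
 * touching the shared per-memcg atomic once a per-cpu delta crosses
 * THRESHOLD. */
static void memcg_rstat_updated(struct memcg *m, int cpu, int val)
{
    for (; m; m = m->parent) {
        m->percpu_pending[cpu] += val < 0 ? -val : val;
        if (m->percpu_pending[cpu] >= THRESHOLD) {
            atomic_fetch_add(&m->stats_updates, m->percpu_pending[cpu]);
            m->percpu_pending[cpu] = 0;
        }
    }
}

/* Read path: flushing this subtree is only worthwhile when its own
 * pending magnitude is significant. */
static int memcg_should_flush(struct memcg *m)
{
    return atomic_load(&m->stats_updates) > THRESHOLD * NR_CPUS;
}
```

Because the counter lives in each memcg, a subtree read can consult (and later reset) only its own counter, which a single global counter cannot offer.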

Nothing is free, of course.  This comes at a cost: (a) A new per-cpu
counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 bytes.  The extra
memory usage is insignificant.

(b) More work on the update side, although in the common case it will only
be percpu counter updates.  The amount of work scales with the number of
ancestors (i.e.  tree depth).  This is not a new concept: adding a cgroup
to the rstat tree involves a parent loop, and so does charging.  Testing
results below show no significant regressions.

(c) The error margin in the stats for the system as a whole increases from
NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * NR_MEMCGS.
This is probably fine because we have a similar per-memcg error in charges
coming from percpu stocks, and we have a periodic flusher that makes sure
we always flush all the stats every 2s anyway.

This patch was tested to make sure no significant regressions are
introduced on the update path as follows.  The following benchmarks were
run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):

(1) Running 22 instances of netperf on a 44 cpu machine with
hyperthreading disabled. All instances are run in a level 2 cgroup, as
well as netserver:
  # netserver -6
  # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Averaging 20 runs, the numbers are as follows:
Base: 40198.0 mbps
Patched: 38629.7 mbps (-3.9%)

The regression is minimal, especially for 22 instances in the same
cgroup sharing all ancestors (so updating the same atomics).

(2) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [1]. These are the
numbers from 10 runs (+ is good) on a machine with 256 cpus:

             LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
------------------------------+-------------+-------------+------------+
  page_fault1_per_process_ops |             |             |            |
  (A) base                    | 270249.164  | 265437.000  | 13451.836  |
  (B) patched                 | 261368.709  | 255725.000  | 13394.767  |
                              | -3.29%      | -3.66%      |            |
  page_fault1_per_thread_ops  |             |             |            |
  (A) base                    | 242111.345  | 239737.000  | 10026.031  |
  (B) patched                 | 237057.109  | 235305.000  | 9769.687   |
                              | -2.09%      | -1.85%      |            |
  page_fault1_scalability     |             |             |            |
  (A) base                    | 0.034387    | 0.035168    | 0.0018283  |
  (B) patched                 | 0.033988    | 0.034573    | 0.0018056  |
                              | -1.16%      | -1.69%      |            |
  page_fault2_per_process_ops |             |             |            |
  (A) base                    | 203561.836  | 203301.000  | 2550.764   |
  (B) patched                 | 197195.945  | 197746.000  | 2264.263   |
                              | -3.13%      | -2.73%      |            |
  page_fault2_per_thread_ops  |             |             |            |
  (A) base                    | 171046.473  | 170776.000  | 1509.679   |
  (B) patched                 | 166626.327  | 166406.000  | 768.753    |
                              | -2.58%      | -2.56%      |            |
  page_fault2_scalability     |             |             |            |
  (A) base                    | 0.054026    | 0.053821    | 0.00062121 |
  (B) patched                 | 0.053329    | 0.05306     | 0.00048394 |
                              | -1.29%      | -1.41%      |            |
  page_fault3_per_process_ops |             |             |            |
  (A) base                    | 1295807.782 | 1297550.000 | 5907.585   |
  (B) patched                 | 1275579.873 | 1273359.000 | 8759.160   |
                              | -1.56%      | -1.86%      |            |
  page_fault3_per_thread_ops  |             |             |            |
  (A) base                    | 391234.164  | 390860.000  | 1760.720   |
  (B) patched                 | 377231.273  | 376369.000  | 1874.971   |
                              | -3.58%      | -3.71%      |            |
  page_fault3_scalability     |             |             |            |
  (A) base                    | 0.60369     | 0.60072     | 0.0083029  |
  (B) patched                 | 0.61733     | 0.61544     | 0.009855   |
                              | +2.26%      | +2.45%      |            |

All regressions seem to be minimal, and within the normal variance for the
benchmark.  The fix for [1] assumed that 3% is noise (and there were no
further practical complaints), so hopefully this means that such
variations in these microbenchmarks do not reflect on practical workloads.

(3) I also ran stress-ng in a nested cgroup and did not observe any
obvious regressions.

[1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/

Link: https://lkml.kernel.org/r/20231129032154.3710765-4-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:48 +08:00
Yosry Ahmed e3b1808e92 mm: memcg: move vmstats structs definition above flushing code
Upstream: commit e0bf1dc859fdd08ef738824710770a30a8069433
Conflicts: resolved
Backport-reason: mm: memcg: subtree stats flushing and thresholds

The following patch will make use of those structs in the flushing code,
so move their definitions (and a few other dependencies) a little bit up
to reduce the diff noise in the following patch.

No functional change intended.

Link: https://lkml.kernel.org/r/20231129032154.3710765-3-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:48 +08:00
Yosry Ahmed 835b80a758 mm: memcg: change flush_next_time to flush_last_time
Upstream: commit 508bed884767a8eb394640bae9edcdf082816c43
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

Patch series "mm: memcg: subtree stats flushing and thresholds", v4.

This series attempts to address shortcomings in today's approach for memcg
stats flushing, namely occasionally stale or expensive stat reads.  The
series does so by changing the threshold that we use to decide whether to
trigger a flush to be per memcg instead of global (patch 3), and then
changing flushing to be per memcg (i.e.  subtree flushes) instead of
global (patch 5).

This patch (of 5):

flush_next_time is an inaccurate name.  It's not the next time that
periodic flushing will happen, it's rather the next time that ratelimited
flushing can happen if the periodic flusher is late.

Simplify its semantics by just storing the timestamp of the last flush
instead, flush_last_time.  Move the 2*FLUSH_TIME addition to
mem_cgroup_flush_stats_ratelimited(), and add a comment explaining it.
This way, all the ratelimiting semantics live in one place.

No functional change intended.
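The resulting ratelimiting logic can be sketched as a minimal userspace model. The kernel compares jiffies; here FLUSH_TIME is an illustrative 2 seconds expressed in milliseconds:

```c
#include <assert.h>

#define FLUSH_TIME 2000L   /* periodic flush interval: 2s, in ms here */

static long flush_last_time;   /* timestamp of the last flush, of any kind */
static int  nflush;

static void do_flush(long now)
{
    flush_last_time = now;
    nflush++;
}

/* All of the ratelimiting semantics live in one place: flush only if the
 * periodic flusher appears to be late, i.e. more than 2*FLUSH_TIME has
 * passed since the last flush (periodic or otherwise). */
static void flush_stats_ratelimited(long now)
{
    if (now > flush_last_time + 2 * FLUSH_TIME)
        do_flush(now);
}
```

Storing the last-flush timestamp keeps the "2*FLUSH_TIME" grace period visible at the single call site that uses it, instead of baking it into the stored value.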

Link: https://lkml.kernel.org/r/20231129032154.3710765-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20231129032154.3710765-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:47 +08:00
Kairui Song 9faef83a53 kabi: move reservation in mem_cgroup to tail
Upstream: no

The reservation was mistakenly put in the middle; fix it.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:47 +08:00
Kairui Song 2488b07a78 dist: config: update config
Upstream: no

Reduce NODES_SHIFT to 8 to leave more space for page flags.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:47 +08:00
Kairui Song 690dae3cfe dist: disable non kernel pkg on non default config
Upstream: no

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:46 +08:00
Kairui Song 432d074030 dist: add eks base config
Copied from 0017 with EMM enabled, and tidied up with defconfig.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 15:36:21 +08:00
Kairui Song ec94d992b4 eks: net/toa: add ali_cip support
Upstream: no

As required for EKS.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 15:36:21 +08:00
Kairui Song 1583c6df5c eks: kvm/x86: introduce CONFIG_KVM_FORCE_PVCLOCK
Upstream: no

Allow forcing the use of PVCLOCK for better performance, at the cost of
some time accuracy.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 15:36:20 +08:00
linuszeng 390cfbd393 dist: kernel.template.spec: add lz4 build request
Upstream: no

Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
2024-04-03 15:36:20 +08:00
frankjpliu 62cfac3173 Merge branch 'herberthbli/scx' into 'master' (merge request !34)
herberthbli/scx
These commits are from upstream; they are the preparation for sched_ext.
2024-03-29 03:02:59 +00:00
Jianping Liu 66d114febb checkpatch: add Signed-off-by check if commit cherry-pick from upstream
When a commit is cherry-picked from upstream, we should add a Signed-off-by
to the commit message.  With the Signed-off-by info, we can easily tell who
picked the commit and ask them why the commit was needed.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-27 21:59:03 +08:00
Alex Shi f55a71cb98 checkpatch: check the backported commit for CID reference
Accept a few kinds of references:
	commit xxx upstream or
	Upstream commit xxx or
	[ Upstream commit xxx or
	Upstream commit:

Signed-off-by: Alex Shi <alexsshi@tencent.com>
Acked-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-27 21:41:27 +08:00
David Vernet 991fb56f6c selftests/bpf: Test pinning bpf timer to a core
Upstream commit 0d7ae06860753bb30b3731302b994da071120d00

Now that we support pinning a BPF timer to the current core, we should
test it with some selftests. This patch adds two new testcases to the
timer suite, which verifies that a BPF timer both with and without
BPF_F_TIMER_ABS, can be pinned to the calling core with BPF_F_TIMER_CPU_PIN.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-3-void@manifault.com
2024-03-27 18:09:18 +08:00
David Vernet c12dbc1bdd bpf: Add ability to pin bpf timer to calling CPU
Upstream commit d6247ecb6c1e17d7a33317090627f5bfe563cbb2

BPF supports creating high resolution timers using bpf_timer_* helper
functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which
specifies that the timeout should be interpreted as absolute time. It
would also be useful to be able to pin that timer to a core. For
example, if you wanted to make a subset of cores run without timer
interrupts, and only have the timer be invoked on a single core.

This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag.
When specified, the HRTIMER_MODE_PINNED flag is passed to
hrtimer_start(). A subsequent patch will update selftests to validate.
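The flag plumbing can be sketched in userspace C. The uapi flag values match the patch; the hrtimer mode bits mirror the kernel's, but treat this as an illustrative model of the mode selection, not the kernel source:

```c
#include <assert.h>

/* uapi flags accepted by the bpf_timer_start() helper */
#define BPF_F_TIMER_ABS     (1ULL << 0)
#define BPF_F_TIMER_CPU_PIN (1ULL << 1)

/* hrtimer mode bits */
#define HRTIMER_MODE_ABS    0x00
#define HRTIMER_MODE_REL    0x01
#define HRTIMER_MODE_PINNED 0x02  /* fire on the CPU that armed the timer */
#define HRTIMER_MODE_SOFT   0x04  /* BPF timers always run in softirq */

/* Mode selection modeled on bpf_timer_start(): BPF_F_TIMER_CPU_PIN
 * simply ORs in HRTIMER_MODE_PINNED so the timer fires on the calling
 * CPU, independently of the ABS/REL choice. */
static int bpf_timer_mode(unsigned long long flags)
{
    int mode = (flags & BPF_F_TIMER_ABS) ? HRTIMER_MODE_ABS : HRTIMER_MODE_REL;

    mode |= HRTIMER_MODE_SOFT;
    if (flags & BPF_F_TIMER_CPU_PIN)
        mode |= HRTIMER_MODE_PINNED;
    return mode;
}
```

Because pinning is an independent bit, it composes with both absolute and relative timers, which is exactly what the selftests in the next patch exercise.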

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com
2024-03-27 18:09:14 +08:00
Ingo Molnar ed516134e3 sched/fair: Rename check_preempt_curr() to wakeup_preempt()
Upstream commit e23edc86b09df655bf8963bbcb16647adc787395

The name is a bit opaque - make it clear that this is about wakeup
preemption.

Also rename the ->check_preempt_curr() methods similarly.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-03-27 18:09:11 +08:00
Ingo Molnar 0aaab31170 sched/fair: Rename check_preempt_wakeup() to check_preempt_wakeup_fair()
Upstream commit 82845683ca6a15fe8c7912c6264bb0e84ec6f5fb

Other scheduling classes already postfix their similar methods
with the class name.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-03-27 18:09:07 +08:00
frankjpliu 175e1c3850 Merge branch 'leonylgao/master' into 'master' (merge request !25)
kabi: provide kabi check/update/create commands for local users
2024-03-26 09:38:26 +00:00
Yongliang Gao 618d09a6f8 script: update check-kabi script
Upstream: no

The check-kabi script was copied from tkernel4 and failed to run in tkernel5:
Traceback (most recent call last):
File "./scripts/check-kabi", line 143, in <module>
load_symvers(symvers,symvers_file)
File "./scripts/check-kabi", line 44, in load_symvers
checksum,symbol,directory,type = string.split(in_line)
ValueError: too many values to unpack

Update the script with a copy of dist/sources/check-kabi.

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-26 14:15:24 +08:00
Yongliang Gao 0c01b9d8fb kabi: provide kabi check/update/create commands for local users
Upstream: no

Provide kabi check/update/create commands for local users:
1. Check whether the TencentOS Kernel KABI is compatible
2. Update the TencentOS Kernel KABI file
3. Create the TencentOS Kernel KABI file

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-26 14:10:56 +08:00
Yongliang Gao 0512e1e0ee config: add kernel/configs/tkci.config
To make it easy to get the config used in tkci, add kernel/configs/tkci.config;
users can run "make tencentconfig tkci.config" to generate the .config.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: aurelianliu <aurelianliu@tencent.com>
2024-03-26 11:36:24 +08:00
Xinghui Li 0f1642a8b2 pci: bypass NVMe when booting PCIe storage with 5s delay
Commit 762cad7 ("pci: delay 5s to probe multiple storage controllers")
aimed to make the SCSI device mount order predictable.  But NVMe devices
do not need this, and the delay only increases the boot time of storage
servers.  Therefore, bypass NVMe devices here.

Signed-off-by: Xinghui Li <korantli@tencent.com>
Signed-off-by: Samuel Liao <samuelliao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:47 +08:00
Liu Yu 9cb9672adf pci: prohibit storage probe delay of virtio block device
The virtio block device has no async probe path, so it doesn't need the
probe delay.  This patch reduces kernel booting time by about 5s.

Signed-off-by: Xiaoming Gao <newtongao@tencent.com>
Signed-off-by: Liu Yu <allanyuliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:47 +08:00
Samuel Liao 9501ffdbf1 pci: delay 5s to probe multiple storage controllers
For predictable disk order.

Signed-off-by: Samuel Liao <samuelliao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:46 +08:00
costinchen ea51a4e717 spec: add support for secureboot by signing the vmlinuz.
spec: add dependency on libtool to build on koji.

Signed-off-by: Sinong Chen <costinchen@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:46 +08:00
Jianping Liu 855ffa3aaa config: enable CONFIG_ACPI_AGDI to support NMI
On AmpereONE soc, it support NMI interrupt, which needing enable
CONFIG_ACPI_AGDI.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 17:09:39 +08:00
Jianping Liu 0392975b4f config: enable ANDROID_BINDER to support android container
Cloud gaming needs to run Android containers, so add the following configs:
CONFIG_ANDROID_BINDER_IPC=y
CONFIG_ANDROID_BINDERFS=y
CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"
CONFIG_ANDROID_BINDER_IPC_SELFTEST=y

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 16:59:36 +08:00
Jianping Liu 4faa03afdc dist: add a modules-public rpm subpackage
TK has some kernel modules (such as nvidia.ko) that are only used in the
public release version; split them into a modules-public subpackage.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 12:12:14 +08:00
Jianping Liu 83c70cfab6 dist: rename modules-removable-media to modules-public-removable-media
Modules in kernel*modules-removable-media*.rpm are the drivers for removable
media.  Using them raises the attack risk, and they are not used in the
private release, only in the public release.  So, rename it.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 12:06:30 +08:00
Jianping Liu 2596824741 dist: tks: add a removable media modules pkg
TK4 has this subpackage; try to be compatible, and use filter-modules.sh
instead, which does a depmod check after splitting, to avoid depmod
failures after some modules are split into subpackages.

Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 11:55:08 +08:00
frankjpliu da878504cf Merge branch 'cunhuang/master' into 'master' (merge request !12)
sync some ampere changes from upstream
2024-03-14 11:41:36 +00:00
aurelianliu dbc51490d1 x86 and arm64 config: add more module config
Add module configs from ocks and rhel.

Signed-off-by: aurelianliu <aurelianliu@tencent.com>
2024-03-14 11:39:50 +00:00
Ilkka Koskinen 3e14a8ae4f perf vendor events arm64 AmpereOneX: Add core PMU events and metrics
commit 16438b652b464ef7d0a877d31e93ab54338f6b0a upstream.

Add JSON files for AmpereOneX core PMU events and metrics.

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20231201021550.1109196-4-ilkka@os.amperecomputing.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:17:58 +08:00
Oliver Upton 59149005ec KVM: arm64: Always invalidate TLB for stage-2 permission faults
commit be097997a273259f1723baac5463cf19d8564efa upstream.

It is possible for multiple vCPUs to fault on the same IPA and attempt
to resolve the fault. One of the page table walks will actually update
the PTE and the rest will return -EAGAIN per our race detection scheme.
KVM elides the TLB invalidation on the racing threads as the return
value is nonzero.

Before commit a12ab1378a ("KVM: arm64: Use local TLBI on permission
relaxation") KVM always used broadcast TLB invalidations when handling
permission faults, which had the convenient property of making the
stage-2 updates visible to all CPUs in the system. However now we do a
local invalidation, and TLBI elision leads to the vCPU thread faulting
again on the stale entry. Remember that the architecture permits the TLB
to cache translations that precipitate a permission fault.

Invalidate the TLB entry responsible for the permission fault if the
stage-2 descriptor has been relaxed, regardless of which thread actually
did the job.

Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230922223229.1608155-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:17:19 +08:00
Oliver Upton fc51f30f7c KVM: arm64: Avoid soft lockups due to I-cache maintenance
commit 909b583f81b5bb5a398d4580543f59b908a86ccc upstream.

Gavin reports of soft lockups on his Ampere Altra Max machine when
backing KVM guests with hugetlb pages. Upon further investigation, it
was found that the system is unable to keep up with parallel I-cache
invalidations done by KVM's stage-2 fault handler.

This is ultimately an implementation problem. I-cache maintenance
instructions are available at EL0, so nothing stops a malicious
userspace from hammering a system with CMOs and cause it to fall over.
"Fixing" this problem in KVM is nothing more than slapping a bandage
over a much deeper problem.

Anyway, the kernel already has a heuristic for limiting TLB
invalidations to avoid soft lockups. Reuse that logic to limit I-cache
CMOs done by KVM to map executable pages on systems without FEAT_DIC.
While at it, restructure __invalidate_icache_guest_page() to improve
readability and squeeze our new condition into the existing branching
structure.

Link: https://lore.kernel.org/kvmarm/20230904072826.1468907-1-gshan@redhat.com/
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230920080133.944717-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:16:44 +08:00
Oliver Upton 3a23b1b952 arm64: tlbflush: Rename MAX_TLBI_OPS
commit ec1c3b9ff16082f880b304be40992568f4eee6a7 upstream.

Perhaps unsurprisingly, I-cache invalidations suffer from performance
issues similar to TLB invalidations on certain systems. TLB and I-cache
maintenance all result in DVM on the mesh, which is where the real
bottleneck lies.

Rename the heuristic to point the finger at DVM, such that it may be
reused for limiting I-cache invalidations.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230920080133.944717-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:16:10 +08:00
Ilkka Koskinen 7c2a440c1d docs/perf: Add ampere_cspmu to toctree to fix a build warning
commit 0abe7f61c28d62ee0530c31589e6ea209aa82cbd upstream.

Add ampere_cspmu to toctree in order to address the following warning
produced when building documents:

	Documentation/admin-guide/perf/ampere_cspmu.rst: WARNING: document isn't included in any toctree

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/all/20231011172250.5a6498e5@canb.auug.org.au/
Fixes: 53a810ad3c5c ("perf: arm_cspmu: ampere_cspmu: Add support for Ampere SoC PMU")
Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20231012074103.3772114-1-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:15:27 +08:00
Ilkka Koskinen 34dc55de64 perf: arm_cspmu: ampere_cspmu: Add support for Ampere SoC PMU
commit 53a810ad3c5cde674cac71e629e6d10bfc9d838c upstream.

Ampere SoC PMU follows CoreSight PMU architecture. It uses implementation
specific registers to filter events rather than PMEVFILTnR registers.

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20230913233941.9814-5-ilkka@os.amperecomputing.com
[will: Include linux/io.h in ampere_cspmu.c for writel()]
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:14:50 +08:00
Ilkka Koskinen e46dc8f19f perf: arm_cspmu: Support implementation specific validation
commit 647d5c5a9e7672e285f54f0e141ee759e69382f2 upstream.

Some platforms may use e.g. different filtering mechanism and, thus,
may need different way to validate the events and group.

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20230913233941.9814-4-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:13:56 +08:00
Ilkka Koskinen 773c54aa27 perf: arm_cspmu: Support implementation specific filters
commit 0a7603ab242e9bab530227cf0d0d344d4e334acc upstream.

ARM Coresight PMU architecture specification [1] defines PMEVTYPER and
PMEVFILT* registers as optional in Chapter 2.1. Moreover, implementers may
choose to use PMIMPDEF* registers (offset: 0xD80-> 0xDFF) to filter the
events. Add support for those by adding implementation specific filter
callback function.

[1] https://developer.arm.com/documentation/ihi0091/latest

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Reviewed-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20230913233941.9814-3-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:12:35 +08:00
Ilkka Koskinen 2fddeaf9e6 perf: arm_cspmu: Split 64-bit write to 32-bit writes
commit 8c282414ca6209977cb6d6cc66470ca2d1e56bf6 upstream.

Split the 64-bit register accesses if 64-bit access is not supported
by the PMU.

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Reviewed-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20230913233941.9814-2-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:11:41 +08:00
Besar Wicaksono 8365e42c05 perf: arm_cspmu: Separate Arm and vendor module
commit bfc653aa89cb05796d7b4e046600accb442c9b7a upstream.

Arm Coresight PMU driver consists of main standard code and
vendor backend code. Both are currently built as a single module.
This patch adds vendor registration API to separate the two to
keep things modular. The main driver requests each known backend
module during initialization and defer device binding process.
The backend module then registers an init callback to the main
driver and continue the device driver binding process.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-and-tested-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20230821231608.50911-1-bwicaksono@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:10:46 +08:00
Jianping Liu e5904891ad config: enable slub debug as default in debug.config
Add CONFIG_SLUB_DEBUG_ON=y only in debug.config; it will not affect the
release config.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-13 23:21:58 +08:00
Jianping Liu 0a34841932 config: enable CONFIG_HARDLOCKUP_DETECTOR
Linux 6.6 supports hard lockup detection on aarch64, so enable it.
It is useful for debugging spin deadlocks with IRQs disabled.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-13 23:16:49 +08:00
leonylgao 5034e33943 Merge branch 'frankjpliu/master' into 'master' (merge request !10)
sync some CONFIG changes from tk4
2024-03-11 08:31:18 +00:00
Jianping Liu 0ba488e345 config: enable CONFIG_DEBUG_LOCKDEP in debug.config
CONFIG_DEBUG_LOCKDEP makes it easy to debug deadlocks, so enable it
in the debug kernel.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: aurelianliu <aurelianliu@tencent.com>
2024-03-09 12:07:39 +08:00
Jianping Liu 5a8a3efc1d config/x86: Disable CONFIG_LATENCYTOP by default
Performance degradation due to multiple cores contending for a global
spinlock.

Disabling CONFIG_LATENCYTOP means SCHEDSTATS and KALLSYMS_ALL are no
longer selected by default, which makes data section symbols impossible
to look up.  So enable KALLSYMS_ALL and SCHEDSTATS explicitly after
disabling CONFIG_LATENCYTOP.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: aurelianliu <aurelianliu@tencent.com>
2024-03-09 11:57:41 +08:00
Jianping Liu 3e239a5e1c config: disable CONFIG_MODULE_SIG_FORCE
In TK5, the private and public kernel rpms will be the same, so disable
CONFIG_MODULE_SIG_FORCE and use the module.sig_enforce=1 kernel parameter
in the private TS distro.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: aurelianliu <aurelianliu@tencent.com>
2024-03-09 11:43:00 +08:00
Jianping Liu 8181f65c93 config: update tencent.config by make savedefconfig
Update tencent.config by:
make tencentconfig
make savedefconfig
mv defconfig arch/x86/configs/tencent.config
Without any manual change.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: aurelianliu <aurelianliu@tencent.com>
2024-03-09 11:36:49 +08:00