Commit Graph

1221847 Commits

Author SHA1 Message Date
Kairui Song fd77451861 emm: memcg, zram: add support for ZRAM memory accounting
Upstream: alternative

Add a CONFIG_MEMCG_ZRAM option for the ZRAM driver to use later. This
commit only adds the basic structures.

The current plan is to implement the accounting at the ZRAM block level
for simplicity of design; we may move it to the zpool level later for
unified zram / zswap accounting.
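
As a loosely hedged sketch, a block-level accounting hook might look
something like the following; CONFIG_MEMCG_ZRAM is from this commit, but
the field and helper names here are hypothetical illustrations, not the
actual patch:

  #ifdef CONFIG_MEMCG_ZRAM
  /* Hypothetical per-memcg counter of ZRAM-backed bytes; the real
   * structures may differ. */
  static inline void mem_cgroup_zram_charge(struct mem_cgroup *memcg,
                                            long nr_bytes)
  {
          atomic_long_add(nr_bytes, &memcg->zram_bytes);
  }
  #endif /* CONFIG_MEMCG_ZRAM */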

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:50 +08:00
Kairui Song 5e60af62c1 emm: mm: make it possible to disable memcg kmem by default
Upstream: no

Introduce a MEMCG_KMEM_DEFAULT_OFF config option.
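
A minimal sketch of how such a default could be wired up, assuming the
existing cgroup_memory_nokmem toggle in mm/memcontrol.c; the exact hookup
in this patch may differ:

  /* Assumption: let the new Kconfig option set the default of the
   * existing cgroup.memory=nokmem toggle. Illustrative only. */
  static bool cgroup_memory_nokmem __ro_after_init =
          IS_ENABLED(CONFIG_MEMCG_KMEM_DEFAULT_OFF);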

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:49 +08:00
Yosry Ahmed 4486196118 mm: memcg: optimize parent iteration in memcg_rstat_updated()
Upstream: commit 9cee7e8ef3e31ca25b40ca52b8585dc6935deff2
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

In memcg_rstat_updated(), we iterate the memcg being updated and its
parents to update memcg->vmstats_percpu->stats_updates in the fast path
(i.e. no atomic updates). According to my math, this is 3 memory loads
(and potentially 3 cache misses) per memcg:
- Load the address of memcg->vmstats_percpu.
- Load vmstats_percpu->stats_updates (based on some percpu calculation).
- Load the address of the parent memcg.

Avoid most of the cache misses by caching a pointer from each struct
memcg_vmstats_percpu to its parent on the corresponding CPU. In this
case, for the first memcg we have 2 memory loads (same as above):
- Load the address of memcg->vmstats_percpu.
- Load vmstats_percpu->stats_updates (based on some percpu calculation).

Then for each additional memcg, we need a single load to get the
parent's stats_updates directly. This reduces the number of loads from
O(3N) to O(2+N) -- where N is the number of memcgs we need to iterate.

Additionally, stash a pointer to memcg->vmstats in each struct
memcg_vmstats_percpu such that we can access the atomic counter that all
CPUs fold into, memcg->vmstats->stats_updates.
memcg_should_flush_stats() is changed to memcg_vmstats_needs_flush() to
accept a struct memcg_vmstats pointer accordingly.

In struct memcg_vmstats_percpu, make sure both pointers together with
stats_updates live on the same cacheline. Finally, update
mem_cgroup_alloc() to take in a parent pointer and initialize the new
cache pointers on each CPU. The percpu loop in mem_cgroup_alloc() may
look concerning, but there are multiple similar loops in the cgroup
creation path (e.g. cgroup_rstat_init()), most of which are hidden
within alloc_percpu().
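
A sketch of the resulting layout, simplified from the upstream commit
(the other percpu counters are elided):

  struct memcg_vmstats_percpu {
          /* Stats updates since the last flush */
          unsigned int                    stats_updates;

          /* Cached pointers for fast iteration in memcg_rstat_updated() */
          struct memcg_vmstats_percpu     *parent;
          struct memcg_vmstats            *vmstats;

          /* ... stat and event counters elided ... */
  };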

According to Oliver's testing [1], this fixes multiple 30-38%
regressions in vm-scalability, will-it-scale-tlb_flush2, and
will-it-scale-fallocate1. This comes at a cost of 2 more pointers per
CPU (<2KB on a machine with 128 CPUs).

[1] https://lore.kernel.org/lkml/ZbDJsfsZt2ITyo61@xsang-OptiPlex-9020/

[yosryahmed@google.com: fix struct memcg_vmstats_percpu size and alignment]
  Link: https://lkml.kernel.org/r/20240203044612.1234216-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20240124100023.660032-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Fixes: 8d59d2214c23 ("mm: memcg: make stats flushing threshold per-memcg")
Tested-by: kernel test robot <oliver.sang@intel.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401221624.cb53a8ca-oliver.sang@intel.com
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:49 +08:00
Yosry Ahmed f7a35d7bb7 mm: memcg: restore subtree stats flushing
Upstream: commit 7d7ef0a4686abe43cd76a141b340a348f45ecdf2
Conflicts: Skip the change in zswap.c due to missing b5ba474f3f51;
    should be OK, a later backport will easily notice the change of
    function params.
Backport-reason: mm: memcg: subtree stats flushing and thresholds

Stats flushing for memcg currently follows the following rules:
- Always flush the entire memcg hierarchy (i.e. flush the root).
- Only one flusher is allowed at a time. If someone else tries to flush
  concurrently, they skip and return immediately.
- A periodic flusher flushes all the stats every 2 seconds.

The reason this approach is followed is because all flushes are serialized
by a global rstat spinlock.  On the memcg side, flushing is invoked from
userspace reads as well as in-kernel flushers (e.g.  reclaim, refault,
etc).  This approach aims to avoid serializing all flushers on the global
lock, which can cause a significant performance hit under high
concurrency.

This approach has the following problems:
- Occasionally a userspace read of the stats of a non-root cgroup will
  be too expensive as it has to flush the entire hierarchy [1].
- Sometimes stats accuracy is compromised if there is an ongoing
  flush, and we skip and return before the subtree of interest is
  actually flushed, yielding stale stats (by up to 2s due to periodic
  flushing). This is more visible when reading stats from userspace,
  but can also affect in-kernel flushers.

The latter problem is particularly a concern when userspace reads stats
after an event occurs, but gets stats from before the event. Examples:
- When memory usage / pressure spikes, a userspace OOM handler may look
  at the stats of different memcgs to select a victim based on various
  heuristics (e.g. how much private memory will be freed by killing
  this). Reading stale stats from before the usage spike in this case
  may cause a wrongful OOM kill.
- A proactive reclaimer may read the stats after writing to
  memory.reclaim to measure the success of the reclaim operation. Stale
  stats from before reclaim may give a false negative.
- Reading the stats of a parent and a child memcg may be inconsistent
  (child larger than parent), if the flush doesn't happen when the
  parent is read, but happens when the child is read.

As for in-kernel flushers, they will occasionally get stale stats.  No
regressions are currently known from this, but if there are regressions,
they would be very difficult to debug and link to the source of the
problem.

This patch aims to fix these problems by restoring subtree flushing, and
removing the unified/coalesced flushing logic that skips flushing if there
is an ongoing flush.  This change would introduce a significant regression
with global stats flushing thresholds.  With per-memcg stats flushing
thresholds, this seems to perform really well.  The thresholds protect the
underlying lock from unnecessary contention.
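
A sketch of the resulting flush entry point, simplified from the upstream
series (helper names follow the commits in this series):

  void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
  {
          if (mem_cgroup_disabled())
                  return;

          if (!memcg)
                  memcg = root_mem_cgroup;

          /* the per-memcg threshold keeps the rstat lock uncontended */
          if (memcg_vmstats_needs_flush(memcg->vmstats))
                  do_flush_stats(memcg); /* rstat flush of this subtree only */
  }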

This patch was tested in two ways to ensure the latency of flushing is
up to par, on a machine with 384 cpus:

- A synthetic test with 5000 concurrent workers in 500 cgroups doing
  allocations and reclaim, as well as 1000 readers for memory.stat
  (variation of [2]). No regressions were noticed in the total runtime.
  Note that significant regressions in this test are observed with
  global stats thresholds, but not with per-memcg thresholds.

- A synthetic stress test for concurrently reading memcg stats while
  memory allocation/freeing workers are running in the background,
  provided by Wei Xu [3]. With 250k threads reading the stats every
  100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01%
  of reads take more than 1ms, and no reads take more than 100ms.

[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/

[akpm@linux-foundation.org: fix mm/zswap.c]
[yosryahmed@google.com: remove stats flushing mutex]
  Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com
Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:49 +08:00
Yosry Ahmed 1a9570e74d mm: workingset: move the stats flush into workingset_test_recent()
Upstream: commit b006847222623ac3cda8589d15379eac86a2bcb7
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

The workingset code flushes the stats in workingset_refault() to get
accurate stats of the eviction memcg.  In preparation for more scoped
flushing and for passing the eviction memcg to the flush call, move the
call to workingset_test_recent(), where we have a pointer to the eviction
memcg.

The flush call is sleepable, and cannot be made in an rcu read section.
Hence, minimize the rcu read section by also moving it into
workingset_test_recent().  Furthermore, instead of holding the rcu read
lock throughout workingset_test_recent(), only hold it briefly to get a
ref on the eviction memcg.  This allows us to make the flush call after we
get the eviction memcg.
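
A sketch of the pattern, simplified from the upstream commit (memcgid
comes from the unpacked shadow entry in the surrounding function):

  rcu_read_lock();
  eviction_memcg = mem_cgroup_from_id(memcgid);
  if (!mem_cgroup_tryget(eviction_memcg))
          eviction_memcg = NULL;
  rcu_read_unlock();

  if (!eviction_memcg)
          return false;

  /* sleepable flush, now safely outside the RCU read section */
  mem_cgroup_flush_stats_ratelimited();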

As for workingset_refault(), nothing else there appears to be protected by
rcu.  The memcg of the faulted folio (which is not necessarily the same as
the eviction memcg) is protected by the folio lock, which is held from all
callsites.  Add a VM_BUG_ON() to make sure this doesn't change from under
us.

No functional change intended.

Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:48 +08:00
Yosry Ahmed 0f76dc379d mm: memcg: make stats flushing threshold per-memcg
Upstream: commit 8d59d2214c2362e7a9d185d80b613e632581af7b
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

A global counter for the magnitude of memcg stats update is maintained on
the memcg side to avoid invoking rstat flushes when the pending updates
are not significant.  This avoids unnecessary flushes, which are not very
cheap even if there isn't a lot of stats to flush.  It also avoids
unnecessary lock contention on the underlying global rstat lock.

Make this threshold per-memcg.  The same scheme is followed: percpu (now
also per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold.

This provides two benefits: (a) On large machines with a lot of memcgs,
the global threshold can be reached relatively fast, so guarding the
underlying lock becomes less effective.  Making the threshold per-memcg
avoids this.

(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush.  Per-memcg
counters removes this as a blocker from doing subtree flushes, which helps
avoid unnecessary work when the stats of a small subtree are needed.

Nothing is free, of course.  This comes at a cost: (a) A new per-cpu
counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 bytes.  The extra
memory usage is insignificant.

(b) More work on the update side, although in the common case it will only
be percpu counter updates.  The amount of work scales with the number of
ancestors (i.e.  tree depth).  This is not a new concept; adding a cgroup
to the rstat tree involves a parent loop, and so does charging.  Testing results
below show no significant regressions.

(c) The error margin in the stats for the system as a whole increases from
NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * NR_MEMCGS.
This is probably fine because we have a similar per-memcg error in charges
coming from percpu stocks, and we have a periodic flusher that makes sure
we always flush all the stats every 2s anyway.
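
A sketch of the update path under this scheme, close to (but simplified
from) the upstream commit:

  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  {
          int cpu = smp_processor_id();
          unsigned int x;

          if (!val)
                  return;

          cgroup_rstat_updated(memcg->css.cgroup, cpu);

          for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                  x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
                                            abs(val));

                  if (x < MEMCG_CHARGE_BATCH)
                          continue;

                  /* propagate to the per-memcg atomic past the threshold */
                  if (!memcg_should_flush_stats(memcg))
                          atomic64_add(x, &memcg->vmstats->stats_updates);
                  __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
          }
  }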

This patch was tested to make sure no significant regressions are
introduced on the update path as follows.  The following benchmarks were
run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):

(1) Running 22 instances of netperf on a 44 cpu machine with
hyperthreading disabled. All instances are run in a level 2 cgroup, as
well as netserver:
  # netserver -6
  # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Averaging 20 runs, the numbers are as follows:
Base: 40198.0 mbps
Patched: 38629.7 mbps (-3.9%)

The regression is minimal, especially for 22 instances in the same
cgroup sharing all ancestors (so updating the same atomics).

(2) will-it-scale page_fault tests. These tests (specifically
per_process_ops in page_fault3 test) detected a 25.9% regression before
for a change in the stats update path [1]. These are the
numbers from 10 runs (+ is good) on a machine with 256 cpus:

             LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
------------------------------+-------------+-------------+-------------
  page_fault1_per_process_ops |             |             |            |
  (A) base                    | 270249.164  | 265437.000  | 13451.836  |
  (B) patched                 | 261368.709  | 255725.000  | 13394.767  |
                              | -3.29%      | -3.66%      |            |
  page_fault1_per_thread_ops  |             |             |            |
  (A) base                    | 242111.345  | 239737.000  | 10026.031  |
  (B) patched                 | 237057.109  | 235305.000  | 9769.687   |
                              | -2.09%      | -1.85%      |            |
  page_fault1_scalability     |             |             |            |
  (A) base                    | 0.034387    | 0.035168    | 0.0018283  |
  (B) patched                 | 0.033988    | 0.034573    | 0.0018056  |
                              | -1.16%      | -1.69%      |            |
  page_fault2_per_process_ops |             |             |            |
  (A) base                    | 203561.836  | 203301.000  | 2550.764   |
  (B) patched                 | 197195.945  | 197746.000  | 2264.263   |
                              | -3.13%      | -2.73%      |            |
  page_fault2_per_thread_ops  |             |             |            |
  (A) base                    | 171046.473  | 170776.000  | 1509.679   |
  (B) patched                 | 166626.327  | 166406.000  | 768.753    |
                              | -2.58%      | -2.56%      |            |
  page_fault2_scalability     |             |             |            |
  (A) base                    | 0.054026    | 0.053821    | 0.00062121 |
  (B) patched                 | 0.053329    | 0.05306     | 0.00048394 |
                              | -1.29%      | -1.41%      |            |
  page_fault3_per_process_ops |             |             |            |
  (A) base                    | 1295807.782 | 1297550.000 | 5907.585   |
  (B) patched                 | 1275579.873 | 1273359.000 | 8759.160   |
                              | -1.56%      | -1.86%      |            |
  page_fault3_per_thread_ops  |             |             |            |
  (A) base                    | 391234.164  | 390860.000  | 1760.720   |
  (B) patched                 | 377231.273  | 376369.000  | 1874.971   |
                              | -3.58%      | -3.71%      |            |
  page_fault3_scalability     |             |             |            |
  (A) base                    | 0.60369     | 0.60072     | 0.0083029  |
  (B) patched                 | 0.61733     | 0.61544     | 0.009855   |
                              | +2.26%      | +2.45%      |            |

All regressions seem to be minimal, and within the normal variance for the
benchmark.  The fix for [1] assumes that 3% is noise (and there were no
further practical complaints), so hopefully this means that such
variations in these microbenchmarks do not reflect on practical workloads.

(3) I also ran stress-ng in a nested cgroup and did not observe any
obvious regressions.

[1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/

Link: https://lkml.kernel.org/r/20231129032154.3710765-4-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:48 +08:00
Yosry Ahmed e3b1808e92 mm: memcg: move vmstats structs definition above flushing code
Upstream: commit e0bf1dc859fdd08ef738824710770a30a8069433
Conflicts: resolved
Backport-reason: mm: memcg: subtree stats flushing and thresholds

The following patch will make use of those structs in the flushing code,
so move their definitions (and a few other dependencies) a little bit up
to reduce the diff noise in the following patch.

No functional change intended.

Link: https://lkml.kernel.org/r/20231129032154.3710765-3-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:48 +08:00
Yosry Ahmed 835b80a758 mm: memcg: change flush_next_time to flush_last_time
Upstream: commit 508bed884767a8eb394640bae9edcdf082816c43
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

Patch series "mm: memcg: subtree stats flushing and thresholds", v4.

This series attempts to address shortcomings in today's approach for memcg
stats flushing, namely occasionally stale or expensive stat reads.  The
series does so by changing the threshold that we use to decide whether to
trigger a flush to be per memcg instead of global (patch 3), and then
changing flushing to be per memcg (i.e.  subtree flushes) instead of
global (patch 5).

This patch (of 5):

flush_next_time is an inaccurate name.  It's not the next time that
periodic flushing will happen, it's rather the next time that ratelimited
flushing can happen if the periodic flusher is late.

Simplify its semantics by just storing the timestamp of the last flush
instead, flush_last_time.  Move the 2*FLUSH_TIME addition to
mem_cgroup_flush_stats_ratelimited(), and add a comment explaining it.
This way, all the ratelimiting semantics live in one place.
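
With that, the ratelimited entry point reduces to roughly the following
(essentially the upstream version):

  static u64 flush_last_time;

  void mem_cgroup_flush_stats_ratelimited(void)
  {
          /* Only flush if the periodic flusher is one full cycle late */
          if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH_TIME))
                  mem_cgroup_flush_stats();
  }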

No functional change intended.

Link: https://lkml.kernel.org/r/20231129032154.3710765-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20231129032154.3710765-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:47 +08:00
Kairui Song 9faef83a53 kabi: move reservation in mem_cgroup to tail
Upstream: no

It was mistakenly placed in the middle of the struct; move it to the tail.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:47 +08:00
Kairui Song 2488b07a78 dist: config: update config
Upstream: no

Reduce NODES_SHIFT to 8, leaving more space for page flags.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:47 +08:00
Kairui Song 690dae3cfe dist: disable non kernel pkg on non default config
Upstream: no

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:46 +08:00
Kairui Song 432d074030 dist: add eks base config
Copied from 0017 with EMM enabled and tidied up with defconfig.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 15:36:21 +08:00
Kairui Song ec94d992b4 eks: net/toa: add ali_cip support
Upstream: no

As required for EKS.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 15:36:21 +08:00
Kairui Song 1583c6df5c eks: kvm/x86: introduce CONFIG_KVM_FORCE_PVCLOCK
Upstream: no

Allow forcing the use of PVCLOCK for better performance, at the
cost of some time accuracy.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 15:36:20 +08:00
linuszeng 390cfbd393 dist: kernel.template.spec: add lz4 build request
Upstream: no

Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
2024-04-03 15:36:20 +08:00
frankjpliu 62cfac3173 Merge branch 'herberthbli/scx' into 'master' (merge request !34)
herberthbli/scx
These commits are from upstream; they are preparation for sched_ext.
2024-03-29 03:02:59 +00:00
Jianping Liu 66d114febb checkpatch: add Signed-off-by check if commit cherry-pick from upstream
When a commit is cherry-picked from upstream, we should add a
Signed-off-by line to the commit message. With the Signed-off-by info, we
can easily tell who picked the commit and ask them why the pick was
needed.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-27 21:59:03 +08:00
Alex Shi f55a71cb98 checkpatch: check the backported commit for CID reference
Accept a few kinds of references:
	commit xxx upstream or
	Upstream commit xxx or
	[ Upstream commit xxx or
	Upstream commit:

Signed-off-by: Alex Shi <alexsshi@tencent.com>
Acked-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-27 21:41:27 +08:00
David Vernet 991fb56f6c selftests/bpf: Test pinning bpf timer to a core
Upstream commit 0d7ae06860753bb30b3731302b994da071120d00

Now that we support pinning a BPF timer to the current core, we should
test it with some selftests. This patch adds two new testcases to the
timer suite, which verify that a BPF timer, both with and without
BPF_F_TIMER_ABS, can be pinned to the calling core with BPF_F_TIMER_CPU_PIN.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-3-void@manifault.com
2024-03-27 18:09:18 +08:00
David Vernet c12dbc1bdd bpf: Add ability to pin bpf timer to calling CPU
Upstream commit d6247ecb6c1e17d7a33317090627f5bfe563cbb2

BPF supports creating high resolution timers using bpf_timer_* helper
functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which
specifies that the timeout should be interpreted as absolute time. It
would also be useful to be able to pin that timer to a core. For
example, if you wanted to make a subset of cores run without timer
interrupts, and only have the timer be invoked on a single core.

This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag.
When specified, the HRTIMER_MODE_PINNED flag is passed to
hrtimer_start(). A subsequent patch will update selftests to validate.
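
A sketch of how a BPF program might use the new flag; the map, section,
and callback names are illustrative, while the helper calls and
BPF_F_TIMER_CPU_PIN are the real API:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  struct elem {
          struct bpf_timer t;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, int);
          __type(value, struct elem);
  } timer_map SEC(".maps");

  static int timer_cb(void *map, int *key, struct bpf_timer *timer)
  {
          return 0; /* fires on the CPU that armed the timer */
  }

  SEC("fentry/bpf_fentry_test1")
  int BPF_PROG(start_pinned_timer)
  {
          int key = 0;
          struct elem *e = bpf_map_lookup_elem(&timer_map, &key);

          if (!e)
                  return 0;

          bpf_timer_init(&e->t, &timer_map, 1 /* CLOCK_MONOTONIC */);
          bpf_timer_set_callback(&e->t, timer_cb);
          /* pin the callback to the CPU calling bpf_timer_start() */
          bpf_timer_start(&e->t, 1000000 /* 1ms */, BPF_F_TIMER_CPU_PIN);
          return 0;
  }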

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com
2024-03-27 18:09:14 +08:00
Ingo Molnar ed516134e3 sched/fair: Rename check_preempt_curr() to wakeup_preempt()
Upstream commit e23edc86b09df655bf8963bbcb16647adc787395

The name is a bit opaque - make it clear that this is about wakeup
preemption.

Also rename the ->check_preempt_curr() methods similarly.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-03-27 18:09:11 +08:00
Ingo Molnar 0aaab31170 sched/fair: Rename check_preempt_wakeup() to check_preempt_wakeup_fair()
Upstream commit 82845683ca6a15fe8c7912c6264bb0e84ec6f5fb

Other scheduling classes already postfix their similar methods
with the class name.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-03-27 18:09:07 +08:00
frankjpliu 175e1c3850 Merge branch 'leonylgao/master' into 'master' (merge request !25)
kabi: provide kabi check/update/create commands for local users
2024-03-26 09:38:26 +00:00
Yongliang Gao 618d09a6f8 script: update check-kabi script
Upstream: no

The check-kabi script was copied from tkernel4 and fails to run in
tkernel5:

Traceback (most recent call last):
  File "./scripts/check-kabi", line 143, in <module>
    load_symvers(symvers,symvers_file)
  File "./scripts/check-kabi", line 44, in load_symvers
    checksum,symbol,directory,type = string.split(in_line)
ValueError: too many values to unpack

Update the script with the copy from dist/sources/check-kabi.

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-26 14:15:24 +08:00
Yongliang Gao 0c01b9d8fb kabi: provide kabi check/update/create commands for local users
Upstream: no

Provides kabi check/update/create commands for local users:
1. Check whether the TencentOS Kernel KABI is compatible
2. Update the TencentOS Kernel KABI file
3. Create the TencentOS Kernel KABI file

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-26 14:10:56 +08:00
Yongliang Gao 0512e1e0ee config: add kernel/configs/tkci.config
To make it easy to get the config used in tkci, add
kernel/configs/tkci.config; users can run "make tencentconfig
tkci.config" to generate the .config.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: aurelianliu <aurelianliu@tencent.com>
2024-03-26 11:36:24 +08:00
Xinghui Li 0f1642a8b2 pci: bypass NVMe when booting PCIe storage with 5s delay
Commit 762cad7 ("pci: delay 5s to probe multiple storage controllers")
aimed to make the SCSI device mount order predictable. But NVMe devices
do not need the delay, which instead increases the boot time of storage
servers. Therefore we bypass NVMe devices here.
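
Hypothetically, the bypass could key off the NVMe class code; the helper
below is an illustrative sketch, not the actual patch:

  /* Illustrative only; the real patch may structure this differently. */
  static bool storage_probe_delay_needed(struct pci_dev *pdev)
  {
          /* NVMe (PCI class 0x010802) probes async; keep its boot fast */
          if (pdev->class == PCI_CLASS_STORAGE_EXPRESS)
                  return false;

          return true; /* other storage controllers keep the 5s delay */
  }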

Signed-off-by: Xinghui Li <korantli@tencent.com>
Signed-off-by: Samuel Liao <samuelliao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:47 +08:00
Liu Yu 9cb9672adf pci: prohibit storage probe delay of virtio block device
The virtio block device has no async probe path, so it doesn't need the
probe delay. This patch reduces kernel boot time by about 5s.

Signed-off-by: Xiaoming Gao <newtongao@tencent.com>
Signed-off-by: Liu Yu <allanyuliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:47 +08:00
Samuel Liao 9501ffdbf1 pci: delay 5s to probe multiple storage controllers
For predictable disk order.

Signed-off-by: Samuel Liao <samuelliao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:46 +08:00
costinchen ea51a4e717 spec: add support for secureboot by signing the vmlinuz.
spec: add dependency on libtool to build on koji.

Signed-off-by: Sinong Chen <costinchen@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:46 +08:00
Jianping Liu 855ffa3aaa config: enable CONFIG_ACPI_AGDI to support NMI
The AmpereONE SoC supports NMI interrupts, which requires enabling
CONFIG_ACPI_AGDI.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 17:09:39 +08:00
Jianping Liu 0392975b4f config: enable ANDROID_BINDER to support android container
Cloud gaming needs to run Android containers, so add the following configs:
CONFIG_ANDROID_BINDER_IPC=y
CONFIG_ANDROID_BINDERFS=y
CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"
CONFIG_ANDROID_BINDER_IPC_SELFTEST=y

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 16:59:36 +08:00
Jianping Liu 4faa03afdc dist: add a modules-public rpm subpackage
TK has some kernel modules (such as nvidia.ko) that are only used in the
public release version; split them into a modules-public subpackage.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 12:12:14 +08:00
Jianping Liu 83c70cfab6 dist: rename modules-removable-media to modules-public-removable-media
Modules in kernel*modules-removable-media*.rpm are the drivers for removable
media. Using them raises attack risk, and they are not used in the private
release, only in the public release. So, rename it.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 12:06:30 +08:00
Jianping Liu 2596824741 dist: tks: add a removable media modules pkg
TK4 has this subpackage, so add it back to stay compatible, and use
filter-modules.sh instead, which does a depmod check after splitting to
avoid depmod failures once some modules are split into subpackages.

Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 11:55:08 +08:00
frankjpliu da878504cf Merge branch 'cunhuang/master' into 'master' (merge request !12)
sync some ampere changes from upstream
2024-03-14 11:41:36 +00:00
aurelianliu dbc51490d1 x86 and arm64 config: add more module config
Add module configs from ocks and rhel.

Signed-off-by: aurelianliu <aurelianliu@tencent.com>
2024-03-14 11:39:50 +00:00
Ilkka Koskinen 3e14a8ae4f perf vendor events arm64 AmpereOneX: Add core PMU events and metrics
commit 16438b652b464ef7d0a877d31e93ab54338f6b0a upstream.

Add JSON files for AmpereOneX core PMU events and metrics.

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20231201021550.1109196-4-ilkka@os.amperecomputing.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:17:58 +08:00
Oliver Upton 59149005ec KVM: arm64: Always invalidate TLB for stage-2 permission faults
commit be097997a273259f1723baac5463cf19d8564efa upstream.

It is possible for multiple vCPUs to fault on the same IPA and attempt
to resolve the fault. One of the page table walks will actually update
the PTE and the rest will return -EAGAIN per our race detection scheme.
KVM elides the TLB invalidation on the racing threads as the return
value is nonzero.

Before commit a12ab1378a ("KVM: arm64: Use local TLBI on permission
relaxation") KVM always used broadcast TLB invalidations when handling
permission faults, which had the convenient property of making the
stage-2 updates visible to all CPUs in the system. However now we do a
local invalidation, and TLBI elision leads to the vCPU thread faulting
again on the stale entry. Remember that the architecture permits the TLB
to cache translations that precipitate a permission fault.

Invalidate the TLB entry responsible for the permission fault if the
stage-2 descriptor has been relaxed, regardless of which thread actually
did the job.
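
Simplified from the upstream diff in the permission-relaxation path, the
shape of the fix is roughly:

  ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level,
                                 KVM_PGTABLE_WALK_HANDLE_FAULT |
                                 KVM_PGTABLE_WALK_SHARED);
  /* flush even when this walker lost the race: the PTE was still relaxed */
  if (!ret || ret == -EAGAIN)
          kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);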

Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230922223229.1608155-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:17:19 +08:00
Oliver Upton fc51f30f7c KVM: arm64: Avoid soft lockups due to I-cache maintenance
commit 909b583f81b5bb5a398d4580543f59b908a86ccc upstream.

Gavin reports of soft lockups on his Ampere Altra Max machine when
backing KVM guests with hugetlb pages. Upon further investigation, it
was found that the system is unable to keep up with parallel I-cache
invalidations done by KVM's stage-2 fault handler.

This is ultimately an implementation problem. I-cache maintenance
instructions are available at EL0, so nothing stops a malicious
userspace from hammering a system with CMOs and cause it to fall over.
"Fixing" this problem in KVM is nothing more than slapping a bandage
over a much deeper problem.

Anyway, the kernel already has a heuristic for limiting TLB
invalidations to avoid soft lockups. Reuse that logic to limit I-cache
CMOs done by KVM to map executable pages on systems without FEAT_DIC.
While at it, restructure __invalidate_icache_guest_page() to improve
readability and squeeze our new condition into the existing branching
structure.
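
A sketch of the restructured helper, simplified from the upstream commit
(the range limit reuses the DVM heuristic from the companion rename patch):

  static inline void __invalidate_icache_guest_page(void *va, size_t size)
  {
          /*
           * Blow the whole I-cache if it is aliasing (i.e. VIPT) or the
           * invalidation range exceeds our limit on invalidations by
           * cache line.
           */
          if (icache_is_aliasing() || size > __invalidate_icache_max_range())
                  icache_inval_all_pou();
          else
                  icache_inval_pou((unsigned long)va, (unsigned long)va + size);
  }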

Link: https://lore.kernel.org/kvmarm/20230904072826.1468907-1-gshan@redhat.com/
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230920080133.944717-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:16:44 +08:00
Oliver Upton 3a23b1b952 arm64: tlbflush: Rename MAX_TLBI_OPS
commit ec1c3b9ff16082f880b304be40992568f4eee6a7 upstream.

Perhaps unsurprisingly, I-cache invalidations suffer from performance
issues similar to TLB invalidations on certain systems. TLB and I-cache
maintenance all result in DVM on the mesh, which is where the real
bottleneck lies.

Rename the heuristic to point the finger at DVM, such that it may be
reused for limiting I-cache invalidations.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230920080133.944717-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:16:10 +08:00
Ilkka Koskinen 7c2a440c1d docs/perf: Add ampere_cspmu to toctree to fix a build warning
commit 0abe7f61c28d62ee0530c31589e6ea209aa82cbd upstream.

Add ampere_cspmu to toctree in order to address the following warning
produced when building documents:

	Documentation/admin-guide/perf/ampere_cspmu.rst: WARNING: document isn't included in any toctree

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/all/20231011172250.5a6498e5@canb.auug.org.au/
Fixes: 53a810ad3c5c ("perf: arm_cspmu: ampere_cspmu: Add support for Ampere SoC PMU")
Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20231012074103.3772114-1-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:15:27 +08:00
Ilkka Koskinen 34dc55de64 perf: arm_cspmu: ampere_cspmu: Add support for Ampere SoC PMU
commit 53a810ad3c5cde674cac71e629e6d10bfc9d838c upstream.

Ampere SoC PMU follows CoreSight PMU architecture. It uses implementation
specific registers to filter events rather than PMEVFILTnR registers.

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20230913233941.9814-5-ilkka@os.amperecomputing.com
[will: Include linux/io.h in ampere_cspmu.c for writel()]
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:14:50 +08:00
Ilkka Koskinen e46dc8f19f perf: arm_cspmu: Support implementation specific validation
commit 647d5c5a9e7672e285f54f0e141ee759e69382f2 upstream.

Some platforms may use, e.g., a different filtering mechanism and, thus,
may need a different way to validate the events and groups.

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20230913233941.9814-4-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:13:56 +08:00
Ilkka Koskinen 773c54aa27 perf: arm_cspmu: Support implementation specific filters
commit 0a7603ab242e9bab530227cf0d0d344d4e334acc upstream.

ARM Coresight PMU architecture specification [1] defines PMEVTYPER and
PMEVFILT* registers as optional in Chapter 2.1. Moreover, implementers may
choose to use PMIMPDEF* registers (offset: 0xD80 -> 0xDFF) to filter the
events. Add support for those by adding implementation specific filter
callback function.

[1] https://developer.arm.com/documentation/ihi0091/latest
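
A sketch of what an implementation-specific filter callback might look
like; the callback slot follows the driver's impl-ops style, while the
function name and register offset below are hypothetical placeholders:

  /* Hypothetical backend callback; PMIMPDEF_FILTER_OFFSET is a placeholder. */
  static void impl_set_ev_filter(struct arm_cspmu *cspmu,
                                 struct hw_perf_event *hwc, u32 filter)
  {
          writel(filter, cspmu->base0 + PMIMPDEF_FILTER_OFFSET);
  }

  /* in the backend's init callback: */
  cspmu->impl.ops.set_ev_filter = impl_set_ev_filter;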

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Reviewed-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20230913233941.9814-3-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:12:35 +08:00
Ilkka Koskinen 2fddeaf9e6 perf: arm_cspmu: Split 64-bit write to 32-bit writes
commit 8c282414ca6209977cb6d6cc66470ca2d1e56bf6 upstream.

Split the 64-bit register accesses if 64-bit access is not supported
by the PMU.

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Reviewed-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20230913233941.9814-2-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:11:41 +08:00
Besar Wicaksono 8365e42c05 perf: arm_cspmu: Separate Arm and vendor module
commit bfc653aa89cb05796d7b4e046600accb442c9b7a upstream.

The Arm CoreSight PMU driver consists of main standard code and
vendor backend code. Both are currently built as a single module.
This patch adds a vendor registration API to separate the two and
keep things modular. The main driver requests each known backend
module during initialization and defers the device binding process.
The backend module then registers an init callback with the main
driver and continues the device driver binding process.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-and-tested-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20230821231608.50911-1-bwicaksono@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:10:46 +08:00
Jianping Liu e5904891ad config: enable slub debug as default in debug.config
Add CONFIG_SLUB_DEBUG_ON=y only in debug.config; it will not affect
the release config.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-13 23:21:58 +08:00
Jianping Liu 0a34841932 config: enable CONFIG_HARDLOCKUP_DETECTOR
Linux 6.6 supports hard lockup detection on aarch64; enable it.
It is useful for debugging spin deadlocks with IRQs disabled.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-13 23:16:49 +08:00
leonylgao 5034e33943 Merge branch 'frankjpliu/master' into 'master' (merge request !10)
sync some CONFIG changes from tk4
2024-03-11 08:31:18 +00:00