Commit Graph

1221847 Commits

Author SHA1 Message Date
Kairui Song fd77451861 emm: memcg, zram: add support for ZRAM memory accounting
Upstream: alternative

Add a CONFIG_MEMCG_ZRAM option for the ZRAM driver to use later. This
commit only adds the basic structures.

The current plan is to implement the accounting at the ZRAM block level
for simplicity of design; we may move it to the zpool level later for
unified zram / zswap accounting.
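
As a loosely hedged sketch, a block-level accounting hook might look
something like the following; CONFIG_MEMCG_ZRAM is from this commit, but
the field and helper names here are hypothetical illustrations, not the
actual patch:

  #ifdef CONFIG_MEMCG_ZRAM
  /* Hypothetical per-memcg counter of ZRAM-backed bytes; the real
   * structures may differ. */
  static inline void mem_cgroup_zram_charge(struct mem_cgroup *memcg,
                                            long nr_bytes)
  {
          atomic_long_add(nr_bytes, &memcg->zram_bytes);
  }
  #endif /* CONFIG_MEMCG_ZRAM */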

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:50 +08:00
Kairui Song 5e60af62c1 emm: mm: make it possible to disable memcg kmem by default
Upstream: no

Introduce a MEMCG_KMEM_DEFAULT_OFF config option.
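
A minimal sketch of how such a default could be wired up, assuming the
existing cgroup_memory_nokmem toggle in mm/memcontrol.c; the exact hookup
in this patch may differ:

  /* Assumption: let the new Kconfig option set the default of the
   * existing cgroup.memory=nokmem toggle. Illustrative only. */
  static bool cgroup_memory_nokmem __ro_after_init =
          IS_ENABLED(CONFIG_MEMCG_KMEM_DEFAULT_OFF);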

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:49 +08:00
Yosry Ahmed 4486196118 mm: memcg: optimize parent iteration in memcg_rstat_updated()
Upstream: commit 9cee7e8ef3e31ca25b40ca52b8585dc6935deff2
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

In memcg_rstat_updated(), we iterate the memcg being updated and its
parents to update memcg->vmstats_percpu->stats_updates in the fast path
(i.e. no atomic updates). According to my math, this is 3 memory loads
(and potentially 3 cache misses) per memcg:
- Load the address of memcg->vmstats_percpu.
- Load vmstats_percpu->stats_updates (based on some percpu calculation).
- Load the address of the parent memcg.

Avoid most of the cache misses by caching a pointer from each struct
memcg_vmstats_percpu to its parent on the corresponding CPU. In this
case, for the first memcg we have 2 memory loads (same as above):
- Load the address of memcg->vmstats_percpu.
- Load vmstats_percpu->stats_updates (based on some percpu calculation).

Then for each additional memcg, we need a single load to get the
parent's stats_updates directly. This reduces the number of loads from
O(3N) to O(2+N) -- where N is the number of memcgs we need to iterate.

Additionally, stash a pointer to memcg->vmstats in each struct
memcg_vmstats_percpu such that we can access the atomic counter that all
CPUs fold into, memcg->vmstats->stats_updates.
memcg_should_flush_stats() is changed to memcg_vmstats_needs_flush() to
accept a struct memcg_vmstats pointer accordingly.

In struct memcg_vmstats_percpu, make sure both pointers together with
stats_updates live on the same cacheline. Finally, update
mem_cgroup_alloc() to take in a parent pointer and initialize the new
cache pointers on each CPU. The percpu loop in mem_cgroup_alloc() may
look concerning, but there are multiple similar loops in the cgroup
creation path (e.g. cgroup_rstat_init()), most of which are hidden
within alloc_percpu().
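
A sketch of the resulting layout, simplified from the upstream commit
(the other percpu counters are elided):

  struct memcg_vmstats_percpu {
          /* Stats updates since the last flush */
          unsigned int                    stats_updates;

          /* Cached pointers for fast iteration in memcg_rstat_updated() */
          struct memcg_vmstats_percpu     *parent;
          struct memcg_vmstats            *vmstats;

          /* ... stat and event counters elided ... */
  };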

According to Oliver's testing [1], this fixes multiple 30-38%
regressions in vm-scalability, will-it-scale-tlb_flush2, and
will-it-scale-fallocate1. This comes at a cost of 2 more pointers per
CPU (<2KB on a machine with 128 CPUs).

[1] https://lore.kernel.org/lkml/ZbDJsfsZt2ITyo61@xsang-OptiPlex-9020/

[yosryahmed@google.com: fix struct memcg_vmstats_percpu size and alignment]
  Link: https://lkml.kernel.org/r/20240203044612.1234216-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20240124100023.660032-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Fixes: 8d59d2214c23 ("mm: memcg: make stats flushing threshold per-memcg")
Tested-by: kernel test robot <oliver.sang@intel.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401221624.cb53a8ca-oliver.sang@intel.com
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:49 +08:00
Yosry Ahmed f7a35d7bb7 mm: memcg: restore subtree stats flushing
Upstream: commit 7d7ef0a4686abe43cd76a141b340a348f45ecdf2
Conflicts: Skip the change in zswap.c due to missing b5ba474f3f51;
    should be OK, a later backport will easily notice the change of
    function params.
Backport-reason: mm: memcg: subtree stats flushing and thresholds

Stats flushing for memcg currently follows the following rules:
- Always flush the entire memcg hierarchy (i.e. flush the root).
- Only one flusher is allowed at a time. If someone else tries to flush
  concurrently, they skip and return immediately.
- A periodic flusher flushes all the stats every 2 seconds.

The reason this approach is followed is because all flushes are serialized
by a global rstat spinlock.  On the memcg side, flushing is invoked from
userspace reads as well as in-kernel flushers (e.g.  reclaim, refault,
etc).  This approach aims to avoid serializing all flushers on the global
lock, which can cause a significant performance hit under high
concurrency.

This approach has the following problems:
- Occasionally a userspace read of the stats of a non-root cgroup will
  be too expensive as it has to flush the entire hierarchy [1].
- Sometimes stats accuracy is compromised if there is an ongoing
  flush, and we skip and return before the subtree of interest is
  actually flushed, yielding stale stats (by up to 2s due to periodic
  flushing). This is more visible when reading stats from userspace,
  but can also affect in-kernel flushers.

The latter problem is particularly a concern when userspace reads stats
after an event occurs, but gets stats from before the event. Examples:
- When memory usage / pressure spikes, a userspace OOM handler may look
  at the stats of different memcgs to select a victim based on various
  heuristics (e.g. how much private memory will be freed by killing
  this). Reading stale stats from before the usage spike in this case
  may cause a wrongful OOM kill.
- A proactive reclaimer may read the stats after writing to
  memory.reclaim to measure the success of the reclaim operation. Stale
  stats from before reclaim may give a false negative.
- Reading the stats of a parent and a child memcg may be inconsistent
  (child larger than parent), if the flush doesn't happen when the
  parent is read, but happens when the child is read.

As for in-kernel flushers, they will occasionally get stale stats.  No
regressions are currently known from this, but if there are regressions,
they would be very difficult to debug and link to the source of the
problem.

This patch aims to fix these problems by restoring subtree flushing, and
removing the unified/coalesced flushing logic that skips flushing if there
is an ongoing flush.  This change would introduce a significant regression
with global stats flushing thresholds.  With per-memcg stats flushing
thresholds, this seems to perform really well.  The thresholds protect the
underlying lock from unnecessary contention.
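
A sketch of the resulting flush entry point, simplified from the upstream
series (helper names follow the commits in this series):

  void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
  {
          if (mem_cgroup_disabled())
                  return;

          if (!memcg)
                  memcg = root_mem_cgroup;

          /* the per-memcg threshold keeps the rstat lock uncontended */
          if (memcg_vmstats_needs_flush(memcg->vmstats))
                  do_flush_stats(memcg); /* rstat flush of this subtree only */
  }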

This patch was tested in two ways to ensure the latency of flushing is
up to par, on a machine with 384 cpus:

- A synthetic test with 5000 concurrent workers in 500 cgroups doing
  allocations and reclaim, as well as 1000 readers for memory.stat
  (variation of [2]). No regressions were noticed in the total runtime.
  Note that significant regressions in this test are observed with
  global stats thresholds, but not with per-memcg thresholds.

- A synthetic stress test for concurrently reading memcg stats while
  memory allocation/freeing workers are running in the background,
  provided by Wei Xu [3]. With 250k threads reading the stats every
  100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01%
  of reads take more than 1ms, and no reads take more than 100ms.

[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/

[akpm@linux-foundation.org: fix mm/zswap.c]
[yosryahmed@google.com: remove stats flushing mutex]
  Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com
Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:49 +08:00
Yosry Ahmed 1a9570e74d mm: workingset: move the stats flush into workingset_test_recent()
Upstream: commit b006847222623ac3cda8589d15379eac86a2bcb7
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

The workingset code flushes the stats in workingset_refault() to get
accurate stats of the eviction memcg.  In preparation for more scoped
flushing and for passing the eviction memcg to the flush call, move the
call to workingset_test_recent(), where we have a pointer to the eviction
memcg.

The flush call is sleepable, and cannot be made in an rcu read section.
Hence, minimize the rcu read section by also moving it into
workingset_test_recent().  Furthermore, instead of holding the rcu read
lock throughout workingset_test_recent(), only hold it briefly to get a
ref on the eviction memcg.  This allows us to make the flush call after we
get the eviction memcg.
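
A sketch of the pattern, simplified from the upstream commit (memcgid
comes from the unpacked shadow entry in the surrounding function):

  rcu_read_lock();
  eviction_memcg = mem_cgroup_from_id(memcgid);
  if (!mem_cgroup_tryget(eviction_memcg))
          eviction_memcg = NULL;
  rcu_read_unlock();

  if (!eviction_memcg)
          return false;

  /* sleepable flush, now safely outside the RCU read section */
  mem_cgroup_flush_stats_ratelimited();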

As for workingset_refault(), nothing else there appears to be protected by
rcu.  The memcg of the faulted folio (which is not necessarily the same as
the eviction memcg) is protected by the folio lock, which is held from all
callsites.  Add a VM_BUG_ON() to make sure this doesn't change from under
us.

No functional change intended.

Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:48 +08:00
Yosry Ahmed 0f76dc379d mm: memcg: make stats flushing threshold per-memcg
Upstream: commit 8d59d2214c2362e7a9d185d80b613e632581af7b
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

A global counter for the magnitude of memcg stats update is maintained on
the memcg side to avoid invoking rstat flushes when the pending updates
are not significant.  This avoids unnecessary flushes, which are not very
cheap even if there isn't a lot of stats to flush.  It also avoids
unnecessary lock contention on the underlying global rstat lock.

Make this threshold per-memcg.  The same scheme is followed: percpu (now
also per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold.

This provides two benefits: (a) On large machines with a lot of memcgs,
the global threshold can be reached relatively fast, so guarding the
underlying lock becomes less effective.  Making the threshold per-memcg
avoids this.

(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush.  Per-memcg
counters removes this as a blocker from doing subtree flushes, which helps
avoid unnecessary work when the stats of a small subtree are needed.

Nothing is free, of course.  This comes at a cost: (a) A new per-cpu
counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 bytes.  The extra
memory usage is insignificant.

(b) More work on the update side, although in the common case it will only
be percpu counter updates.  The amount of work scales with the number of
ancestors (i.e.  tree depth).  This is not a new concept; adding a cgroup
to the rstat tree involves a parent loop, and so does charging.  Testing results
below show no significant regressions.

(c) The error margin in the stats for the system as a whole increases from
NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * NR_MEMCGS.
This is probably fine because we have a similar per-memcg error in charges
coming from percpu stocks, and we have a periodic flusher that makes sure
we always flush all the stats every 2s anyway.
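
A sketch of the update path under this scheme, close to (but simplified
from) the upstream commit:

  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  {
          int cpu = smp_processor_id();
          unsigned int x;

          if (!val)
                  return;

          cgroup_rstat_updated(memcg->css.cgroup, cpu);

          for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                  x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
                                            abs(val));

                  if (x < MEMCG_CHARGE_BATCH)
                          continue;

                  /* propagate to the per-memcg atomic past the threshold */
                  if (!memcg_should_flush_stats(memcg))
                          atomic64_add(x, &memcg->vmstats->stats_updates);
                  __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
          }
  }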

This patch was tested to make sure no significant regressions are
introduced on the update path as follows.  The following benchmarks were
run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):

(1) Running 22 instances of netperf on a 44 cpu machine with
hyperthreading disabled. All instances are run in a level 2 cgroup, as
well as netserver:
  # netserver -6
  # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Averaging 20 runs, the numbers are as follows:
Base: 40198.0 mbps
Patched: 38629.7 mbps (-3.9%)

The regression is minimal, especially for 22 instances in the same
cgroup sharing all ancestors (so updating the same atomics).

(2) will-it-scale page_fault tests. These tests (specifically
per_process_ops in page_fault3 test) detected a 25.9% regression before
for a change in the stats update path [1]. These are the
numbers from 10 runs (+ is good) on a machine with 256 cpus:

             LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
------------------------------+-------------+-------------+-------------
  page_fault1_per_process_ops |             |             |            |
  (A) base                    | 270249.164  | 265437.000  | 13451.836  |
  (B) patched                 | 261368.709  | 255725.000  | 13394.767  |
                              | -3.29%      | -3.66%      |            |
  page_fault1_per_thread_ops  |             |             |            |
  (A) base                    | 242111.345  | 239737.000  | 10026.031  |
  (B) patched                 | 237057.109  | 235305.000  | 9769.687   |
                              | -2.09%      | -1.85%      |            |
  page_fault1_scalability     |             |             |            |
  (A) base                    | 0.034387    | 0.035168    | 0.0018283  |
  (B) patched                 | 0.033988    | 0.034573    | 0.0018056  |
                              | -1.16%      | -1.69%      |            |
  page_fault2_per_process_ops |             |             |            |
  (A) base                    | 203561.836  | 203301.000  | 2550.764   |
  (B) patched                 | 197195.945  | 197746.000  | 2264.263   |
                              | -3.13%      | -2.73%      |            |
  page_fault2_per_thread_ops  |             |             |            |
  (A) base                    | 171046.473  | 170776.000  | 1509.679   |
  (B) patched                 | 166626.327  | 166406.000  | 768.753    |
                              | -2.58%      | -2.56%      |            |
  page_fault2_scalability     |             |             |            |
  (A) base                    | 0.054026    | 0.053821    | 0.00062121 |
  (B) patched                 | 0.053329    | 0.05306     | 0.00048394 |
                              | -1.29%      | -1.41%      |            |
  page_fault3_per_process_ops |             |             |            |
  (A) base                    | 1295807.782 | 1297550.000 | 5907.585   |
  (B) patched                 | 1275579.873 | 1273359.000 | 8759.160   |
                              | -1.56%      | -1.86%      |            |
  page_fault3_per_thread_ops  |             |             |            |
  (A) base                    | 391234.164  | 390860.000  | 1760.720   |
  (B) patched                 | 377231.273  | 376369.000  | 1874.971   |
                              | -3.58%      | -3.71%      |            |
  page_fault3_scalability     |             |             |            |
  (A) base                    | 0.60369     | 0.60072     | 0.0083029  |
  (B) patched                 | 0.61733     | 0.61544     | 0.009855   |
                              | +2.26%      | +2.45%      |            |

All regressions seem to be minimal, and within the normal variance for the
benchmark.  The fix for [1] assumes that 3% is noise (and there were no
further practical complaints), so hopefully this means that such
variations in these microbenchmarks do not reflect on practical workloads.

(3) I also ran stress-ng in a nested cgroup and did not observe any
obvious regressions.

[1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/

Link: https://lkml.kernel.org/r/20231129032154.3710765-4-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:48 +08:00
Yosry Ahmed e3b1808e92 mm: memcg: move vmstats structs definition above flushing code
Upstream: commit e0bf1dc859fdd08ef738824710770a30a8069433
Conflicts: resolved
Backport-reason: mm: memcg: subtree stats flushing and thresholds

The following patch will make use of those structs in the flushing code,
so move their definitions (and a few other dependencies) a little bit up
to reduce the diff noise in the following patch.

No functional change intended.

Link: https://lkml.kernel.org/r/20231129032154.3710765-3-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:48 +08:00
Yosry Ahmed 835b80a758 mm: memcg: change flush_next_time to flush_last_time
Upstream: commit 508bed884767a8eb394640bae9edcdf082816c43
Conflicts: none
Backport-reason: mm: memcg: subtree stats flushing and thresholds

Patch series "mm: memcg: subtree stats flushing and thresholds", v4.

This series attempts to address shortcomings in today's approach for memcg
stats flushing, namely occasionally stale or expensive stat reads.  The
series does so by changing the threshold that we use to decide whether to
trigger a flush to be per memcg instead of global (patch 3), and then
changing flushing to be per memcg (i.e.  subtree flushes) instead of
global (patch 5).

This patch (of 5):

flush_next_time is an inaccurate name.  It's not the next time that
periodic flushing will happen, it's rather the next time that ratelimited
flushing can happen if the periodic flusher is late.

Simplify its semantics by just storing the timestamp of the last flush
instead, flush_last_time.  Move the 2*FLUSH_TIME addition to
mem_cgroup_flush_stats_ratelimited(), and add a comment explaining it.
This way, all the ratelimiting semantics live in one place.
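
With that, the ratelimited entry point reduces to roughly the following
(essentially the upstream version):

  static u64 flush_last_time;

  void mem_cgroup_flush_stats_ratelimited(void)
  {
          /* Only flush if the periodic flusher is one full cycle late */
          if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH_TIME))
                  mem_cgroup_flush_stats();
  }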

No functional change intended.

Link: https://lkml.kernel.org/r/20231129032154.3710765-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20231129032154.3710765-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Chris Li <chrisl@kernel.org> (Google)
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:47 +08:00
Kairui Song 9faef83a53 kabi: move reservation in mem_cgroup to tail
Upstream: no

It was mistakenly placed in the middle of the struct; move it to the tail.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:47 +08:00
Kairui Song 2488b07a78 dist: config: update config
Upstream: no

Reduce NODES_SHIFT to 8, leaving more space for page flags.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:47 +08:00
Kairui Song 690dae3cfe dist: disable non kernel pkg on non default config
Upstream: no

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 16:58:46 +08:00
Kairui Song 432d074030 dist: add eks base config
Copied from 0017 with EMM enabled and tidied up with defconfig.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 15:36:21 +08:00
Kairui Song ec94d992b4 eks: net/toa: add ali_cip support
Upstream: no

As required for EKS.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 15:36:21 +08:00
Kairui Song 1583c6df5c eks: kvm/x86: introduce CONFIG_KVM_FORCE_PVCLOCK
Upstream: no

Allow forcing the use of PVCLOCK for better performance, at the
cost of some time accuracy.

Signed-off-by: Kairui Song <kasong@tencent.com>
2024-04-03 15:36:20 +08:00
linuszeng 390cfbd393 dist: kernel.template.spec: add lz4 build request
Upstream: no

Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
2024-04-03 15:36:20 +08:00
frankjpliu 62cfac3173 Merge branch 'herberthbli/scx' into 'master' (merge request !34)
herberthbli/scx
These commits are from upstream; they are preparation for sched_ext.
2024-03-29 03:02:59 +00:00
Jianping Liu 66d114febb checkpatch: add Signed-off-by check if commit cherry-pick from upstream
When a commit is cherry-picked from upstream, we should add a
Signed-off-by line to the commit message. With the Signed-off-by info, we
can easily tell who picked the commit and ask them why the pick was
needed.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-27 21:59:03 +08:00
Alex Shi f55a71cb98 checkpatch: check the backported commit for CID reference
Accept a few kinds of references:
	commit xxx upstream or
	Upstream commit xxx or
	[ Upstream commit xxx or
	Upstream commit:

Signed-off-by: Alex Shi <alexsshi@tencent.com>
Acked-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-27 21:41:27 +08:00
David Vernet 991fb56f6c selftests/bpf: Test pinning bpf timer to a core
Upstream commit 0d7ae06860753bb30b3731302b994da071120d00

Now that we support pinning a BPF timer to the current core, we should
test it with some selftests. This patch adds two new testcases to the
timer suite, which verify that a BPF timer, both with and without
BPF_F_TIMER_ABS, can be pinned to the calling core with BPF_F_TIMER_CPU_PIN.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-3-void@manifault.com
2024-03-27 18:09:18 +08:00
David Vernet c12dbc1bdd bpf: Add ability to pin bpf timer to calling CPU
Upstream commit d6247ecb6c1e17d7a33317090627f5bfe563cbb2

BPF supports creating high resolution timers using bpf_timer_* helper
functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which
specifies that the timeout should be interpreted as absolute time. It
would also be useful to be able to pin that timer to a core. For
example, if you wanted to make a subset of cores run without timer
interrupts, and only have the timer be invoked on a single core.

This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag.
When specified, the HRTIMER_MODE_PINNED flag is passed to
hrtimer_start(). A subsequent patch will update selftests to validate.
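
A sketch of how a BPF program might use the new flag; the map, section,
and callback names are illustrative, while the helper calls and
BPF_F_TIMER_CPU_PIN are the real API:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  struct elem {
          struct bpf_timer t;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, int);
          __type(value, struct elem);
  } timer_map SEC(".maps");

  static int timer_cb(void *map, int *key, struct bpf_timer *timer)
  {
          return 0; /* fires on the CPU that armed the timer */
  }

  SEC("fentry/bpf_fentry_test1")
  int BPF_PROG(start_pinned_timer)
  {
          int key = 0;
          struct elem *e = bpf_map_lookup_elem(&timer_map, &key);

          if (!e)
                  return 0;

          bpf_timer_init(&e->t, &timer_map, 1 /* CLOCK_MONOTONIC */);
          bpf_timer_set_callback(&e->t, timer_cb);
          /* pin the callback to the CPU calling bpf_timer_start() */
          bpf_timer_start(&e->t, 1000000 /* 1ms */, BPF_F_TIMER_CPU_PIN);
          return 0;
  }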

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com
2024-03-27 18:09:14 +08:00
Ingo Molnar ed516134e3 sched/fair: Rename check_preempt_curr() to wakeup_preempt()
Upstream commit e23edc86b09df655bf8963bbcb16647adc787395

The name is a bit opaque - make it clear that this is about wakeup
preemption.

Also rename the ->check_preempt_curr() methods similarly.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-03-27 18:09:11 +08:00
Ingo Molnar 0aaab31170 sched/fair: Rename check_preempt_wakeup() to check_preempt_wakeup_fair()
Upstream commit 82845683ca6a15fe8c7912c6264bb0e84ec6f5fb

Other scheduling classes already postfix their similar methods
with the class name.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-03-27 18:09:07 +08:00
frankjpliu 175e1c3850 Merge branch 'leonylgao/master' into 'master' (merge request !25)
kabi: provide kabi check/update/create commands for local users
2024-03-26 09:38:26 +00:00
Yongliang Gao 618d09a6f8 script: update check-kabi script
Upstream: no

The check-kabi script was copied from tkernel4 and fails to run in
tkernel5:

Traceback (most recent call last):
  File "./scripts/check-kabi", line 143, in <module>
    load_symvers(symvers,symvers_file)
  File "./scripts/check-kabi", line 44, in load_symvers
    checksum,symbol,directory,type = string.split(in_line)
ValueError: too many values to unpack

Update the script with the copy from dist/sources/check-kabi.

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-26 14:15:24 +08:00
Yongliang Gao 0c01b9d8fb kabi: provide kabi check/update/create commands for local users
Upstream: no

Provides kabi check/update/create commands for local users:
1. Check whether the TencentOS Kernel KABI is compatible
2. Update the TencentOS Kernel KABI file
3. Create the TencentOS Kernel KABI file

Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-26 14:10:56 +08:00
Yongliang Gao 0512e1e0ee config: add kernel/configs/tkci.config
To make it easy to get the config used in tkci, add
kernel/configs/tkci.config; users can run "make tencentconfig
tkci.config" to generate the .config.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: aurelianliu <aurelianliu@tencent.com>
2024-03-26 11:36:24 +08:00
Xinghui Li 0f1642a8b2 pci: bypass NVMe when booting PCIe storage with 5s delay
Commit 762cad7 ("pci: delay 5s to probe multiple storage controllers")
aimed to make the SCSI device mount order predictable. But NVMe devices
do not need the delay, which instead increases the boot time of storage
servers. Therefore we bypass NVMe devices here.
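
Hypothetically, the bypass could key off the NVMe class code; the helper
below is an illustrative sketch, not the actual patch:

  /* Illustrative only; the real patch may structure this differently. */
  static bool storage_probe_delay_needed(struct pci_dev *pdev)
  {
          /* NVMe (PCI class 0x010802) probes async; keep its boot fast */
          if (pdev->class == PCI_CLASS_STORAGE_EXPRESS)
                  return false;

          return true; /* other storage controllers keep the 5s delay */
  }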

Signed-off-by: Xinghui Li <korantli@tencent.com>
Signed-off-by: Samuel Liao <samuelliao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:47 +08:00
Liu Yu 9cb9672adf pci: prohibit storage probe delay of virtio block device
The virtio block device has no async probe path, so it doesn't need the
probe delay. This patch reduces kernel boot time by about 5s.

Signed-off-by: Xiaoming Gao <newtongao@tencent.com>
Signed-off-by: Liu Yu <allanyuliu@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:47 +08:00
Samuel Liao 9501ffdbf1 pci: delay 5s to probe multiple storage controllers
For predictable disk order.

Signed-off-by: Samuel Liao <samuelliao@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:46 +08:00
costinchen ea51a4e717 spec: add support for secureboot by signing the vmlinuz.
spec: add dependency on libtool to build on koji.

Signed-off-by: Sinong Chen <costinchen@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 21:56:46 +08:00
Jianping Liu 855ffa3aaa config: enable CONFIG_ACPI_AGDI to support NMI
The AmpereONE SoC supports NMI interrupts, which requires enabling
CONFIG_ACPI_AGDI.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 17:09:39 +08:00
Jianping Liu 0392975b4f config: enable ANDROID_BINDER to support android container
Cloud gaming needs to run Android containers, so add the following configs:
CONFIG_ANDROID_BINDER_IPC=y
CONFIG_ANDROID_BINDERFS=y
CONFIG_ANDROID_BINDER_DEVICES="binder,hwbinder,vndbinder"
CONFIG_ANDROID_BINDER_IPC_SELFTEST=y

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 16:59:36 +08:00
Jianping Liu 4faa03afdc dist: add a modules-public rpm subpackage
TK has some kernel modules (such as nvidia.ko) that are only used in the
public release version; split them into a modules-public subpackage.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 12:12:14 +08:00
Jianping Liu 83c70cfab6 dist: rename modules-removable-media to modules-public-removable-media
Modules in kernel*modules-removable-media*.rpm are the drivers for removable
media. Using them raises attack risk, and they are not used in the private
release, only in the public release. So, rename it.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-15 12:06:30 +08:00
Jianping Liu 2596824741 dist: tks: add a removable media modules pkg
TK4 has this subpackage, so add it back to stay compatible, and use
filter-modules.sh instead, which does a depmod check after splitting to
avoid depmod failures once some modules are split into subpackages.

Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-03-15 11:55:08 +08:00
frankjpliu da878504cf Merge branch 'cunhuang/master' into 'master' (merge request !12)
sync some ampere changes from upstream
2024-03-14 11:41:36 +00:00
aurelianliu dbc51490d1 x86 and arm64 config: add more module config
Add module configs from ocks and rhel.

Signed-off-by: aurelianliu <aurelianliu@tencent.com>
2024-03-14 11:39:50 +00:00
Ilkka Koskinen 3e14a8ae4f perf vendor events arm64 AmpereOneX: Add core PMU events and metrics
commit 16438b652b464ef7d0a877d31e93ab54338f6b0a upstream.

Add JSON files for AmpereOneX core PMU events and metrics.

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20231201021550.1109196-4-ilkka@os.amperecomputing.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:17:58 +08:00
Oliver Upton 59149005ec KVM: arm64: Always invalidate TLB for stage-2 permission faults
commit be097997a273259f1723baac5463cf19d8564efa upstream.

It is possible for multiple vCPUs to fault on the same IPA and attempt
to resolve the fault. One of the page table walks will actually update
the PTE and the rest will return -EAGAIN per our race detection scheme.
KVM elides the TLB invalidation on the racing threads as the return
value is nonzero.

Before commit a12ab1378a ("KVM: arm64: Use local TLBI on permission
relaxation") KVM always used broadcast TLB invalidations when handling
permission faults, which had the convenient property of making the
stage-2 updates visible to all CPUs in the system. However now we do a
local invalidation, and TLBI elision leads to the vCPU thread faulting
again on the stale entry. Remember that the architecture permits the TLB
to cache translations that precipitate a permission fault.

Invalidate the TLB entry responsible for the permission fault if the
stage-2 descriptor has been relaxed, regardless of which thread actually
did the job.
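
Simplified from the upstream diff in the permission-relaxation path, the
shape of the fix is roughly:

  ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level,
                                 KVM_PGTABLE_WALK_HANDLE_FAULT |
                                 KVM_PGTABLE_WALK_SHARED);
  /* flush even when this walker lost the race: the PTE was still relaxed */
  if (!ret || ret == -EAGAIN)
          kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);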

Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230922223229.1608155-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:17:19 +08:00
Oliver Upton fc51f30f7c KVM: arm64: Avoid soft lockups due to I-cache maintenance
commit 909b583f81b5bb5a398d4580543f59b908a86ccc upstream.

Gavin reports of soft lockups on his Ampere Altra Max machine when
backing KVM guests with hugetlb pages. Upon further investigation, it
was found that the system is unable to keep up with parallel I-cache
invalidations done by KVM's stage-2 fault handler.

This is ultimately an implementation problem. I-cache maintenance
instructions are available at EL0, so nothing stops a malicious
userspace from hammering a system with CMOs and cause it to fall over.
"Fixing" this problem in KVM is nothing more than slapping a bandage
over a much deeper problem.

Anyway, the kernel already has a heuristic for limiting TLB
invalidations to avoid soft lockups. Reuse that logic to limit I-cache
CMOs done by KVM to map executable pages on systems without FEAT_DIC.
While at it, restructure __invalidate_icache_guest_page() to improve
readability and squeeze our new condition into the existing branching
structure.
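
A sketch of the restructured helper, simplified from the upstream commit
(the range limit reuses the DVM heuristic from the companion rename patch):

  static inline void __invalidate_icache_guest_page(void *va, size_t size)
  {
          /*
           * Blow the whole I-cache if it is aliasing (i.e. VIPT) or the
           * invalidation range exceeds our limit on invalidations by
           * cache line.
           */
          if (icache_is_aliasing() || size > __invalidate_icache_max_range())
                  icache_inval_all_pou();
          else
                  icache_inval_pou((unsigned long)va, (unsigned long)va + size);
  }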

Link: https://lore.kernel.org/kvmarm/20230904072826.1468907-1-gshan@redhat.com/
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230920080133.944717-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:16:44 +08:00
Oliver Upton 3a23b1b952 arm64: tlbflush: Rename MAX_TLBI_OPS
commit ec1c3b9ff16082f880b304be40992568f4eee6a7 upstream.

Perhaps unsurprisingly, I-cache invalidations suffer from performance
issues similar to TLB invalidations on certain systems. TLB and I-cache
maintenance all result in DVM on the mesh, which is where the real
bottleneck lies.

Rename the heuristic to point the finger at DVM, such that it may be
reused for limiting I-cache invalidations.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230920080133.944717-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:16:10 +08:00
Ilkka Koskinen 7c2a440c1d docs/perf: Add ampere_cspmu to toctree to fix a build warning
commit 0abe7f61c28d62ee0530c31589e6ea209aa82cbd upstream.

Add ampere_cspmu to toctree in order to address the following warning
produced when building documents:

	Documentation/admin-guide/perf/ampere_cspmu.rst: WARNING: document isn't included in any toctree

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/all/20231011172250.5a6498e5@canb.auug.org.au/
Fixes: 53a810ad3c5c ("perf: arm_cspmu: ampere_cspmu: Add support for Ampere SoC PMU")
Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20231012074103.3772114-1-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:15:27 +08:00
Ilkka Koskinen 34dc55de64 perf: arm_cspmu: ampere_cspmu: Add support for Ampere SoC PMU
commit 53a810ad3c5cde674cac71e629e6d10bfc9d838c upstream.

Ampere SoC PMU follows CoreSight PMU architecture. It uses implementation
specific registers to filter events rather than PMEVFILTnR registers.

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20230913233941.9814-5-ilkka@os.amperecomputing.com
[will: Include linux/io.h in ampere_cspmu.c for writel()]
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:14:50 +08:00
Ilkka Koskinen e46dc8f19f perf: arm_cspmu: Support implementation specific validation
commit 647d5c5a9e7672e285f54f0e141ee759e69382f2 upstream.

Some platforms may use, e.g., a different filtering mechanism and, thus,
may need a different way to validate the events and groups.

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20230913233941.9814-4-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:13:56 +08:00
Ilkka Koskinen 773c54aa27 perf: arm_cspmu: Support implementation specific filters
commit 0a7603ab242e9bab530227cf0d0d344d4e334acc upstream.

ARM Coresight PMU architecture specification [1] defines PMEVTYPER and
PMEVFILT* registers as optional in Chapter 2.1. Moreover, implementers may
choose to use PMIMPDEF* registers (offset: 0xD80 -> 0xDFF) to filter the
events. Add support for those by adding implementation specific filter
callback function.

[1] https://developer.arm.com/documentation/ihi0091/latest
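
A sketch of what an implementation-specific filter callback might look
like; the callback slot follows the driver's impl-ops style, while the
function name and register offset below are hypothetical placeholders:

  /* Hypothetical backend callback; PMIMPDEF_FILTER_OFFSET is a placeholder. */
  static void impl_set_ev_filter(struct arm_cspmu *cspmu,
                                 struct hw_perf_event *hwc, u32 filter)
  {
          writel(filter, cspmu->base0 + PMIMPDEF_FILTER_OFFSET);
  }

  /* in the backend's init callback: */
  cspmu->impl.ops.set_ev_filter = impl_set_ev_filter;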

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Reviewed-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20230913233941.9814-3-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:12:35 +08:00
Ilkka Koskinen 2fddeaf9e6 perf: arm_cspmu: Split 64-bit write to 32-bit writes
commit 8c282414ca6209977cb6d6cc66470ca2d1e56bf6 upstream.

Split the 64-bit register accesses if 64-bit access is not supported
by the PMU.

Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Reviewed-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20230913233941.9814-2-ilkka@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:11:41 +08:00
Besar Wicaksono 8365e42c05 perf: arm_cspmu: Separate Arm and vendor module
commit bfc653aa89cb05796d7b4e046600accb442c9b7a upstream.

The Arm CoreSight PMU driver consists of main standard code and
vendor backend code. Both are currently built as a single module.
This patch adds a vendor registration API to separate the two and
keep things modular. The main driver requests each known backend
module during initialization and defers the device binding process.
The backend module then registers an init callback with the main
driver and continues the device driver binding process.

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-and-tested-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20230821231608.50911-1-bwicaksono@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Huang Cun <cunhuang@tencent.com>
2024-03-14 17:10:46 +08:00
Jianping Liu e5904891ad config: enable slub debug as default in debug.config
Add CONFIG_SLUB_DEBUG_ON=y only in debug.config; it will not affect
the release config.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-13 23:21:58 +08:00
Jianping Liu 0a34841932 config: enable CONFIG_HARDLOCKUP_DETECTOR
Linux 6.6 supports hard lockup detection on aarch64; enable it.
It is useful for debugging spin deadlocks with IRQs disabled.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-03-13 23:16:49 +08:00
leonylgao 5034e33943 Merge branch 'frankjpliu/master' into 'master' (merge request !10)
sync some CONFIG changes from tk4
2024-03-11 08:31:18 +00:00