Commit Graph

128 Commits

Author SHA1 Message Date
Linus Torvalds 86e98ed15b cgroup changes for v6.4-rc1
* cpuset changes including the fix for an incorrect interaction with CPU
   hotplug and an optimization.
 
 * Other doc and cosmetic changes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZErfng4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGVVtAQCDycK4VSgc4nsFPG1vh1Oy1A723ciEUwAbKmV/
 F1n7xwEA68FiDvE29LpMJJuYP9HnX0A5zRMyNnb52kN9jmgcEQI=
 =ALol
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cpuset changes including the fix for an incorrect interaction with
   CPU hotplug and an optimization

 - Other doc and cosmetic changes

* tag 'cgroup-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup-v1/cpusets: update libcgroup project link
  cgroup/cpuset: Minor updates to test_cpuset_prs.sh
  cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated
  cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset
  cpuset: Clean up cpuset_node_allowed
  cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers
2023-04-29 10:05:22 -07:00
Waiman Long 7e27cb6ad4 cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset
It is found that attaching a task to the top_cpuset does not currently
ignore CPUs allocated to subpartitions in cpuset_attach_task(). So the
code is changed to fix that.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-04-12 08:23:58 -10:00
Waiman Long eee8785379 cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods
In the case of CLONE_INTO_CGROUP, not all cpusets are ready to accept
new tasks. It is too late to check that in cpuset_fork(). So we need
to add the cpuset_can_fork() and cpuset_cancel_fork() methods to
pre-check it before we can allow attachment to a different cpuset.

We also need to set the attach_in_progress flag to alert other code
that a new task is going to be added to the cpuset.

Fixes: ef2c41cf38 ("clone3: allow spawning processes into cgroups")
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-04-12 08:23:58 -10:00
Waiman Long 42a11bf5c5 cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
By default, the clone(2) syscall spawn a child process into the same
cgroup as its parent. With the use of the CLONE_INTO_CGROUP flag
introduced by commit ef2c41cf38 ("clone3: allow spawning processes
into cgroups"), the child will be spawned into a different cgroup which
is somewhat similar to writing the child's tid into "cgroup.threads".

The current cpuset_fork() method does not properly handle the
CLONE_INTO_CGROUP case where the cpuset of the child may be different
from that of its parent.  Update the cpuset_fork() method to treat the
CLONE_INTO_CGROUP case similar to cpuset_attach().

Since the newly cloned task has not been running yet, its actual
memory usage isn't known. So it is not necessary to make change to mm
in cpuset_fork().

Fixes: ef2c41cf38 ("clone3: allow spawning processes into cgroups")
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-04-12 08:23:58 -10:00
Waiman Long ba9182a896 cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()
After a successful cpuset_can_attach() call which increments the
attach_in_progress flag, either cpuset_cancel_attach() or cpuset_attach()
will be called later. In cpuset_attach(), tasks in cpuset_attach_wq,
if present, will be woken up at the end. That is not the case in
cpuset_cancel_attach(). So missed wakeup is possible if the attach
operation is somehow cancelled. Fix that by doing the wakeup in
cpuset_cancel_attach() as well.

Fixes: e44193d39e ("cpuset: let hotplug propagation work wait for task attaching")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: stable@vger.kernel.org # v3.11+
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-04-12 08:23:58 -10:00
Waiman Long 6667439f51 cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated
Similar to commit 3fb906e7fa ("group/cpuset: Don't filter offline
CPUs in cpuset_cpus_allowed() for top cpuset tasks"), the whole set of
possible CPUs including offline ones should be used for setting cpumasks
for tasks in the top cpuset when a cpuset partition is modified as the
hotplug code won't update cpumasks for tasks in the top cpuset when
CPUs become online or offline.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-29 08:42:09 -10:00
Waiman Long df59b72cd8 cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset
If a hotplug event doesn't affect the current cpuset, there is no point
to call hotplug_update_tasks() or hotplug_update_tasks_legacy(). So
just skip it.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-29 08:41:28 -10:00
Tejun Heo 5dcad50636 Merge branch 'for-6.3-fixes' into for-6.4
To receive 292fd843de ("cgroup/cpuset: Fix partition root's cpuset.cpus
update bug") in preparation for further cpuset updates.
2023-03-29 08:39:46 -10:00
Waiman Long 292fd843de cgroup/cpuset: Fix partition root's cpuset.cpus update bug
It was found that commit 7a2127e66a ("cpuset: Call
set_cpus_allowed_ptr() with appropriate mask for task") introduced a bug
that corrupted "cpuset.cpus" of a partition root when it was updated.

It is because the tmp->new_cpus field of the passed tmp parameter
of update_parent_subparts_cpumask() should not be used at all as
it contains important cpumask data that should not be overwritten.
Fix it by using tmp->addmask instead.

Also update update_cpumask() to make sure that trialcs->cpu_allowed
will not be corrupted until it is no longer needed.

Fixes: 7a2127e66a ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task")
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-29 08:37:07 -10:00
Haifeng Xu 8e4645226b cpuset: Clean up cpuset_node_allowed
Commit 002f290627 ("cpuset: use static key better and convert to new API")
has used __cpuset_node_allowed() instead of cpuset_node_allowed() to check
whether we can allocate on a memory node. Now this function isn't used by
anyone, so we can do the follow things to clean up it.

1. remove unused codes
2. rename __cpuset_node_allowed() to cpuset_node_allowed()
3. update comments in mm/page_alloc.c

Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-03-23 16:02:27 -10:00
Linus Torvalds 8ca8d89b43 cgroup changes for v6.3-rc1
All the commits are trivial. Doc updates and a trivial code cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCY/Kvjw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGfcZAQDoyqEm807R2iaELEfee0xq/NPwJIaDRnXDP6ZL
 rIEtqAD8CzpvvtKauekMIsCHVq7SuxDae8Hi0a/RImN8e1sy6wY=
 =pUSo
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
 "All the changes are trivial: documentation updates and a trivial code
  cleanup"

* tag 'cgroup-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: fix a few kernel-doc warnings & coding style
  docs: cgroup-v1: use numbered lists for user interface setup
  docs: cgroup-v1: add internal cross-references
  docs: cgroup-v1: make swap extension subsections subsections
  docs: cgroup-v1: use bullet lists for list of stat file tables
  docs: cgroup-v1: move hierarchy of accounting caption
  docs: cgroup-v1: fix footnotes
  docs: cgroup-v1: use code block for locking order schema
  docs: cgroup-v1: wrap remaining admonitions in admonition blocks
  docs: cgroup-v1: replace custom note constructs with appropriate admonition blocks
  cgroup/cpuset: no need to explicitly init a global static variable
2023-02-21 10:36:29 -08:00
Will Deacon 7a2127e66a cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task
set_cpus_allowed_ptr() will fail with -EINVAL if the requested
affinity mask is not a subset of the task_cpu_possible_mask() for the
task being updated. Consequently, on a heterogeneous system with cpusets
spanning the different CPU types, updates to the cgroup hierarchy can
silently fail to update task affinities when the effective affinity
mask for the cpuset is expanded.

For example, consider an arm64 system with 4 CPUs, where CPUs 2-3 are
the only cores capable of executing 32-bit tasks. Attaching a 32-bit
task to a cpuset containing CPUs 0-2 will correctly affine the task to
CPU 2. Extending the cpuset to CPUs 0-3, however, will fail to extend
the affinity mask of the 32-bit task because update_tasks_cpumask() will
pass the full 0-3 mask to set_cpus_allowed_ptr().

Extend update_tasks_cpumask() to take a temporary 'cpumask' paramater
and use it to mask the 'effective_cpus' mask with the possible mask for
each task being updated.

Fixes: 431c69fac0 ("cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()")
Signed-off-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-02-06 10:18:36 -10:00
Waiman Long 3fb906e7fa cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks
Since commit 8f9ea86fdf ("sched: Always preserve the user
requested cpumask"), relax_compatible_cpus_allowed_ptr() is calling
__sched_setaffinity() unconditionally. This helps to expose a bug in
the current cpuset hotplug code where the cpumasks of the tasks in
the top cpuset are not updated at all when some CPUs become online or
offline. It is likely caused by the fact that some of the tasks in the
top cpuset, like percpu kthreads, cannot have their cpu affinity changed.

One way to reproduce this as suggested by Peter is:
 - boot machine
 - offline all CPUs except one
 - taskset -p ffffffff $$
 - online all CPUs

Fix this by allowing cpuset_cpus_allowed() to return a wider mask that
includes offline CPUs for those tasks that are in the top cpuset. For
tasks not in the top cpuset, the old rule applies and only online CPUs
will be returned in the mask since hotplug events will update their
cpumasks accordingly.

Fixes: 8f9ea86fdf ("sched: Always preserve the user requested cpumask")
Reported-by: Will Deacon <will@kernel.org>
Originally-from: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Will Deacon <will@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-02-06 10:15:08 -10:00
Waiman Long e5ae880384 cgroup/cpuset: Fix wrong check in update_parent_subparts_cpumask()
It was found that the check to see if a partition could use up all
the cpus from the parent cpuset in update_parent_subparts_cpumask()
was incorrect. As a result, it is possible to leave parent with no
effective cpu left even if there are tasks in the parent cpuset. This
can lead to system panic as reported in [1].

Fix this probem by updating the check to fail the enabling the partition
if parent's effective_cpus is a subset of the child's cpus_allowed.

Also record the error code when an error happens in update_prstate()
and add a test case where parent partition and child have the same cpu
list and parent has task. Enabling partition in the child will fail in
this case.

[1] https://www.spinics.net/lists/cgroups/msg36254.html

Fixes: f0af1bfc27 ("cgroup/cpuset: Relax constraints to partition & cpus changes")
Cc: stable@vger.kernel.org # v6.1
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-31 12:14:02 -10:00
Randy Dunlap 32a47817d0 cgroup/cpuset: fix a few kernel-doc warnings & coding style
Fix kernel-doc notation warnings:

kernel/cgroup/cpuset.c:1309: warning: Excess function parameter 'cpuset' description in 'update_parent_subparts_cpumask'
kernel/cgroup/cpuset.c:3909: warning: expecting prototype for cpuset_mem_spread_node(). Prototype was for cpuset_spread_node() instead

Also drop a blank line before EXPORT_SYMBOL_GPL() to be consistent
with kernel coding style.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-10 12:28:17 -10:00
Daniel Vacek 21786e5cb3 cgroup/cpuset: no need to explicitly init a global static variable
cpuset_rwsem is a static variable defined with DEFINE_STATIC_PERCPU_RWSEM().
It's initialized at build time and so there's no need for explicit runtime
init leaking one percpu int.

Signed-off-by: Daniel Vacek <neelx@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-04 12:19:17 -10:00
Linus Torvalds e2ca6ba6ba MM patches for 6.2-rc1.
- More userfaultfs work from Peter Xu.
 
 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying.
 
 - Some filemap cleanups from Vishal Moola.
 
 - David Hildenbrand added the ability to selftest anon memory COW handling.
 
 - Some cpuset simplifications from Liu Shixin.
 
 - Addition of vmalloc tracing support by Uladzislau Rezki.
 
 - Some pagecache folioifications and simplifications from Matthew Wilcox.
 
 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it.
 
 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.  This series shold have been in the
   non-MM tree, my bad.
 
 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages.
 
 - DAMON cleanups and tuneups from SeongJae Park
 
 - Tony Luck fixed the handling of COW faults against poisoned pages.
 
 - Peter Xu utilized the PTE marker code for handling swapin errors.
 
 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient.
 
 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand.
 
 - zram support for multiple compression streams from Sergey Senozhatsky.
 
 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway.
 
 - Mel Gorman altered the page allocator so that local IRQs can remnain
   enabled during per-cpu page allocations.
 
 - Vishal Moola removed the try_to_release_page() wrapper.
 
 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache.
 
 - David Hildenbrand did some cleanup and repair work on KSM COW
   breaking.
 
 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend.
 
 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range().
 
 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
   Chen.
 
 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests.  Better, but still not perfect.
 
 - Christoph Hellwig has removed the .writepage() method from several
   filesystems.  They only need .writepages().
 
 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting.
 
 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines.
 
 - Many singleton patches, as usual.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA
 jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml
 CodAgiA51qwzId3GRytIo/tfWZSezgA=
 =d19R
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - More userfaultfs work from Peter Xu

 - Several convert-to-folios series from Sidhartha Kumar and Huang Ying

 - Some filemap cleanups from Vishal Moola

 - David Hildenbrand added the ability to selftest anon memory COW
   handling

 - Some cpuset simplifications from Liu Shixin

 - Addition of vmalloc tracing support by Uladzislau Rezki

 - Some pagecache folioifications and simplifications from Matthew
   Wilcox

 - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use
   it

 - Miguel Ojeda contributed some cleanups for our use of the
   __no_sanitize_thread__ gcc keyword.

   This series should have been in the non-MM tree, my bad

 - Naoya Horiguchi improved the interaction between memory poisoning and
   memory section removal for huge pages

 - DAMON cleanups and tuneups from SeongJae Park

 - Tony Luck fixed the handling of COW faults against poisoned pages

 - Peter Xu utilized the PTE marker code for handling swapin errors

 - Hugh Dickins reworked compound page mapcount handling, simplifying it
   and making it more efficient

 - Removal of the autonuma savedwrite infrastructure from Nadav Amit and
   David Hildenbrand

 - zram support for multiple compression streams from Sergey Senozhatsky

 - David Hildenbrand reworked the GUP code's R/O long-term pinning so
   that drivers no longer need to use the FOLL_FORCE workaround which
   didn't work very well anyway

 - Mel Gorman altered the page allocator so that local IRQs can remnain
   enabled during per-cpu page allocations

 - Vishal Moola removed the try_to_release_page() wrapper

 - Stefan Roesch added some per-BDI sysfs tunables which are used to
   prevent network block devices from dirtying excessive amounts of
   pagecache

 - David Hildenbrand did some cleanup and repair work on KSM COW
   breaking

 - Nhat Pham and Johannes Weiner have implemented writeback in zswap's
   zsmalloc backend

 - Brian Foster has fixed a longstanding corner-case oddity in
   file[map]_write_and_wait_range()

 - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
   Chen

 - Shiyang Ruan has done some work on fsdax, to make its reflink mode
   work better under xfstests. Better, but still not perfect

 - Christoph Hellwig has removed the .writepage() method from several
   filesystems. They only need .writepages()

 - Yosry Ahmed wrote a series which fixes the memcg reclaim target
   beancounting

 - David Hildenbrand has fixed some of our MM selftests for 32-bit
   machines

 - Many singleton patches, as usual

* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
  mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
  mm: mmu_gather: allow more than one batch of delayed rmaps
  mm: fix typo in struct pglist_data code comment
  kmsan: fix memcpy tests
  mm: add cond_resched() in swapin_walk_pmd_entry()
  mm: do not show fs mm pc for VM_LOCKONFAULT pages
  selftests/vm: ksm_functional_tests: fixes for 32bit
  selftests/vm: cow: fix compile warning on 32bit
  selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
  mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
  mm,thp,rmap: fix races between updates of subpages_mapcount
  mm: memcg: fix swapcached stat accounting
  mm: add nodes= arg to memory.reclaim
  mm: disable top-tier fallback to reclaim on proactive reclaim
  selftests: cgroup: make sure reclaim target memcg is unprotected
  selftests: cgroup: refactor proactive reclaim code to reclaim_until()
  mm: memcg: fix stale protection of reclaim target memcg
  mm/mmap: properly unaccount memory on mas_preallocate() failure
  omfs: remove ->writepage
  jfs: remove ->writepage
  ...
2022-12-13 19:29:45 -08:00
Kamalesh Babulal 0a2cafe6c7 cgroup/cpuset: Improve cpuset_css_alloc() description
Change the function argument in the description of cpuset_css_alloc()
from 'struct cgroup' -> 'struct cgroup_subsys_state'.  The change to the
argument type was introduced by commit eb95419b02 ("cgroup: pass
around cgroup_subsys_state instead of cgroup in subsystem methods").
Also, add more information to its description.

Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-11-22 11:11:47 -10:00
Waiman Long 7fd4da9c15 cgroup/cpuset: Optimize cpuset_attach() on v2
It was found that with the default hierarchy, enabling cpuset in the
child cgroups can trigger a cpuset_attach() call in each of the child
cgroups that have tasks with no change in effective cpus and mems. If
there are many processes in those child cgroups, it will burn quite a
lot of cpu cycles iterating all the tasks without doing useful work.

Optimizing this case by comparing between the old and new cpusets and
skip useless update if there is no change in effective cpus and mems.
Also mems_allowed are less likely to be changed than cpus_allowed. So
skip changing mm if there is no change in effective_mems and
CS_MEMORY_MIGRATE is not set.

By inserting some instrumentation code and running a simple command in
a container 200 times in a cgroup v2 system, it was found that all the
cpuset_attach() calls are skipped (401 times in total) as there was no
change in effective cpus and mems.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-11-14 11:45:56 -10:00
Waiman Long 18f9a4d475 cgroup/cpuset: Skip spread flags update on v2
Cpuset v2 has no spread flags to set. So we can skip spread
flags update if cpuset v2 is being used. Also change the name to
cpuset_update_task_spread_flags() to indicate that there are multiple
spread flags.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-11-14 11:45:51 -10:00
Liu Shixin 1eeaa4fd39 memory: move hotplug memory notifier priority to same file for easy sorting
The priority of hotplug memory callback is defined in a different file. 
And there are some callers using numbers directly.  Collect them together
into include/linux/memory.h for easy reading.  This allows us to sort
their priorities more intuitively without additional comments.

Link: https://lkml.kernel.org/r/20220923033347.3935160-9-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: zefan li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 17:37:17 -08:00
Liu Shixin f9e60becee cgroup/cpuset: use hotplug_memory_notifier() directly
Patch series "mm: Use hotplug_memory_notifier() instead of
register_hotmemory_notifier()", v4.

Commit f02c696800 ("include/linux/memory.h: implement
register_hotmemory_notifier()") introduced register_hotmemory_notifier()
to avoid a compile problem with gcc-4.4.4:

    When CONFIG_MEMORY_HOTPLUG=n, we don't want the memory-hotplug notifier
    handlers to be included in the .o files, for space reasons.
    
    The existing hotplug_memory_notifier() tries to handle this but testing
    with gcc-4.4.4 shows that it doesn't work - the hotplug functions are
    still present in the .o files.

Since commit 76ae847497 ("Documentation: raise minimum supported
version of GCC to 5.1") has already updated the minimum gcc version to
5.1.  The previous problem mentioned in f02c696800 does not exist.  So
we can now revert to use hotplug_memory_notifier() directly rather than
register_hotmemory_notifier().

In the last patch, we move all hotplug memory notifier priority to same
file for easy sorting.


This patch (of 8):

Commit 76ae847497 ("Documentation: raise minimum supported version of
GCC to 5.1") updated the minimum gcc version to 5.1.  So the problem
mentioned in f02c696800 ("include/linux/memory.h: implement
register_hotmemory_notifier()") no longer exist.  So we can now switch to
use hotplug_memory_notifier() directly rather than
register_hotmemory_notifier().

Link: https://lkml.kernel.org/r/20220923033347.3935160-1-liushixin2@huawei.com
Link: https://lkml.kernel.org/r/20220923033347.3935160-2-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: zefan li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 17:37:15 -08:00
Jiapeng Chong c478bd8836 cgroup/cpuset: remove unreachable code
The function sched_partition_show cannot execute seq_puts, delete the
invalid code.

kernel/cgroup/cpuset.c:2849 sched_partition_show() warn: ignoring unreachable code.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2087
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-07 05:30:29 -10:00
Waiman Long d7c8142d5a cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule
Currently, changes in "cpust.cpus" of a partition root is not allowed if
it violates the sibling cpu exclusivity rule when the check is done
in the validate_change() function. That is inconsistent with the
other cpuset changes that are always allowed but may make a partition
invalid.

Update the cpuset code to allow cpumask change even if it violates the
sibling cpu exclusivity rule, but invalidate the partition instead
just like the other changes. However, other sibling partitions with
conflicting cpumask will also be invalidated in order to not violating
the exclusivity rule. This behavior is specific to this partition
rule violation.

Note that a previous commit has made sibling cpu exclusivity rule check
the last check of validate_change(). So if -EINVAL is returned, we can
be sure that sibling cpu exclusivity rule violation is the only rule
that is broken.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:28 -10:00
Waiman Long 74027a6535 cgroup/cpuset: Relocate a code block in validate_change()
This patch moves down the exclusive cpu and memory check in
validate_change(). There is no functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:28 -10:00
Waiman Long 7476a636d3 cgroup/cpuset: Show invalid partition reason string
There are a number of different reasons which can cause a partition to
become invalid. A user seeing an invalid partition may not know exactly
why. To help user to get a better understanding of the underlying reason,
The cpuset.cpus.partition control file, when read, will now report the
reason why a partition become invalid. When a partition does become
invalid, reading the control file will show "root invalid (<reason>)"
where <reason> is a string that describes why the partition is invalid.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Waiman Long f28e22441f cgroup/cpuset: Add a new isolated cpus.partition type
Cpuset v1 uses the sched_load_balance control file to determine if load
balancing should be enabled.  Cpuset v2 gets rid of sched_load_balance
as its use may require disabling load balancing at cgroup root.

For workloads that require very low latency like DPDK, the latency
jitters caused by periodic load balancing may exceed the desired
latency limit.

When cpuset v2 is in use, the only way to avoid this latency cost is to
use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
the kernel boot, however, there is no way to add or remove CPUs from
this isolated set. For workloads that are more dynamic in nature, that
means users have to provision enough CPUs for the worst case situation
resulting in excess idle CPUs.

To address this issue for cpuset v2, a new cpuset.cpus.partition type
"isolated" is added which allows the creation of a cpuset partition
without load balancing. This will allow system administrators to
dynamically adjust the size of isolated partition to the current need
of the workload without rebooting the system.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Waiman Long f0af1bfc27 cgroup/cpuset: Relax constraints to partition & cpus changes
Currently, enabling a partition root is only allowed if all the
constraints of a valid partition are satisfied. Even changes to
"cpuset.cpus" may not be allowed in some cases. Moreover, there are
limits to changes made to a parent cpuset if it is a valid partition
root. This is contrary to the general cgroup v2 philosophy.

This patch relaxes the constraints of changing the state of "cpuset.cpus"
and "cpuset.cpus.partition". Now all valid changes ("member" or "root")
to "cpuset.cpus.partition" are allowed even if there are child cpusets
underneath it.

Trying to make a cpuset a partition root, however, will cause its state
to become invalid if the following constraints of a valid partition
root are not satisfied.

 1) The "cpuset.cpus" is non-empty and exclusive.
 2) The parent cpuset is a valid partition root.
 3) The "cpuset.cpus" overlaps parent's "cpuset.cpus".

Similarly, almost all changes to "cpuset.cpus" are allowed with the
exception that if the underlying CS_CPU_EXCLUSIVE flag is set, the
exclusivity rule will still apply.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Waiman Long e2d59900d9 cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
Currently, a partition root cannot have empty "cpuset.cpus.effective".
As a result, a parent partition root cannot distribute out all its
CPUs to child partitions with no CPUs left. However in most cases,
there shouldn't be any tasks associated with intermediate nodes of the
default hierarchy. So the current rule is too restrictive and can waste
valuable CPU resource.

To address this issue, we are now allowing a partition to have empty
"cpuset.cpus.effective" as long as it has no task. Since cpuset is
threaded, no-internal-process rule does not apply. So it is possible
to have tasks in a partition root with child sub-partitions even though
that should be uncommon.

A parent partition with no task can now have all its CPUs distributed out
to its child partitions. The top cpuset always have some house-keeping
tasks running and so its list of effective cpu can't be empty.

Once a partition with empty "cpuset.cpus.effective" is formed, no
new task can be moved into it until "cpuset.cpus.effective" becomes
non-empty.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Waiman Long 18065ebe9b cgroup/cpuset: Miscellaneous cleanups & add helper functions
The partition root state (PRS) macro names do not currently match the
external names. Change them to match the external names and add helper
functions to read or change the state.

Shorten the cpuset argument of update_parent_subparts_cpumask() to cs
to match other cpuset functions.

Remove the new_prs argument from notify_partition_change() as the
cs->partition_root_state has already been set to new_prs before it
is called.

There is no functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Waiman Long ec5fbdfb99 cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset
Previously, update_tasks_cpumask() is not supposed to be called with
top cpuset. With cpuset partition that takes CPUs away from the top
cpuset, adjusting the cpus_mask of the tasks in the top cpuset is
necessary. Percpu kthreads, however, are ignored.

Fixes: ee8dde0cd2 ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-09-04 10:47:27 -10:00
Tejun Heo 4f7e723643 cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock
Bringing up a CPU may involve creating and destroying tasks which requires
read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
cpus_read_lock(). However, cpuset's ->attach(), which may be called with
thredagroup_rwsem write-locked, also wants to disable CPU hotplug and
acquires cpus_read_lock(), leading to a deadlock.

Fix it by guaranteeing that ->attach() is always called with CPU hotplug
disabled and removing cpus_read_lock() call from cpuset_attach().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-and-tested-by: Imran Khan <imran.f.khan@oracle.com>
Reported-and-tested-by: Xuewen Yan <xuewen.yan@unisoc.com>
Fixes: 05c7b7a92c ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
Cc: stable@vger.kernel.org # v5.17+
2022-08-17 07:36:05 -10:00
Waiman Long b6e8d40d43 sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
that the cpuset will just use the effective CPUs of its parent. So
cpuset_can_attach() can call task_can_attach() with an empty mask.
This can lead to cpumask_any_and() returns nr_cpu_ids causing the call
to dl_bw_of() to crash due to percpu value access of an out of bound
CPU value. For example:

	[80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
	  :
	[80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
	  :
	[80468.207946] Call Trace:
	[80468.208947]  cpuset_can_attach+0xa0/0x140
	[80468.209953]  cgroup_migrate_execute+0x8c/0x490
	[80468.210931]  cgroup_update_dfl_csses+0x254/0x270
	[80468.211898]  cgroup_subtree_control_write+0x322/0x400
	[80468.212854]  kernfs_fop_write_iter+0x11c/0x1b0
	[80468.213777]  new_sync_write+0x11f/0x1b0
	[80468.214689]  vfs_write+0x1eb/0x280
	[80468.215592]  ksys_write+0x5f/0xe0
	[80468.216463]  do_syscall_64+0x5c/0x80
	[80468.224287]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
to be used by tasks within the cpuset anyway.

Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
reflect the change. In addition, a check is added to task_can_attach()
to guard against the possibility that cpumask_any_and() may return a
value >= nr_cpu_ids.

Fixes: 7f51412a41 ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
2022-08-03 10:34:26 +02:00
Waiman Long 2685027fca cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
There are 3 places where the cpu and node masks of the top cpuset can
be initialized in the order they are executed:
 1) start_kernel -> cpuset_init()
 2) start_kernel -> cgroup_init() -> cpuset_bind()
 3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp()

The first cpuset_init() call just sets all the bits in the masks.
The second cpuset_bind() call sets cpus_allowed and mems_allowed to the
default v2 values. The third cpuset_init_smp() call sets them back to
v1 values.

For systems with cgroup v2 setup, cpuset_bind() is called once.  As a
result, cpu and memory node hot add may fail to update the cpu and node
masks of the top cpuset to include the newly added cpu or node in a
cgroup v2 environment.

For systems with cgroup v1 setup, cpuset_bind() is called again by
rebind_subsystem() when the v1 cpuset filesystem is mounted as shown
in the dmesg log below with an instrumented kernel.

  [    2.609781] cpuset_bind() called - v2 = 1
  [    3.079473] cpuset_init_smp() called
  [    7.103710] cpuset_bind() called - v2 = 0

smp_init() is called after the first two init functions.  So we don't
have a complete list of active cpus and memory nodes until later in
cpuset_init_smp() which is the right time to set up effective_cpus
and effective_mems.

To fix this cgroup v2 mask setup problem, the potentially incorrect
cpus_allowed & mems_allowed setting in cpuset_init_smp() are removed.
For cgroup v2 systems, the initial cpuset_bind() call will set the masks
correctly.  For cgroup v1 systems, the second call to cpuset_bind()
will do the right setup.

cc: stable@vger.kernel.org
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-05-05 08:57:00 -10:00
Linus Torvalds 2fce7ea0e0 Merge branch 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "All trivial cleanups without meaningful behavior changes"

* 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: cleanup comments
  cgroup: Fix cgroup_can_fork() and cgroup_post_fork() kernel-doc comment
  cgroup: rstat: retrieve current bstat to delta directly
  cgroup: rstat: use same convention to assign cgroup_base_stat
2022-03-23 12:43:35 -07:00
Ingo Molnar ccdbf33c23 Linux 5.17-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmIuUskeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGCFkH/2n3mpGXuITp0ZXE
 TNrpbdZOof5SgLw+w7THswXuo6m5yRGNKQs9fvIvDD8Vf7/OdQQfPOmF1cIE5+nk
 wcz6aHKbdrok8Jql2qjJqWXZ5xbGj6qywg3zZrwOUsCKFP5p+AjBJcmZOsvQHjSp
 ASODy1moOlK+nO52TrMaJw74a8xQPmQiNa+T2P+FedEYjlcRH/c7hLJ7GEnL6+cC
 /R4bATZq3tiInbTBlkC0hR0iVNgRXwXNyv9PEXrYYYHnekh8G1mgSNf06iejLcsG
 aAYsW9NyPxu8zPhhHNx79K9o8BMtxGD4YQpsfdfIEnf9Q3euqAKe2evRWqHHlDms
 RuSCtsc=
 =M9Nc
 -----END PGP SIGNATURE-----

Merge tag 'v5.17-rc8' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-15 10:28:12 +01:00
Tom Rix f9da322e86 cgroup: cleanup comments
for spdx, add a space before //

replacements
judgement to judgment
transofrmed to transformed
partitition to partition
histrical to historical
migratecd to migrated

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-03-13 19:19:27 -10:00
Jiapeng Chong c70cd039f1 cpuset: Fix kernel-doc
Fix the following W=1 kernel warnings:

kernel/cgroup/cpuset.c:3718: warning: expecting prototype for
cpuset_memory_pressure_bump(). Prototype was for
__cpuset_memory_pressure_bump() instead.

kernel/cgroup/cpuset.c:3568: warning: expecting prototype for
cpuset_node_allowed(). Prototype was for __cpuset_node_allowed()
instead.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-22 09:49:49 -10:00
Ingo Molnar 6255b48aeb Linux 5.17-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmISrYgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg20IAKDZr7rfSHBopjQV
 Cocw744tom0XuxpvSZpp2GGOOXF+tkswcNNaRIrbGOl1mkyxA7eBZCTMpDeDS9aQ
 wB0D0Gxx8QBAJp4KgB1W7TB+hIGes/rs8Ve+6iO4ulLLdCVWX/q2boI0aZ7QX9O9
 qNi8OsoZQtk6falRvciZFHwV5Av1p2Sy1AW57udQ7DvJ4H98AfKf1u8/z208WWW8
 1ixC+qJxQcUcM9vI+7P9Tt7NbFSKv8SvAmqjFY7P+DxQAsVw6KXoqVXykDzeOv0t
 fUNOE/t0oFZafwtn8h7KBQnwS9lH03+3KkslVZs+iMFyUj/Bar+NVVyKoDhWXtVg
 /PuMhEg=
 =eU1o
 -----END PGP SIGNATURE-----

Merge tag 'v5.17-rc5' into sched/core, to resolve conflicts

New conflicts in sched/core due to the following upstream fixes:

  44585f7bc0 ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n")
  a06247c680 ("psi: Fix uaf issue when psi trigger is destroyed while being polled")

Conflicts:
	include/linux/psi_types.h
	kernel/sched/psi.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-02-21 11:53:51 +01:00
Frederic Weisbecker 04d4e665a6 sched/isolation: Use single feature type while referring to housekeeping cpumask
Refer to housekeeping APIs using single feature types instead of flags.
This prevents from passing multiple isolation features at once to
housekeeping interfaces, which soon won't be possible anymore as each
isolation features will have their own cpumask.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
2022-02-16 15:57:55 +01:00
Zhang Qiao 05c7b7a92c cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
As previously discussed(https://lkml.org/lkml/2022/1/20/51),
cpuset_attach() is affected with similar cpu hotplug race,
as follow scenario:

     cpuset_attach()				cpu hotplug
    ---------------------------            ----------------------
    down_write(cpuset_rwsem)
    guarantee_online_cpus() // (load cpus_attach)
					sched_cpu_deactivate
					  set_cpu_active()
					  // will change cpu_active_mask
    set_cpus_allowed_ptr(cpus_attach)
      __set_cpus_allowed_ptr_locked()
       // (if the intersection of cpus_attach and
         cpu_active_mask is empty, will return -EINVAL)
    up_write(cpuset_rwsem)

To avoid races such as described above, protect cpuset_attach() call
with cpu_hotplug_lock.

Fixes: be367d0992 ("cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time")
Cc: stable@vger.kernel.org # v2.6.32+
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-14 09:48:04 -10:00
Waiman Long 2bdfd2825c cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
It was found that a "suspicious RCU usage" lockdep warning was issued
with the rcu_read_lock() call in update_sibling_cpumasks().  It is
because the update_cpumasks_hier() function may sleep. So we have
to release the RCU lock, call update_cpumasks_hier() and reacquire
it afterward.

Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
instead of stating that in the comment.

Fixes: 4716909cc5 ("cpuset: Track cpusets that use parent's effective_cpus")
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-02-03 05:59:01 -10:00
Tianchen Ding c80d401c52 cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
subparts_cpus should be limited as a subset of cpus_allowed, but it is
updated wrongly by using cpumask_andnot(). Use cpumask_and() instead to
fix it.

Fixes: ee8dde0cd2 ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-26 06:49:30 -10:00
Michal Koutný d068eebbd4 cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
The commit 1f1562fcd0 ("cgroup/cpuset: Don't let child cpusets
restrict parent in default hierarchy") inteded to relax the check only
on the default hierarchy (or v2 mode) but it dropped the check in v1
too.

This patch returns and separates the legacy-only validations so that
they can be considered only in the v1 mode, which should enforce the old
constraints for the sake of compatibility.

Fixes: 1f1562fcd0 ("cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy")
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-12 11:24:45 -10:00
Qi Zheng d4296faebd cpuset: convert 'allowed' in __cpuset_node_allowed() to be boolean
Convert 'allowed' in __cpuset_node_allowed() to be boolean since the
return types of node_isset() and __cpuset_node_allowed() are both
boolean.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-01-07 12:05:52 -10:00
Waiman Long 1f1562fcd0 cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy
In validate_change(), there is a check since v2.6.12 to make sure that
each of the child cpusets must be a subset of a parent cpuset.  IOW, it
allows child cpusets to restrict what changes can be made to a parent's
"cpuset.cpus". This actually violates one of the core principles of the
default hierarchy where a cgroup higher up in the hierarchy should be
able to change configuration however it sees fit as deligation breaks
down otherwise.

To address this issue, the check is now removed for the default hierarchy
to free parent cpusets from being restricted by child cpusets. The
check will still apply for legacy hierarchy.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-12-13 10:41:02 -10:00
Feng Tang 8ca1b5a498 mm/page_alloc: detect allocation forbidden by cpuset and bail out early
There was a report that starting an Ubuntu in docker while using cpuset
to bind it to movable nodes (a node only has movable zone, like a node
for hotplug or a Persistent Memory node in normal usage) will fail due
to memory allocation failure, and then OOM is involved and many other
innocent processes got killed.

It can be reproduced with command:

    $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"

(where node 4 is a movable node)

  runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
  CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
  Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
  Call Trace:
   dump_stack+0x6b/0x88
   dump_header+0x4a/0x1e2
   oom_kill_process.cold+0xb/0x10
   out_of_memory.part.0+0xaf/0x230
   out_of_memory+0x3d/0x80
   __alloc_pages_slowpath.constprop.0+0x954/0xa20
   __alloc_pages_nodemask+0x2d3/0x300
   pipe_write+0x322/0x590
   new_sync_write+0x196/0x1b0
   vfs_write+0x1c3/0x1f0
   ksys_write+0xa7/0xe0
   do_syscall_64+0x52/0xd0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Mem-Info:
  active_anon:392832 inactive_anon:182 isolated_anon:0
   active_file:68130 inactive_file:151527 isolated_file:0
   unevictable:2701 dirty:0 writeback:7
   slab_reclaimable:51418 slab_unreclaimable:116300
   mapped:45825 shmem:735 pagetables:2540 bounce:0
   free:159849484 free_pcp:73 free_cma:0
  Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
  Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0 0
  Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB

  oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
  Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
  oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
  oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The reason is that in this case, the target cpuset nodes only have
movable zone, while the creation of an OS in docker sometimes needs to
allocate memory in non-movable zones (dma/dma32/normal) like
GFP_HIGHUSER, and the cpuset limit forbids the allocation, then
out-of-memory killing is involved even when normal nodes and movable
nodes both have many free memory.

The OOM killer cannot help to resolve the situation as there is no
usable memory for the request in the cpuset scope.  The only reasonable
measure to take is to fail the allocation right away and have the caller
to deal with it.

So add a check for cases like this in the slowpath of allocation, and
bail out early returning NULL for the allocation.

As page allocation is one of the hottest path in kernel, this check will
hurt all users with sane cpuset configuration, add a static branch check
and detect the abnormal config in cpuset memory binding setup so that
the extra check cost in page allocation is not paid by everyone.

[thanks to Micho Hocko and David Rientjes for suggesting not handling
 it inside OOM code, adding cpuset check, refining comments]

Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:38 -07:00
Waiman Long b94f9ac79a cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem
Since commit 1243dc518c ("cgroup/cpuset: Convert cpuset_mutex to
percpu_rwsem"), cpuset_mutex has been replaced by cpuset_rwsem which is
a percpu rwsem. However, the comments in kernel/cgroup/cpuset.c still
reference cpuset_mutex which are now incorrect.

Change all the references of cpuset_mutex to cpuset_rwsem.

Fixes: 1243dc518c ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-09-13 08:06:17 -10:00
Linus Torvalds 69dc8010b8 Merge branch 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Two cpuset behavior changes:

   - cpuset on cgroup2 is changed to enable memory migration based on
     nodemask by default.

   - A notification is generated when cpuset partition state changes.

  All other patches are minor fixes and cleanups"

* 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Avoid compiler warnings with no subsystems
  cgroup/cpuset: Avoid memory migration when nodemasks match
  cgroup/cpuset: Enable memory migration for cpuset v2
  cgroup/cpuset: Enable event notification when partition state changes
  cgroup: cgroup-v1: clean up kernel-doc notation
  cgroup: Replace deprecated CPU-hotplug functions.
  cgroup/cpuset: Fix violation of cpuset locking rule
  cgroup/cpuset: Fix a partition bug with hotplug
  cgroup/cpuset: Miscellaneous code cleanup
  cgroup: remove cgroup_mount from comments
2021-08-31 15:49:04 -07:00
Nicolas Saenz Julienne 9f72daf7ed cgroup/cpuset: Avoid memory migration when nodemasks match
With the introduction of ee9707e859 ("cgroup/cpuset: Enable memory
migration for cpuset v2") attaching a process to a different cgroup will
trigger a memory migration regardless of whether it's really needed.
Memory migration is an expensive operation, so bypass it if the
nodemasks passed to cpuset_migrate_mm() are equal.

Note that we're not only avoiding the migration work itself, but also a
call to lru_cache_disable(), which triggers and flushes an LRU drain
work on every online CPU.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-25 06:51:51 -10:00