cgroup changes for v6.3-rc1

All the commits are trivial: documentation updates and a trivial code cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCY/Kvjw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGfcZAQDoyqEm807R2iaELEfee0xq/NPwJIaDRnXDP6ZL
 rIEtqAD8CzpvvtKauekMIsCHVq7SuxDae8Hi0a/RImN8e1sy6wY=
 =pUSo
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
 "All the changes are trivial: documentation updates and a trivial code
  cleanup"

* tag 'cgroup-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: fix a few kernel-doc warnings & coding style
  docs: cgroup-v1: use numbered lists for user interface setup
  docs: cgroup-v1: add internal cross-references
  docs: cgroup-v1: make swap extension subsections subsections
  docs: cgroup-v1: use bullet lists for list of stat file tables
  docs: cgroup-v1: move hierarchy of accounting caption
  docs: cgroup-v1: fix footnotes
  docs: cgroup-v1: use code block for locking order schema
  docs: cgroup-v1: wrap remaining admonitions in admonition blocks
  docs: cgroup-v1: replace custom note constructs with appropriate admonition blocks
  cgroup/cpuset: no need to explicitly init a global static variable

Merged by Linus Torvalds on 2023-02-21 10:36:29 -08:00 as commit 8ca8d89b43.
3 changed files with 154 additions and 131 deletions

Documentation/admin-guide/cgroup-v1/cgroups.rst

@ -80,6 +80,8 @@ access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rs
you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup.
.. _cgroups-why-needed:
1.2 Why are cgroups needed ?
----------------------------
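
The hunk above mentions cpusets, which bind the tasks of a cgroup to a set of CPUs and memory nodes. A minimal shell sketch (assuming /sys/fs/cgroup is writable, e.g. a tmpfs as in the memory controller setup later in this merge; the group name "set0" is arbitrary)::

   # mkdir /sys/fs/cgroup/cpuset
   # mount -t cgroup -o cpuset none /sys/fs/cgroup/cpuset
   # mkdir /sys/fs/cgroup/cpuset/set0
   # echo 0-3 > /sys/fs/cgroup/cpuset/set0/cpuset.cpus   # CPUs the group may run on
   # echo 0 > /sys/fs/cgroup/cpuset/set0/cpuset.mems     # memory nodes the group may allocate from
   # echo $$ > /sys/fs/cgroup/cpuset/set0/tasks          # cpuset.cpus and cpuset.mems must be set first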

Documentation/admin-guide/cgroup-v1/memory.rst

@ -2,18 +2,18 @@
Memory Resource Controller
==========================
NOTE:
.. caution::
This document is hopelessly outdated and it asks for a complete
rewrite. It still contains a useful information so we are keeping it
here but make sure to check the current code if you need a deeper
understanding.
NOTE:
.. note::
The Memory Resource Controller has generically been referred to as the
memory controller in this document. Do not confuse memory controller
used here with the memory controller that is used in hardware.
(For editors) In this document:
.. hint::
When we mention a cgroup (cgroupfs's directory) with memory controller,
we call it "memory cgroup". When you see git-log and source code, you'll
see patch's title and function names tend to use "memcg".
@ -23,7 +23,7 @@ Benefits and Purpose of the memory controller
=============================================
The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
from the rest of the system. The article on LWN [12]_ mentions some probable
uses of the memory controller. The memory controller can be used to
a. Isolate an application or a group of applications
@ -55,7 +55,8 @@ Features:
- Root cgroup has no limit controls.
Kernel memory support is a work in progress, and the current version provides
basically functionality. (See Section 2.7)
basically functionality. (See :ref:`section 2.7
<cgroup-v1-memory-kernel-extension>`)
Brief summary of control files.
@ -107,16 +108,16 @@ Brief summary of control files.
==========
The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
controller was posted by Balbir Singh [1]_. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh[2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
for memory control. The first RSS controller was posted by Balbir Singh [2]_
in Feb 2007. Pavel Emelianov [3]_ [4]_ [5]_ has since posted three versions
of the RSS controller. At OLS, at the resource management BoF, everyone
suggested that we handle both page cache and RSS together. Another request was
raised to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].
Cache Control [11]_.
2. Memory Control
=================
@ -147,7 +148,8 @@ specific data structure (mem_cgroup) associated with it.
2.2. Accounting
---------------
::
.. code-block::
:caption: Figure 1: Hierarchy of Accounting
+--------------------+
| mem_cgroup |
@ -167,7 +169,6 @@ specific data structure (mem_cgroup) associated with it.
| | | |
+---------------+ +---------------+
(Figure 1: Hierarchy of Accounting)
Figure 1 shows the important aspects of the controller
@ -221,8 +222,9 @@ behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).
But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
But see :ref:`section 8.2 <cgroup-v1-memory-movable-charges>` when moving a
task to another cgroup, its pages may be recharged to the new cgroup, if
move_charge_at_immigrate has been chosen.
2.4 Swap Extension
--------------------------------------
@ -244,7 +246,8 @@ In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.
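
To make the memory+swap limit concrete, a minimal sketch, assuming the memory cgroup "0" from the user-interface section of this document and a kernel with swap accounting enabled (the memory.memsw.* files only exist in that case)::

   # echo 3G > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
   # echo 3G > /sys/fs/cgroup/memory/0/memory.memsw.limit_in_bytes   # must be >= memory.limit_in_bytes

Setting both limits to the same value means the group cannot use swap to grow past 3G, which is the "prevent bad use of swap" case described above.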
**why 'memory+swap' rather than swap**
2.4.1 why 'memory+swap' rather than swap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
@ -252,7 +255,8 @@ memory+swap. In other words, when we want to limit the usage of swap without
affecting global LRU, memory+swap limit is better than just limiting swap from
an OS point of view.
**What happens when a cgroup hits memory.memsw.limit_in_bytes**
2.4.2. What happens when a cgroup hits memory.memsw.limit_in_bytes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by cgroup routine and file
@ -268,26 +272,26 @@ global VM. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)
cgroup. (See :ref:`10. OOM Control <cgroup-v1-memory-oom-control>` below.)
The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.
NOTE:
Reclaim does not work for the root cgroup, since we cannot set any
limits on the root cgroup.
.. note::
Reclaim does not work for the root cgroup, since we cannot set any
limits on the root cgroup.
Note2:
When panic_on_oom is set to "2", the whole system will panic.
.. note::
When panic_on_oom is set to "2", the whole system will panic.
When oom event notifier is registered, event will be delivered.
(See oom_control section)
(See :ref:`oom_control <cgroup-v1-memory-oom-control>` section)
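
The limit/reclaim/OOM behaviour described above can be observed from userspace; a sketch using files documented elsewhere in this file::

   # cat /sys/fs/cgroup/memory/0/memory.failcnt       # number of times usage hit the limit
   # cat /sys/fs/cgroup/memory/0/memory.oom_control   # reports oom_kill_disable and under_oom
   # cat /proc/sys/vm/panic_on_oom                    # "2" panics the whole system on any OOM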
2.6 Locking
-----------
Lock order is as follows:
Lock order is as follows::
Page lock (PG_locked bit of page->flags)
mm->page_table_lock or split pte_lock
@ -299,6 +303,8 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.
.. _cgroup-v1-memory-kernel-extension:
2.7 Kernel Memory Extension
-----------------------------------------------
@ -367,10 +373,10 @@ U != 0, K < U:
never greater than the total memory, and freely set U at the cost of his
QoS.
WARNING:
In the current implementation, memory reclaim will NOT be
triggered for a cgroup when it hits K while staying below U, which makes
this setup impractical.
.. warning::
In the current implementation, memory reclaim will NOT be triggered for
a cgroup when it hits K while staying below U, which makes this setup
impractical.
U != 0, K >= U:
Since kmem charges will also be fed to the user counter and reclaim will be
@ -381,45 +387,41 @@ U != 0, K >= U:
3. User Interface
=================
3.0. Configuration
------------------
To use the user interface:
a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------
::
1. Enable CONFIG_CGROUPS and CONFIG_MEMCG options
2. Prepare the cgroups (see :ref:`Why are cgroups needed?
<cgroups-why-needed>` for the background information)::
# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory
3.2. Make the new group and move bash into it::
3. Make the new group and move bash into it::
# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks
Since now we're in the 0 cgroup, we can alter the memory limit::
4. Since now we're in the 0 cgroup, we can alter the memory limit::
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
NOTE:
We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
Gibibytes.)
The limit can now be queried::
NOTE:
We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.
# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
4194304
NOTE:
We cannot set limits on the root cgroup any more.
.. note::
We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
Gibibytes.)
::
.. note::
We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.
.. note::
We cannot set limits on the root cgroup any more.
# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
4194304
We can check the usage::
@ -458,6 +460,8 @@ test because it has noise of shared objects/status.
But the above two are testing extreme situations.
Trying usual test under memory controller is always helpful.
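
A "usual test" can be as simple as running a workload from inside the group and watching the counters; a sketch, assuming the shell has already been moved into cgroup "0" as in section 3 (dd is just an example workload)::

   # echo 50M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
   # dd if=/dev/zero of=/tmp/bigfile bs=1M count=200         # page cache generated here is charged to the cgroup
   # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes       # current usage
   # cat /sys/fs/cgroup/memory/0/memory.max_usage_in_bytes   # high-water mark
   # cat /sys/fs/cgroup/memory/0/memory.failcnt              # how many times the limit was hit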
.. _cgroup-v1-memory-test-troubleshoot:
4.1 Troubleshooting
-------------------
@ -470,8 +474,11 @@ terminated by the OOM killer. There are several causes for this:
A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages).
To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
seeing what happens will be helpful.
To know what happens, disabling OOM_Kill as per :ref:`"10. OOM Control"
<cgroup-v1-memory-oom-control>` (below) and seeing what happens will be
helpful.
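
Put together, the advice above amounts to something like the following sketch (writing "1" to memory.oom_control disables the OOM killer for the group, so tasks that hit the limit stall instead of being killed while you investigate)::

   # sync
   # echo 1 > /proc/sys/vm/drop_caches                     # drop clean page cache, including this cgroup's
   # echo 1 > /sys/fs/cgroup/memory/0/memory.oom_control   # disable OOM kill for the group
   # cat /sys/fs/cgroup/memory/0/memory.stat               # see where the charges actually are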
.. _cgroup-v1-memory-test-task-migration:
4.2 Task migration
------------------
@ -482,15 +489,16 @@ remain charged to it, the charge is dropped when the page is freed or
reclaimed.
You can move charges of a task along with task migration.
See 8. "Move charges at task migration"
See :ref:`8. "Move charges at task migration" <cgroup-v1-memory-move-charges>`
4.3 Removing a cgroup
---------------------
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
tasks have migrated away from it. (because we charge against pages, not
against tasks.)
A cgroup can be removed by rmdir, but as discussed in :ref:`sections 4.1
<cgroup-v1-memory-test-troubleshoot>` and :ref:`4.2
<cgroup-v1-memory-test-task-migration>`, a cgroup might have some charge
associated with it, even though all tasks have migrated away from it. (because
we charge against pages, not against tasks.)
We move the stats to parent, and no change on the charge except uncharging
from the child.
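
In practice removal looks like this sketch; the shell has to leave the group first, since rmdir fails while tasks are still attached::

   # echo $$ > /sys/fs/cgroup/memory/tasks   # move the shell back to the root cgroup
   # rmdir /sys/fs/cgroup/memory/0           # child is uncharged, its stats move to the parent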
@ -519,67 +527,66 @@ will be charged as a new owner of it.
5.2 stat file
-------------
memory.stat file includes following statistics
memory.stat file includes following statistics:
per-memory cgroup local status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* per-memory cgroup local status
=============== ===============================================================
cache # of bytes of page cache memory.
rss # of bytes of anonymous and swap cache memory (includes
transparent hugepages).
rss_huge # of bytes of anonymous transparent hugepages.
mapped_file # of bytes of mapped file (includes tmpfs/shmem)
pgpgin # of charging events to the memory cgroup. The charging
event happens each time a page is accounted as either mapped
anon page(RSS) or cache page(Page Cache) to the cgroup.
pgpgout # of uncharging events to the memory cgroup. The uncharging
event happens each time a page is unaccounted from the cgroup.
swap # of bytes of swap usage
dirty # of bytes that are waiting to get written back to the disk.
writeback # of bytes of file/anon cache that are queued for syncing to
disk.
inactive_anon # of bytes of anonymous and swap cache memory on inactive
LRU list.
active_anon # of bytes of anonymous and swap cache memory on active
LRU list.
inactive_file # of bytes of file-backed memory and MADV_FREE anonymous memory(
LazyFree pages) on inactive LRU list.
active_file # of bytes of file-backed memory on active LRU list.
unevictable # of bytes of memory that cannot be reclaimed (mlocked etc).
=============== ===============================================================
=============== ===============================================================
cache # of bytes of page cache memory.
rss # of bytes of anonymous and swap cache memory (includes
transparent hugepages).
rss_huge # of bytes of anonymous transparent hugepages.
mapped_file # of bytes of mapped file (includes tmpfs/shmem)
pgpgin # of charging events to the memory cgroup. The charging
event happens each time a page is accounted as either mapped
anon page(RSS) or cache page(Page Cache) to the cgroup.
pgpgout # of uncharging events to the memory cgroup. The uncharging
event happens each time a page is unaccounted from the
cgroup.
swap # of bytes of swap usage
dirty # of bytes that are waiting to get written back to the disk.
writeback # of bytes of file/anon cache that are queued for syncing to
disk.
inactive_anon # of bytes of anonymous and swap cache memory on inactive
LRU list.
active_anon # of bytes of anonymous and swap cache memory on active
LRU list.
inactive_file # of bytes of file-backed memory and MADV_FREE anonymous
memory (LazyFree pages) on inactive LRU list.
active_file # of bytes of file-backed memory on active LRU list.
unevictable # of bytes of memory that cannot be reclaimed (mlocked etc).
=============== ===============================================================
status considering hierarchy (see memory.use_hierarchy settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* status considering hierarchy (see memory.use_hierarchy settings):
========================= ===================================================
hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy
under which the memory cgroup is
hierarchical_memsw_limit # of bytes of memory+swap limit with regard to
hierarchy under which memory cgroup is.
========================= ===================================================
hierarchical_memory_limit # of bytes of memory limit with regard to
hierarchy
under which the memory cgroup is
hierarchical_memsw_limit # of bytes of memory+swap limit with regard to
hierarchy under which memory cgroup is.
total_<counter> # hierarchical version of <counter>, which in
addition to the cgroup's own value includes the
sum of all hierarchical children's values of
<counter>, i.e. total_cache
========================= ===================================================
total_<counter> # hierarchical version of <counter>, which in
addition to the cgroup's own value includes the
sum of all hierarchical children's values of
<counter>, i.e. total_cache
========================= ===================================================
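
The local and hierarchical counters from the tables above can be read directly from memory.stat; for example::

   # grep -E '^(cache|rss|swap) ' /sys/fs/cgroup/memory/0/memory.stat   # this group's own counters
   # grep '^total_' /sys/fs/cgroup/memory/0/memory.stat                 # own value plus all children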
The following additional stats are dependent on CONFIG_DEBUG_VM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* additional vm parameters (depends on CONFIG_DEBUG_VM):
========================= ========================================
recent_rotated_anon VM internal parameter. (see mm/vmscan.c)
recent_rotated_file VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon VM internal parameter. (see mm/vmscan.c)
recent_scanned_file VM internal parameter. (see mm/vmscan.c)
========================= ========================================
========================= ========================================
recent_rotated_anon VM internal parameter. (see mm/vmscan.c)
recent_rotated_file VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon VM internal parameter. (see mm/vmscan.c)
recent_scanned_file VM internal parameter. (see mm/vmscan.c)
========================= ========================================
Memo:
.. hint::
recent_rotated means recent frequency of LRU rotation.
recent_scanned means recent # of scans to LRU.
showing for better debug please see the code for meanings.
Note:
.. note::
Only anonymous and swap cache memory is listed as part of 'rss' stat.
This should not be confused with the true 'resident set size' or the
amount of physical memory used by the cgroup.
@ -710,13 +717,16 @@ If we want to change this to 1G, we can at any time use::
# echo 1G > memory.soft_limit_in_bytes
NOTE1:
.. note::
Soft limits take effect over a long period of time, since they involve
reclaiming memory for balancing between memory cgroups
NOTE2:
.. note::
It is recommended to set the soft limit always below the hard limit,
otherwise the hard limit will take precedence.
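
Following that recommendation, a sketch that keeps the soft limit below the hard limit::

   # echo 2G > /sys/fs/cgroup/memory/0/memory.limit_in_bytes        # hard limit
   # echo 1G > /sys/fs/cgroup/memory/0/memory.soft_limit_in_bytes   # reclaim target under global memory pressure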
.. _cgroup-v1-memory-move-charges:
8. Move charges at task migration
=================================
@ -735,23 +745,29 @@ If you want to enable it::
# echo (some positive value) > memory.move_charge_at_immigrate
Note:
.. note::
Each bits of move_charge_at_immigrate has its own meaning about what type
of charges should be moved. See 8.2 for details.
Note:
of charges should be moved. See :ref:`section 8.2
<cgroup-v1-memory-movable-charges>` for details.
.. note::
Charges are moved only when you move mm->owner, in other words,
a leader of a thread group.
Note:
.. note::
If we cannot find enough space for the task in the destination cgroup, we
try to make space by reclaiming memory. Task migration may fail if we
cannot make enough space.
Note:
.. note::
It can take several seconds if you move charges much.
And if you want disable it again::
# echo 0 > memory.move_charge_at_immigrate
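
As a concrete sketch of the bits described in section 8.2 below (bit 0 selects anonymous pages, bit 1 selects file pages; it assumes a destination group /sys/fs/cgroup/memory/1 exists and $PID is the thread-group leader being moved)::

   # echo 3 > /sys/fs/cgroup/memory/1/memory.move_charge_at_immigrate   # move both anon and file page charges
   # echo $PID > /sys/fs/cgroup/memory/1/tasks                          # charges follow the moved leader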
.. _cgroup-v1-memory-movable-charges:
8.2 Type of charges which can be moved
--------------------------------------
@ -801,6 +817,8 @@ threshold in any direction.
It's applicable for root and non-root cgroup.
.. _cgroup-v1-memory-oom-control:
10. OOM Control
===============
@ -956,15 +974,16 @@ commented and discussed quite extensively in the community.
References
==========
1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
.. [1] Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
.. [2] Singh, Balbir. Memory Controller (RSS Control),
http://lwn.net/Articles/222762/
3. Emelianov, Pavel. Resource controllers based on process cgroups
.. [3] Emelianov, Pavel. Resource controllers based on process cgroups
https://lore.kernel.org/r/45ED7DEC.7010403@sw.ru
4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
.. [4] Emelianov, Pavel. RSS controller based on process cgroups (v2)
https://lore.kernel.org/r/461A3010.90403@sw.ru
5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
.. [5] Emelianov, Pavel. RSS controller based on process cgroups (v3)
https://lore.kernel.org/r/465D9739.8070209@openvz.org
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control
subsystem (v3), http://lwn.net/Articles/235534/
@ -974,7 +993,8 @@ References
https://lore.kernel.org/r/464D267A.50107@linux.vnet.ibm.com
10. Singh, Balbir. Memory controller v6 test results,
https://lore.kernel.org/r/20070819094658.654.84837.sendpatchset@balbir-laptop
11. Singh, Balbir. Memory controller introduction (v6),
https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop
12. Corbet, Jonathan, Controlling memory use in cgroups,
http://lwn.net/Articles/243795/
.. [11] Singh, Balbir. Memory controller introduction (v6),
https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop
.. [12] Corbet, Jonathan, Controlling memory use in cgroups,
http://lwn.net/Articles/243795/

kernel/cgroup/cpuset.c

@ -1271,7 +1271,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
int turning_on);
/**
* update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset
* @cpuset: The cpuset that requests change in partition root state
* @cs: The cpuset that requests change in partition root state
* @cmd: Partition root state change command
* @newmask: Optional new cpumask for partcmd_update
* @tmp: Temporary addmask and delmask
@ -3286,8 +3286,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
int __init cpuset_init(void)
{
BUG_ON(percpu_init_rwsem(&cpuset_rwsem));
BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
@ -3907,8 +3905,7 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
}
/**
* cpuset_mem_spread_node() - On which node to begin search for a file page
* cpuset_slab_spread_node() - On which node to begin search for a slab page
* cpuset_spread_node() - On which node to begin search for a page
*
* If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
* tasks in a cpuset with is_spread_page or is_spread_slab set),
@ -3932,12 +3929,14 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
* is passed an offline node, it will fall back to the local node.
* See kmem_cache_alloc_node().
*/
static int cpuset_spread_node(int *rotor)
{
return *rotor = next_node_in(*rotor, current->mems_allowed);
}
/**
* cpuset_mem_spread_node() - On which node to begin search for a file page
*/
int cpuset_mem_spread_node(void)
{
if (current->cpuset_mem_spread_rotor == NUMA_NO_NODE)
@ -3947,6 +3946,9 @@ int cpuset_mem_spread_node(void)
return cpuset_spread_node(&current->cpuset_mem_spread_rotor);
}
/**
* cpuset_slab_spread_node() - On which node to begin search for a slab page
*/
int cpuset_slab_spread_node(void)
{
if (current->cpuset_slab_spread_rotor == NUMA_NO_NODE)
@ -3955,7 +3957,6 @@ int cpuset_slab_spread_node(void)
return cpuset_spread_node(&current->cpuset_slab_spread_rotor);
}
EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);
/**