With the recent updates, blk-throttle is finally ready for proper
hierarchy support. Dispatching now honors service_queue->parent_sq
and propagates correctly. The only thing missing is setting
->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
hierarchy.
This patch updates throtl_pd_init() such that service_queues form the
same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
As this concludes proper hierarchy support for blkcg, the shameful
.broken_hierarchy tag is removed from blkio_subsys.
v2: Updated blkio-controller.txt as suggested by Vivek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
This exports the amount of anonymous transparent hugepages for each
memcg via the new "rss_huge" stat in memory.stat. The units are in
bytes.
This is helpful to determine the hugepage utilization for individual
jobs on the system in comparison to rss and opportunities where
MADV_HUGEPAGE may be helpful.
The amount of anonymous transparent hugepages is also included in "rss"
for backwards compatibility.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
"Highlights (1721 non-merge commits, this has to be a record of some
sort):
1) Add 'random' mode to team driver, from Jiri Pirko and Eric
Dumazet.
2) Make it so that any driver that supports configuration of multiple
MAC addresses can provide the forwarding database add and del
calls by providing a default implementation and hooking that up if
the driver doesn't have an explicit set of handlers. From Vlad
Yasevich.
3) Support GSO segmentation over tunnels and other encapsulating
devices such as VXLAN, from Pravin B Shelar.
4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.
5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
Dukkipati.
6) In the PHY layer, allow supporting wake-on-lan in situations where
the PHY registers have to be written for it to be configured.
Use it to support wake-on-lan in mv643xx_eth.
From Michael Stapelberg.
7) Significantly improve firewire IPV6 support, from YOSHIFUJI
Hideaki.
8) Allow multiple packets to be sent in a single transmission using
network coding in batman-adv, from Martin Hundebøll.
9) Add support for T5 cxgb4 chips, from Santosh Rastapur.
10) Generalize the VXLAN forwarding tables so that there is more
flexibility in configurating various aspects of the endpoints.
From David Stevens.
11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver,
from Dmitry Kravkov.
12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo
Neira Ayuso.
13) Start adding networking selftests.
14) In situations of overload on the same AF_PACKET fanout socket, or
per-cpu packet receive queue, minimize drop by distributing the
load to other cpus/fanouts. From Willem de Bruijn and Eric
Dumazet.
15) Add support for new payload offset BPF instruction, from Daniel
Borkmann.
16) Convert several drivers over to mdoule_platform_driver(), from
Sachin Kamat.
17) Provide a minimal BPF JIT image disassembler userspace tool, from
Daniel Borkmann.
18) Rewrite F-RTO implementation in TCP to match the final
specification of it in RFC4138 and RFC5682. From Yuchung Cheng.
19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
you like netlink, so I implemented netlink dumping of netlink
sockets.") From Andrey Vagin.
20) Remove ugly passing of rtnetlink attributes into rtnl_doit
functions, from Thomas Graf.
21) Allow userspace to be able to see if a configuration change occurs
in the middle of an address or device list dump, from Nicolas
Dichtel.
22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
Frederic Sowa.
23) Increase accuracy of packet length used by packet scheduler, from
Jason Wang.
24) Beginning set of changes to make ipv4/ipv6 fragment handling more
scalable and less susceptible to overload and locking contention,
from Jesper Dangaard Brouer.
25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
instead. From Hong Zhiguo.
26) Optimize route usage in IPVS by avoiding reference counting where
possible, from Julian Anastasov.
27) Convert IPVS schedulers to RCU, also from Julian Anastasov.
28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
Eitzenberger.
29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
nfnetlink_log, and nfnetlink_queue. From Gao feng.
30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.
31) Support several new r8169 chips, from Hayes Wang.
32) Support tokenized interface identifiers in ipv6, from Daniel
Borkmann.
33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.
34) Add 802.1ad vlan offload support, from Patrick McHardy.
35) Support mmap() based netlink communication, also from Patrick
McHardy.
36) Support HW timestamping in mlx4 driver, from Amir Vadai.
37) Rationalize AF_PACKET packet timestamping when transmitting, from
Willem de Bruijn and Daniel Borkmann.
38) Bring parity to what's provided by /proc/net/packet socket dumping
and the info provided by netlink socket dumping of AF_PACKET
sockets. From Nicolas Dichtel.
39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
Poirier"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
filter: fix va_list build error
af_unix: fix a fatal race with bit fields
bnx2x: Prevent memory leak when cnic is absent
bnx2x: correct reading of speed capabilities
net: sctp: attribute printl with __printf for gcc fmt checks
netlink: kconfig: move mmap i/o into netlink kconfig
netpoll: convert mutex into a semaphore
netlink: Fix skb ref counting.
net_sched: act_ipt forward compat with xtables
mlx4_en: fix a build error on 32bit arches
Revert "bnx2x: allow nvram test to run when device is down"
bridge: avoid OOPS if root port not found
drivers: net: cpsw: fix kernel warn on cpsw irq enable
sh_eth: use random MAC address if no valid one supplied
3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
tg3: fix to append hardware time stamping flags
unix/stream: fix peeking with an offset larger than data in queue
unix/dgram: fix peeking with an offset larger than data in queue
unix/dgram: peek beyond 0-sized skbs
openvswitch: Remove unneeded ovs_netdev_get_ifindex()
...
Pull cgroup updates from Tejun Heo:
- Fixes and a lot of cleanups. Locking cleanup is finally complete.
cgroup_mutex is no longer exposed to individual controlelrs which
used to cause nasty deadlock issues. Li fixed and cleaned up quite a
bit including long standing ones like racy cgroup_path().
- device cgroup now supports proper hierarchy thanks to Aristeu.
- perf_event cgroup now supports proper hierarchy.
- A new mount option "__DEVEL__sane_behavior" is added. As indicated
by the name, this option is to be used for development only at this
point and generates a warning message when used. Unfortunately,
cgroup interface currently has too many brekages and inconsistencies
to implement a consistent and unified hierarchy on top. The new flag
is used to collect the behavior changes which are necessary to
implement consistent unified hierarchy. It's likely that this flag
won't be used verbatim when it becomes ready but will be enabled
implicitly along with unified hierarchy.
The option currently disables some of broken behaviors in cgroup core
and also .use_hierarchy switch in memcg (will be routed through -mm),
which can be used to make very unusual hierarchy where nesting is
partially honored. It will also be used to implement hierarchy
support for blk-throttle which would be impossible otherwise without
introducing a full separate set of control knobs.
This is essentially versioning of interface which isn't very nice but
at this point I can't see any other options which would allow keeping
the interface the same while moving towards hierarchy behavior which
is at least somewhat sane. The planned unified hierarchy is likely
to require some level of adaptation from userland anyway, so I think
it'd be best to take the chance and update the interface such that
it's supportable in the long term.
Maintaining the existing interface does complicate cgroup core but
shouldn't put too much strain on individual controllers and I think
it'd be manageable for the foreseeable future. Maybe we'll be able
to drop it in a decade.
Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by header the file changes) as per Tejun.
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
cpuset: fix compile warning when CONFIG_SMP=n
cpuset: fix cpu hotplug vs rebuild_sched_domains() race
cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
cgroup: restore the call to eventfd->poll()
cgroup: fix use-after-free when umounting cgroupfs
cgroup: fix broken file xattrs
devcg: remove parent_cgroup.
memcg: force use_hierarchy if sane_behavior
cgroup: remove cgrp->top_cgroup
cgroup: introduce sane_behavior mount option
move cgroupfs_root to include/linux/cgroup.h
cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
cgroup: make cgroup_path() not print double slashes
Revert "cgroup: remove bind() method from cgroup_subsys."
perf: make perf_event cgroup hierarchical
cgroup: implement cgroup_is_descendant()
cgroup: make sure parent won't be destroyed before its children
cgroup: remove bind() method from cgroup_subsys.
devcg: remove broken_hierarchy tag
cgroup: remove cgroup_lock_is_held()
...
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications. The levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: for example you
have three cgroups: A->B->C. Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure. In
this situation, only group C will receive the notification, i.e. groups
A and B will not receive it. This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)
Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features. Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 84cfb6ab48. There
are scheduled changes which make use of the removed callback.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rami Rosen <ramirose@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
The bind() method of cgroup_subsys is not used in any of the
controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio,
devices, perf, hugetlb, cpu and cpuacct)
tj: Removed the entry on ->bind() from
Documentation/cgroups/cgroups.txt. Also updated a couple
paragraphs which were suggesting that dynamic re-binding may be
implemented. It's not gonna.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch adds a new file, Documentation/cgroups/net_cls.txt, with info
about net_cls cgroups, and updates the 00-INDEX accordingly.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This trivial patch removes a word which appears twice in
Documentation/cgroup/cgroup.txt.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch makes exception changes to propagate down in hierarchy respecting
when possible local exceptions.
New exceptions allowing additional access to devices won't be propagated, but
it'll be possible to add an exception to access all of part of the newly
allowed device(s).
New exceptions disallowing access to devices will be propagated down and the
local group's exceptions will be revalidated for the new situation.
Example:
A
/ \
B
group behavior exceptions
A allow "b 8:* rwm", "c 116:1 rw"
B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"
If a new exception is added to group A:
# echo "c 116:* r" > A/devices.deny
it'll propagate down and after revalidating B's local exceptions, the exception
"c 116:2 rwm" will be removed.
In case parent's exceptions change and local exceptions are not allowed anymore,
they'll be deleted.
v7:
- do not allow behavior change when the cgroup has children
- update documentation
v6: fixed issues pointed by Serge Hallyn
- only copy parent's exceptions while propagating behavior if the local
behavior is different
- while propagating exceptions, do not clear and copy parent's: it'd be against
the premise we don't propagate access to more devices
v5: fixed issues pointed by Serge Hallyn
- updated documentation
- not propagating when an exception is written to devices.allow
- when propagating a new behavior, clean the local exceptions list if they're
for a different behavior
v4: fixed issues pointed by Tejun Heo
- separated function to walk the tree and collect valid propagation targets
v3: fixed issues pointed by Tejun Heo
- update documentation
- move css_online/css_offline changes to a new patch
- use cgroup_for_each_descendant_pre() instead of own descendant walk
- move exception_copy rework to a separared patch
- move exception_clean rework to a separated patch
v2: fixed issues pointed by Tejun Heo
- instead of keeping the local settings that won't apply anymore, remove them
Cc: Tejun Heo <tj@kernel.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cpuset no longer nests cgroup_mutex inside cpu_hotplug lock, so
we don't have to release cgroup_mutex before calling css_offline().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:
- The big cfq/blkcg update from Tejun and and Vivek.
- Additional block and writeback tracepoints from Tejun.
- Improvement of the should sort (based on queues) logic in the plug
flushing.
- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.
- Various little fixes.
You'll get two trivial merge conflicts, which should be easy enough to
fix up"
Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").
* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...
With the previous two patches, all cfqg scheduling decisions are based
on vfraction and ready for hierarchy support. The only thing which
keeps the behavior flat is cfqg_flat_parent() which makes vfraction
calculation consider all non-root cfqgs children of the root cfqg.
Replace it with cfqg_parent() which returns the real parent. This
enables full blkcg hierarchy support for cfq-iosched. For example,
consider the following hierarchy.
root
/ \
A:500 B:250
/ \
AA:500 AB:1000
For simplicity, let's say all the leaf nodes have active tasks and are
on service tree. For each leaf node, vfraction would be
AA: (500 / 1500) * (500 / 750) =~ 0.2222
AB: (1000 / 1500) * (500 / 750) =~ 0.4444
B: (250 / 750) =~ 0.3333
and vdisktime will be distributed accordingly. For more detail,
please refer to Documentation/block/cfq-iosched.txt.
v2: cfq-iosched.txt updated to describe group scheduling as suggested
by Vivek.
v3: blkio-controller.txt updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Move the cgroup_event_listener.c tool from Documentation into the new
tools/cgroup directory.
This change involves wiring cgroup_event_listener.c into the tools/
make system so that is can be built with:
$ make tools/cgroup
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
It is useful to know how many charges are still left after a call to
res_counter_uncharge. While it is possible to issue a res_counter_read
after uncharge, this can be racy.
If we need, for instance, to take some action when the counters drop down
to 0, only one of the callers should see it. This is the same semantics
as the atomic variables in the kernel.
Since the current return value is void, we don't need to worry about
anything breaking due to this change: nobody relied on that, and only
users appearing from now on will be checking this value.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.
The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cgroup changes from Tejun Heo:
"A lot of activities on cgroup side. The big changes are focused on
making cgroup hierarchy handling saner.
- cgroup_rmdir() had peculiar semantics - it allowed cgroup
destruction to be vetoed by individual controllers and tried to
drain refcnt synchronously. The vetoing never worked properly and
caused good deal of contortions in cgroup. memcg was the last
reamining user. Michal Hocko removed the usage and cgroup_rmdir()
path has been simplified significantly. This was done in a
separate branch so that the memcg people can base further memcg
changes on top.
- The above allowed cleaning up cgroup lifecycle management and
implementation of generic cgroup iterators which are used to
improve hierarchy support.
- cgroup_freezer updated to allow migration in and out of a frozen
cgroup and handle hierarchy. If a cgroup is frozen, all descendant
cgroups are frozen.
- netcls_cgroup and netprio_cgroup updated to handle hierarchy
properly.
- Various fixes and cleanups.
- Two merge commits. One to pull in memcg and rmdir cleanups (needed
to build iterators). The other pulled in cgroup/for-3.7-fixes for
device_cgroup fixes so that further device_cgroup patches can be
stacked on top."
Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
touching code close to commit 2ef37d3fe4 ("memcg: Simplify
mem_cgroup_force_empty_list error handling") in for-3.8)
* 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
cgroup: update Documentation/cgroups/00-INDEX
cgroup_rm_file: don't delete the uncreated files
cgroup: remove subsystem files when remounting cgroup
cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
cgroup: warn about broken hierarchies only after css_online
cgroup: list_del_init() on removed events
cgroup: fix lockdep warning for event_control
cgroup: move list add after list head initilization
netprio_cgroup: allow nesting and inherit config on cgroup creation
netprio_cgroup: implement netprio[_set]_prio() helpers
netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
netprio_cgroup: reimplement priomap expansion
netprio_cgroup: shorten variable names in extend_netdev_table()
netprio_cgroup: simplify write_priomap()
netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
cgroup: remove obsolete guarantee from cgroup_task_migrate.
cgroup: add cgroup->id
cgroup, cpuset: remove cgroup_subsys->post_clone()
cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
...
mem_cgroup_charge_common() is invoked as the entry point for cgroup limits
charge rather than mem_cgroup_charge(), as the later has been removed for
years. Update the cgroup/memory.txt to reflect this change.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are new files added to cgroup documentation. Let's update the
index file to include the new files.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Inherit netprio configuration from ->css_online(), allow nesting and
remove .broken_hierarchy marking. This makes netprio_cgroup's
behavior match netcls_cgroup's.
Note that this patch changes userland-visible behavior. Nesting is
allowed and the first level cgroups below the root cgroup behave
differently - they inherit priorities from the root cgroup on creation
instead of starting with 0. This is unfortunate but not doing so is
much crazier.
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-and-Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: David S. Miller <davem@davemloft.net>
Currently CGRP_CPUSET_CLONE_CHILDREN triggers ->post_clone(). Now
that clone_children is cpuset specific, there's no reason to have this
rather odd option activation mechanism in cgroup core. cpuset can
check the flag from its ->css_allocate() and take the necessary
action.
Move cpuset_post_clone() logic to the end of cpuset_css_alloc() and
remove cgroup_subsys->post_clone().
Loosely based on Glauber's "generalize post_clone into post_create"
patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Glauber Costa <glommer@parallels.com>
Original-patch: <1351686554-22592-2-git-send-email-glommer@parallels.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
clone_children is only meaningful for cpuset and will stay that way.
Rename the flag to reflect that and update documentation. Also, drop
clone_children() wrapper in cgroup.c. The thin wrapper is used only a
few times and one of them will go away soon.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
Rename cgroup_subsys css lifetime related callbacks to better describe
what their roles are. Also, update documentation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
oom_badness() takes a totalpages argument which says how many pages are
available and it uses it as a base for the score calculation. The value
is calculated by mem_cgroup_get_limit which considers both limit and
total_swap_pages (resp. memsw portion of it).
This is usually correct but since fe35004fbf ("mm: avoid swapping out
with swappiness==0") we do not swap when swappiness is 0 which means
that we cannot really use up all the totalpages pages. This in turn
confuses oom score calculation if the memcg limit is much smaller than
the available swap because the used memory (capped by the limit) is
negligible comparing to totalpages so the resulting score is too small
if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
A wrong process might be selected as result.
The problem can be worked around by checking mem_cgroup_swappiness==0
and not considering swap at all in such a case.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Up until now, cgroup_freezer didn't implement hierarchy properly.
cgroups could be arranged in hierarchy but it didn't make any
difference in how each cgroup_freezer behaved. They all operated
separately.
This patch implements proper hierarchy support. If a cgroup is
frozen, all its descendants are frozen. A cgroup is thawed iff it and
all its ancestors are THAWED. freezer.self_freezing shows the current
freezing state for the cgroup itself. freezer.parent_freezing shows
whether the cgroup is freezing because any of its ancestors is
freezing.
freezer_post_create() locks the parent and new cgroup and inherits the
parent's state and freezer_change_state() applies new state top-down
using cgroup_for_each_descendant_pre() which guarantees that no child
can escape its parent's state. update_if_frozen() uses
cgroup_for_each_descendant_post() to propagate frozen states
bottom-up.
Synchronization could be coarser and easier by using a single mutex to
protect all hierarchy operations. Finer grained approach was used
because it wasn't too difficult for cgroup_freezer and I think it's
beneficial to have an example implementation and cgroup_freezer is
rather simple and can serve a good one.
As this makes cgroup_freezer properly hierarchical,
freezer_subsys.broken_hierarchy marking is removed.
Note that this patch changes userland visible behavior - freezing a
cgroup now freezes all its descendants too. This behavior change is
intended and has been warned via .broken_hierarchy.
v2: Michal spotted a bug in freezer_change_state() - descendants were
inheriting from the wrong ancestor. Fixed.
v3: Documentation/cgroups/freezer-subsystem.txt updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
While reading through Documentation/cgroups/memory.txt, I found a number
of minor wordos and typos. The patch below is a conservative handling of
some of these: it provides just a number of "obviously correct" fixes to
the English that improve the readability of the document somewhat.
Obviously some more significant fixes need to be made to the document, but
some of those may not be in the "obvious correct" category.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While reading through Documentation/cgroups/cgroups.txt, I found a
number of minor wordos and typos. The patch below is a conservative
handling of some of these: it provides just a number of "obviously
correct" fixes to the English that improve the readability
of the document somewhat. Obviously some more significant
fixes could be made to the document, but some of those
may not be in the "obviously correct" category.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Remounting support will be removed in the future (see
feature-removal-schedule.txt). Discourage users from using it.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rob Landley <rob@landley.net>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This function was removed by 86f82d5618.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rob Landley <rob@landley.net>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
It was introduced for memcg to iterate cgroup hierarchy without
holding cgroup_mutex, but soon after that it was replaced with
a lockless way in memcg.
No one used hierarchy_mutex since that, so remove it.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Presently, at removal of cgroup, ->pre_destroy() is called and moves
charges to the parent cgroup. A major reason for returning -EBUSY from
->pre_destroy() is that the 'moving' hits the parent's resource
limitation. It happens only when use_hierarchy=0.
Considering use_hierarchy=0, all cgroups should be flat. So, no one
cannot justify moving charges to parent...parent and children are in flat
configuration, not hierarchical.
This patch modifes the code to move charges to the root cgroup at
rmdir/force_empty if use_hierarchy==0. This will much simplify rmdir()
and reduce error in ->pre_destroy.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When killing a res_counter which is a child of other counter, we need to
do
res_counter_uncharge(child, xxx)
res_counter_charge(parent, xxx)
This is not atomic and wastes CPU. This patch adds
res_counter_uncharge_until(). This function's uncharge propagates to
ancestors until specified res_counter.
res_counter_uncharge_until(child, parent, xxx)
Now the operation is atomic and efficient.
Signed-off-by: Frederic Weisbecker <fweisbec@redhat.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Glauber Costa <glommer@parallels.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch changes memcg's behavior at task_move().
At task_move(), the kernel scans a task's page table and move the changes
for mapped pages from source cgroup to target cgroup. There has been a
bug at handling shared anonymous pages for a long time.
Before patch:
- The spec says 'shared anonymous pages are not moved.'
- The implementation was 'shared anonymoys pages may be moved'.
If page_mapcount <=2, shared anonymous pages's charge were moved.
After patch:
- The spec says 'all anonymous pages are moved'.
- The implementation is 'all anonymous pages are moved'.
Considering usage of memcg, this will not affect user's experience.
'shared anonymous' pages only exists between a tree of processes which
don't do exec(). Moving one of process without exec() seems not sane.
For example, libcgroup will not be affected by this change. (Anyway, no
one noticed the implementation for a long time...)
Below is a discussion log:
- current spec/implementation are complex
- Now, shared file caches are moved
- It adds unclear check as page_mapcount(). To do correct check,
we should check swap users, etc.
- No one notice this implementation behavior. So, no one get benefit
from the design.
- In general, once task is moved to a cgroup for running, it will not
be moved....
- Finally, we have control knob as memory.move_charge_at_immigrate.
Here is a patch to allow moving shared pages, completely. This makes
memcg simpler and fix current broken code.
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The hierarchical versions of per-memcg counters in memory.stat are all
calculated the same way and are all named total_<counter>.
Documenting the pattern is easier for maintenance than listing each
counter twice.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Ying Han <yinghan@google.com>
Randy Dunlap <rdunlap@xenotime.net>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cgroup updates from Tejun Heo:
"cgroup file type addition / removal is updated so that file types are
added and removed instead of individual files so that dynamic file
type addition / removal can be implemented by cgroup and used by
controllers. blkio controller changes which will come through block
tree are dependent on this. Other changes include res_counter cleanup
and disallowing kthread / PF_THREAD_BOUND threads to be attached to
non-root cgroups.
There's a reported bug with the file type addition / removal handling
which can lead to oops on cgroup umount. The issue is being looked
into. It shouldn't cause problems for most setups and isn't a
security concern."
Fix up trivial conflict in Documentation/feature-removal-schedule.txt
* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
res_counter: Account max_usage when calling res_counter_charge_nofail()
res_counter: Merge res_counter_charge and res_counter_charge_nofail
cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads
cgroup: remove cgroup_subsys->populate()
cgroup: get rid of populate for memcg
cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg
cgroup: make css->refcnt clearing on cgroup removal optional
cgroup: use negative bias on css->refcnt to block css_tryget()
cgroup: implement cgroup_rm_cftypes()
cgroup: introduce struct cfent
cgroup: relocate __d_cgrp() and __d_cft()
cgroup: remove cgroup_add_file[s]()
cgroup: convert memcg controller to the new cftype interface
memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
cgroup: convert all non-memcg controllers to the new cftype interface
cgroup: relocate cftype and cgroup_subsys definitions in controllers
cgroup: merge cft_release_agent cftype array into the base files array
cgroup: implement cgroup_add_cftypes() and friends
cgroup: build list of all cgroups under a given cgroupfs_root
cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
...
These two functions do almost the same thing and duplicate some code.
Merge their implementation into a single common function.
res_counter_charge_locked() takes one more parameter but it doesn't seem
to be used outside res_counter.c yet anyway.
There is no (intended) change in the behaviour.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Li Zefan <lizefan@huawei.com>
In v3.3-rc1, the global LRU was removed in commit 925b7673cc ("mm:
make per-memcg LRU lists exclusive"). The patch fixes up the memcg
docs.
I left the swap session to someone who has better understanding of
'memory+swap'.
Signed-off-by: Ying Han <yinghan@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull trivial tree from Jiri Kosina:
"It's indeed trivial -- mostly documentation updates and a bunch of
typo fixes from Masanari.
There are also several linux/version.h include removals from Jesper."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
kcore: fix spelling in read_kcore() comment
constify struct pci_dev * in obvious cases
Revert "char: Fix typo in viotape.c"
init: fix wording error in mm_init comment
usb: gadget: Kconfig: fix typo for 'different'
Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
writeback: fix typo in the writeback_control comment
Documentation: Fix multiple typo in Documentation
tpm_tis: fix tis_lock with respect to RCU
Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
Doc: Update numastat.txt
qla4xxx: Add missing spaces to error messages
compiler.h: Fix typo
security: struct security_operations kerneldoc fix
Documentation: broken URL in libata.tmpl
Documentation: broken URL in filesystems.tmpl
mtd: simplify return logic in do_map_probe()
mm: fix comment typo of truncate_inode_pages_range
power: bq27x00: Fix typos in comment
...
The argument is not used at all, and it's not necessary, because
a specific callback handler of course knows which subsys it
belongs to.
Now only ->pupulate() takes this argument, because the handlers of
this callback always call cgroup_add_file()/cgroup_add_files().
So we reduce a few lines of code, though the shrinking of object size
is minimal.
16 files changed, 113 insertions(+), 162 deletions(-)
text data bss dec hex filename
5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig
5486170 656987 7039960 13183117 c9288d vmlinux.o
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The two memcg stats pgpgin/pgpgout have different meaning than the ones
in vmstat, which indicates that we picked a bad naming for them.
It might be late to change the stat name, but better documentation is
always helpful.
Signed-off-by: Ying Han <yinghan@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It should be memsw.max_usage_in_bytes. This typo has been there for
a really long time.
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cgroup: fix to allow mounting a hierarchy by name
cgroup: move assignement out of condition in cgroup_attach_proc()
cgroup: Remove task_lock() from cgroup_post_fork()
cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()
cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static
cgroup: only need to check oldcgrp==newgrp once
cgroup: remove redundant get/put of task struct
cgroup: remove redundant get/put of old css_set from migrate
cgroup: Remove unnecessary task_lock before fetching css_set on migration
cgroup: Drop task_lock(parent) on cgroup_fork()
cgroups: remove redundant get/put of css_set from css_set_check_fetched()
resource cgroups: remove bogus cast
cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
cgroup, cpuset: don't use ss->pre_attach()
cgroup: don't use subsys->can_attach_task() or ->attach_task()
cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
cgroup: improve old cgroup handling in cgroup_attach_proc()
cgroup: always lock threadgroup during migration
threadgroup: extend threadgroup_lock() to cover exit and exec
threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
...
Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups:
fix a css_set not found bug in cgroup_attach_proc" that already
mentioned that the bug is fixed (differently) in Tejun's cgroup
patchset. This one, in other words.
This reverts commit e5671dfae5.
After a follow up discussion with Michal, it was agreed it would
be better to leave the kmem controller with just the tcp files,
deferring the behavior of the other general memory.kmem.* files
for a later time, when more caches are controlled. This is because
generic kmem files are not used by tcp accounting and it is
not clear how other slab caches would fit into the scheme.
We are reverting the original commit so we can track the reference.
Part of the patch is kept, because it was used by the later tcp
code. Conflicts are shown in the bottom. init/Kconfig is removed from
the revert entirely.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: Paul Menage <paul@paulmenage.org>
CC: Greg Thelen <gthelen@google.com>
CC: Johannes Weiner <jweiner@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Conflicts:
Documentation/cgroups/memory.txt
mm/memcontrol.c
Signed-off-by: David S. Miller <davem@davemloft.net>
These three methods are no longer used. Kill them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Currently, there's no way to pass multiple tasks to cgroup_subsys
methods necessitating the need for separate per-process and per-task
methods. This patch introduces cgroup_taskset which can be used to
pass multiple tasks and their associated cgroups to cgroup_subsys
methods.
Three methods - can_attach(), cancel_attach() and attach() - are
converted to use cgroup_taskset. This unifies passed parameters so
that all methods have access to all information. Conversions in this
patchset are identical and don't introduce any behavior change.
-v2: documentation updated as per Paul Menage's suggestion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: James Morris <jmorris@namei.org>
This patch introduces kmem.tcp.usage_in_bytes file, living in the
kmem_cgroup filesystem. It is a simple read-only file that displays the
amount of kernel memory currently consumed by the cgroup.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch uses the "tcp.limit_in_bytes" field of the kmem_cgroup to
effectively control the amount of kernel memory pinned by a cgroup.
This value is ignored in the root cgroup, and in all others,
caps the value specified by the admin in the net namespaces'
view of tcp_sysctl_mem.
If namespaces are being used, the admin is allowed to set a
value bigger than cgroup's maximum, the same way it is allowed
to set pretty much unlimited values in a real box.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces memory pressure controls for the tcp
protocol. It uses the generic socket memory pressure code
introduced in earlier patches, and fills in the
necessary data in cg_proto struct.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The goal of this work is to move the memory pressure tcp
controls to a cgroup, instead of just relying on global
conditions.
To avoid excessive overhead in the network fast paths,
the code that accounts allocated memory to a cgroup is
hidden inside a static_branch(). This branch is patched out
until the first non-root cgroup is created. So when nobody
is using cgroups, even if it is mounted, no significant performance
penalty should be seen.
This patch handles the generic part of the code, and has nothing
tcp-specific.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtsu.com>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch lays down the foundation for the kernel memory component
of the Memory Controller.
As of today, I am only laying down the following files:
* memory.independent_kmem_limit
* memory.kmem.limit_in_bytes (currently ignored)
* memory.kmem.usage_in_bytes (always zero)
Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Kirill A. Shutemov <kirill@shutemov.name>
CC: Paul Menage <paul@paulmenage.org>
CC: Greg Thelen <gthelen@google.com>
CC: Johannes Weiner <jweiner@redhat.com>
CC: Michal Hocko <mhocko@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the requisite documentation to explain to new users how net_prio cgroups work
Signed-off-by:Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a typo in Documentation/cgroups/freezer-subsystem.txt.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
- ns cgroup has been removed.
- it's true moving a task to another cgroup can fail.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reclaim decides to skip scanning an active list when the corresponding
inactive list is above a certain size in comparison to leave the assumed
working set alone while there are still enough reclaim candidates around.
The memcg implementation of comparing those lists instead reports whether
the whole memcg is low on the requested type of inactive pages,
considering all nodes and zones.
This can lead to an oversized active list not being scanned because of the
state of the other lists in the memcg, as well as an active list being
scanned while its corresponding inactive list has enough pages.
Not only is this wrong, it's also a scalability hazard, because the global
memory state over all nodes and zones has to be gathered for each memcg
and zone scanned.
Make these calculations purely based on the size of the two LRU lists
that are actually affected by the outcome of the decision.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert the post-3.0 commit 82f9d486e5 ("memcg: add
memory.vmscan_stat").
The implementation of per-memcg reclaim statistics violates how memcg
hierarchies usually behave: hierarchically.
The reclaim statistics are accounted to child memcgs and the parent
hitting the limit, but not to hierarchy levels in between. Usually,
hierarchical statistics are perfectly recursive, with each level
representing the sum of itself and all its children.
Since this exports statistics to userspace, this may lead to confusion
and problems with changing things after the release, so revert it now,
we can try again later.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The commit log of 0ae5e89c60 ("memcg: count the soft_limit reclaim
in...") says it adds scanning stats to memory.stat file. But it doesn't
because we considered we needed to make a concensus for such new APIs.
This patch is a trial to add memory.scan_stat. This shows
- the number of scanned pages(total, anon, file)
- the number of rotated pages(total, anon, file)
- the number of freed pages(total, anon, file)
- the number of elaplsed time (including sleep/pause time)
for both of direct/soft reclaim.
The biggest difference with oringinal Ying's one is that this file
can be reset by some write, as
# echo 0 ...../memory.scan_stat
Example of output is here. This is a result after make -j 6 kernel
under 300M limit.
[kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
[kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
scanned_pages_by_limit 9471864
scanned_anon_pages_by_limit 6640629
scanned_file_pages_by_limit 2831235
rotated_pages_by_limit 4243974
rotated_anon_pages_by_limit 3971968
rotated_file_pages_by_limit 272006
freed_pages_by_limit 2318492
freed_anon_pages_by_limit 962052
freed_file_pages_by_limit 1356440
elapsed_ns_by_limit 351386416101
scanned_pages_by_system 0
scanned_anon_pages_by_system 0
scanned_file_pages_by_system 0
rotated_pages_by_system 0
rotated_anon_pages_by_system 0
rotated_file_pages_by_system 0
freed_pages_by_system 0
freed_anon_pages_by_system 0
freed_file_pages_by_system 0
elapsed_ns_by_system 0
scanned_pages_by_limit_under_hierarchy 9471864
scanned_anon_pages_by_limit_under_hierarchy 6640629
scanned_file_pages_by_limit_under_hierarchy 2831235
rotated_pages_by_limit_under_hierarchy 4243974
rotated_anon_pages_by_limit_under_hierarchy 3971968
rotated_file_pages_by_limit_under_hierarchy 272006
freed_pages_by_limit_under_hierarchy 2318492
freed_anon_pages_by_limit_under_hierarchy 962052
freed_file_pages_by_limit_under_hierarchy 1356440
elapsed_ns_by_limit_under_hierarchy 351386416101
scanned_pages_by_system_under_hierarchy 0
scanned_anon_pages_by_system_under_hierarchy 0
scanned_file_pages_by_system_under_hierarchy 0
rotated_pages_by_system_under_hierarchy 0
rotated_anon_pages_by_system_under_hierarchy 0
rotated_file_pages_by_system_under_hierarchy 0
freed_pages_by_system_under_hierarchy 0
freed_anon_pages_by_system_under_hierarchy 0
freed_file_pages_by_system_under_hierarchy 0
elapsed_ns_by_system_under_hierarchy 0
total_xxxx is for hierarchy management.
This will be useful for further memcg developments and need to be
developped before we do some complicated rework on LRU/softlimit
management.
This patch adds a new struct memcg_scanrecord into scan_control struct.
sc->nr_scanned at el is not designed for exporting information. For
example, nr_scanned is reset frequentrly and incremented +2 at scanning
mapped pages.
To avoid complexity, I added a new param in scan_control which is for
exporting scanning score.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Andrew Bresticker <abrestic@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only the root cpuset has cpuset.memory_pressure_enabled flag,
but not the only one.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Must echo a task id to the cgroups' tasks file, but not to a directory.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All the blkio.throttle.* file names are incorrectly reported without
".throttle" in the documentation. Fix it.
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix format and spelling.
Signed-off-by: Jörg Sommer <joerg@alea.gnuu.de>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to commit 676db4af04 ("cgroupfs: create /sys/fs/cgroup to
mount cgroupfs on") the canonical mountpoint for the cgroup filesystem
is /sys/fs/cgroup. Hence, this should be used in the documentation.
Signed-off-by: Jörg Sommer <joerg@alea.gnuu.de>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
leads to some problems:
* cgroup creation is out-of-control
* cgroup name can conflict when pids are looping
* it is not possible to have a single process handling a lot of
namespaces without falling in a exponential creation time
* we may want to create a namespace without creating a cgroup
The ns_cgroup was replaced by a compatibility flag 'clone_children',
where a newly created cgroup will copy the parent cgroup values.
The userspace has to manually create a cgroup and add a task to
the 'tasks' file.
This patch removes the ns_cgroup as suggested in the following thread:
https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html
The 'cgroup_clone' function is removed because it is no longer used.
This is a userspace-visible change. Commit 45531757b4 ("cgroup: notify
ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
printk warning users that the feature is planned for removal. Since that
time we have heard from XXX users who were affected by this.
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make procs file writable to move all threads by tgid at once.
Add functionality that enables users to move all threads in a threadgroup
at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This
current implementation makes use of a per-threadgroup rwsem that's taken
for reading in the fork() path to prevent newly forking threads within the
threadgroup from "escaping" while the move is in progress.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
for cgroups's subsystem interface. Unlike can_attach and attach, these
are for per-thread operations, to be called potentially many times when
attaching an entire threadgroup.
Also, the old "bool threadgroup" interface is removed, as replaced by
this. All subsystems are modified for the new interface - of note is
cpuset, which requires from/to nodemasks for attach to be globally scoped
(though per-cpuset would work too) to persist from its pre_attach to
attach_task and attach.
This is a pre-patch for cgroup-procs-writable.patch.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 569b846d ("memcg: coalesce uncharge during unmap/truncate"), we do
batched (delayed) uncharge at truncation/unmap. And since cdec2e42(memcg:
coalesce charging via percpu storage), we have percpu cache for
res_counter.
These changes improved performance of memory cgroup very much, but made
res_counter->usage usually have a bigger value than the actual value of
memory usage. So, *.usage_in_bytes, which show res_counter->usage, are
not desirable for precise values of memory(and swap) usage anymore.
Instead of removing these files completely(because we cannot know
res_counter->usage without them), this patch updates the meaning of those
files.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
The cgroup documentation does not specify how a process can be removed
from a particular group. This patch adds a note at the end of the
simple example about how this is done. Also, some cgroups (like
cpusets) require user input before a new group can be used. This is
noted in the patch as well.
Signed-off-by: Eric B Munson <emunson@mgebm.net>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cpuset related websited is changed.
and, update list of cpuset using cgroup(controller group).
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
We've found that we still get good, useful isolation at weights this
low. I'd like to adjust the minimum so that any other changes can take
these values into account.
Signed-off-by: Justin TerAvest <teravest@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Effectively, make group_isolation=1 the default and remove the tunable.
The setting group_isolation=0 was because by default we idle on
sync-noidle tree and on fast devices, this can be very harmful for
throughput.
However, this problem can also be addressed by tuning slice_idle and
possibly group_idle on faster storage devices.
This change simplifies the CFQ code by removing the feature entirely.
Signed-off-by: Justin TerAvest <teravest@google.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The memcg code does not allow changing memory.use_hierarchy if the
parent cgroup has enabled use_hierarchy. Update documentation to match
the code.
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Subjct: Revert memory cgroup dirty_ratio Documentation.
The commit ece72400c2 adds documentation
for memcg's dirty ratio. But the function is not implemented yet.
Remove the documentation for avoiding confusing users.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
The 'ns' cgroup is considered deprecated. Change the cgroup subsystem
used in the examples of the cgroup documentation from 'ns' to 'blkio'.
Signed-off-by: Trevor Woerner <twoerner@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
o Allow hierarchical cgroup creation for blkio controller
o Currently we disallow it as both the io controller policies (throttling
as well as proportion bandwidth) do not support hierarhical accounting
and control. But the flip side is that blkio controller can not be used with
libvirt as libvirt creates a cgroup hierarchy deeper than 1 level.
<top-level-cgroup-dir>/<controller>/libvirt/qemu/<virtual-machine-groups>
o So this patch will allow creation of cgroup hierarhcy but at the backend
everything will be treated as flat. So if somebody created a an hierarchy
like as follows.
root
/ \
test1 test2
|
test3
CFQ and throttling will practically treat all groups at same level.
pivot
/ | \ \
root test1 test2 test3
o Once we have actual support for hierarchical accounting and control
then we can introduce another cgroup tunable file "blkio.use_hierarchy"
which will be 0 by default but if user wants to enforce hierarhical
control then it can be set to 1. This way there should not be any
ABI problems down the line.
o The only not so pretty part is introduction of extra file "use_hierarchy"
down the line. Kame-san had mentioned that hierarhical accounting is
expensive in memory controller hence they keep it off by default. I
suspect same will be the case for IO controller also as for each IO
completion we shall have to account IO through hierarchy up to the root.
if yes, then it probably is not a very bad idea to introduce this extra
file so that it will be used only when somebody needs it and some people
might enable hierarchy only in part of the hierarchy.
o This is how basically memory controller also uses "use_hierarhcy" and
they also allowed creation of hierarchies when actual backend support
was not available.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
Tested-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The ns_cgroup is a control group interacting with the namespaces. When a
new namespace is created, a corresponding cgroup is automatically created
too. The cgroup name is the pid of the process who did 'unshare' or the
child of 'clone'.
This cgroup is tied with the namespace because it prevents a process to
escape the control group and use the post_clone callback, so the child
cgroup inherits the values of the parent cgroup.
Unfortunately, the more we use this cgroup and the more we are facing
problems with it:
(1) when a process unshares, the cgroup name may conflict with a
previous cgroup with the same pid, so unshare or clone return -EEXIST
(2) the cgroup creation is out of control because there may have an
application creating several namespaces where the system will
automatically create several cgroups in his back and let them on the
cgroupfs (eg. a vrf based on the network namespace).
(3) the mix of (1) and (2) force an administrator to regularly check
and clean these cgroups.
This patchset removes the ns_cgroup by adding a new flag to the cgroup and
the cgroupfs mount option. It enables the copy of the parent cgroup when
a child cgroup is created. We can then safely remove the ns_cgroup as
this flag brings a compatibility. We have now to manually create and add
the task to a cgroup, which is consistent with the cgroup framework.
This patch:
Sent as an answer to a previous thread around the ns_cgroup.
https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html
It adds a control file 'clone_children' for a cgroup. This control file
is a boolean specifying if the child cgroup should be a clone of the
parent cgroup or not. The default value is 'false'.
This flag makes the child cgroup to call the post_clone callback of all
the subsystem, if it is available.
At present, the cpuset is the only one which had implemented the
post_clone callback.
The option can be set at mount time by specifying the 'clone_children'
mount option.
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
Cc: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
cfq-iosched: Fix a gcc 4.5 warning and put some comments
block: Turn bvec_k{un,}map_irq() into static inline functions
block: fix accounting bug on cross partition merges
block: Make the integrity mapped property a bio flag
block: Fix double free in blk_integrity_unregister
block: Ensure physical block size is unsigned int
blkio-throttle: Fix possible multiplication overflow in iops calculations
blkio-throttle: limit max iops value to UINT_MAX
blkio-throttle: There is no need to convert jiffies to milli seconds
blkio-throttle: Fix link failure failure on i386
blkio: Recalculate the throttled bio dispatch time upon throttle limit change
blkio: Add root group to td->tg_list
blkio: deletion of a cgroup was causes oops
blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: revert bad fix for memory hotplug causing bounces
Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: Prevent hang_check firing during long I/O
cfq: improve fsync performance for small files
...
Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
Some documentation to provide help with tunables.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Below you will find an updated version from the original series bunching all patches into one big patch
updating broken web addresses that are located in Documentation/*
Some of the addresses date as far far back as 1995 etc... so searching became a bit difficult,
the best way to deal with these is to use web.archive.org to locate these addresses that are outdated.
Now there are also some addresses pointing to .spec files some are located, but some(after searching
on the companies site)where still no where to be found. In this case I just changed the address
to the company site this way the users can contact the company and they can locate them for the users.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Thomas Weber <weber@corscience.de>
Signed-off-by: Mike Frysinger <vapier.adi@gmail.com>
Cc: Paulo Marques <pmarques@grupopie.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Michael Neuling <mikey@neuling.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Some information are old, and I think current document doesn't work as "a
guide for users". We need summary of all of our controls, at least.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds support for moving charge of file pages, which include
normal file, tmpfs file and swaps of tmpfs file. It's enabled by setting
bit 1 of <target cgroup>/memory.move_charge_at_immigrate.
Unlike the case of anonymous pages, file pages(and swaps) in the range
mmapped by the task will be moved even if the task hasn't done page fault,
i.e. they might not be the task's "RSS", but other task's "RSS" that maps
the same file. And mapcount of the page is ignored(the page can be moved
even if page_mapcount(page) > 1). So, conditions that the page/swap
should be met to be moved is that it must be in the range mmapped by the
target task and it must be charged to the old cgroup.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>