OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Tejun Heo	755bf5ee86	cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write() Make cgroup_subtree_control_write() first calculate new subtree_control (new_sc), child_subsys_mask (new_ss) and css_enable/disable masks before applying them to the cgroup. Also, store the original subtree_control (old_sc) and child_subsys_mask (old_ss) and use them to restore the orignal state after failure. This patch shouldn't cause any behavior changes. This prepares for a fix for a bug in the async css offline wait logic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>	2014-11-18 02:49:50 -05:00
Tejun Heo	0f060deb5c	cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask() cgroup_refresh_child_subsys_mask() calculates and updates the effective @cgrp->child_subsys_maks according to the current @cgrp->subtree_control. Separate out the calculation part into cgroup_calc_child_subsys_mask(). This will be used to fix a bug in the async css offline wait logic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com>	2014-11-18 02:49:50 -05:00
Linus Torvalds	c798360cd1	Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "A lot of activities on percpu front. Notable changes are... - percpu allocator now can take @gfp. If @gfp doesn't contain GFP_KERNEL, it tries to allocate from what's already available to the allocator and a work item tries to keep the reserve around certain level so that these atomic allocations usually succeed. This will replace the ad-hoc percpu memory pool used by blk-throttle and also be used by the planned blkcg support for writeback IOs. Please note that I noticed a bug in how @gfp is interpreted while preparing this pull request and applied the fix `6ae833c7fe` ("percpu: fix how @gfp is interpreted by the percpu allocator") just now. - percpu_ref now uses longs for percpu and global counters instead of ints. It leads to more sparse packing of the percpu counters on 64bit machines but the overhead should be negligible and this allows using percpu_ref for refcnting pages and in-memory objects directly. - The switching between percpu and single counter modes of a percpu_ref is made independent of putting the base ref and a percpu_ref can now optionally be initialized in single or killed mode. This allows avoiding percpu shutdown latency for cases where the refcounted objects may be synchronously created and destroyed in rapid succession with only a fraction of them reaching fully operational status (SCSI probing does this when combined with blk-mq support). It's also planned to be used to implement forced single mode to detect underflow more timely for debugging. There's a separate branch percpu/for-3.18-consistent-ops which cleans up the duplicate percpu accessors. That branch causes a number of conflicts with s390 and other trees. I'll send a separate pull request w/ resolutions once other branches are merged" * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits) percpu: fix how @gfp is interpreted by the percpu allocator blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky percpu_ref: add PERCPU_REF_INIT_* flags percpu_ref: decouple switching to percpu mode and reinit percpu_ref: decouple switching to atomic mode and killing percpu_ref: add PCPU_REF_DEAD percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch percpu_ref: replace pcpu_ prefix with percpu_ percpu_ref: minor code and comment updates percpu_ref: relocate percpu_ref_reinit() Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" Revert "percpu: free percpu allocation info for uniprocessor system" percpu-refcount: make percpu_ref based on longs instead of ints percpu-refcount: improve WARN messages percpu: fix locking regression in the failure path of pcpu_alloc() percpu-refcount: add @gfp to percpu_ref_init() proportions: add @gfp to init functions percpu_counter: add @gfp to percpu_counter_init() percpu_counter: make percpu_counters_lock irq-safe ...	2014-10-10 07:26:02 -04:00
Linus Torvalds	b211e9d7c8	Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Nothing too interesting. Just a handful of cleanup patches" * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: Revert "cgroup: remove redundant variable in cgroup_mount()" cgroup: remove redundant variable in cgroup_mount() cgroup: fix missing unlock in cgroup_release_agent() cgroup: remove CGRP_RELEASABLE flag perf/cgroup: Remove perf_put_cgroup() cgroup: remove redundant check in cgroup_ino() cpuset: simplify proc_cpuset_show() cgroup: simplify proc_cgroup_show() cgroup: use a per-cgroup work for release agent cgroup: remove bogus comments cgroup: remove redundant code in cgroup_rmdir() cgroup: remove some useless forward declarations cgroup: fix a typo in comment.	2014-10-10 07:24:40 -04:00
Zefan Li	e756c7b698	Revert "cgroup: remove redundant variable in cgroup_mount()" This reverts commit `0c7bf3e8ca`. If there are child cgroups in the cgroupfs and then we umount it, the superblock will be destroyed but the cgroup_root will be kept around. When we mount it again, cgroup_mount() will find this cgroup_root and allocate a new sb for it. So with this commit we will be trapped in a dead loop in the case described above, because kernfs_pin_sb() keeps returning NULL. Currently I don't see how we can avoid using both pinned_sb and new_sb, so just revert it. Cc: Al Viro <viro@ZenIV.linux.org.uk> Reported-by: Andrey Wagin <avagin@gmail.com> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-26 00:16:23 -04:00
Tejun Heo	2aad2a86f6	percpu_ref: add PERCPU_REF_INIT_* flags With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode. This patch adds @flags to percpu_ref_init() and implements the following flags. * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode * PERCPU_REF_INIT_DEAD : start ref killed and drained These flags should be able to serve the above two use cases. v2: target_core_tpg.c conversion was missing. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-09-24 13:31:50 -04:00
Tejun Heo	d06efebf0c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18 This is to receive `0a30288da1` ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") which implements __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The commit reverted and patches to implement proper fix will be added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>	2014-09-24 13:00:21 -04:00
Zefan Li	0c7bf3e8ca	cgroup: remove redundant variable in cgroup_mount() Both pinned_sb and new_sb indicate if a new superblock is needed, so we can just remove new_sb. Note now we must check if kernfs_tryget_sb() returns NULL, because when it returns NULL, kernfs_mount() may still re-use an existing superblock, which is just allocated by another concurent mount. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-20 13:09:35 -04:00
Zefan Li	3e2cd91ab9	cgroup: fix missing unlock in cgroup_release_agent() The patch 971ff4935538: "cgroup: use a per-cgroup work for release agent" from Sep 18, 2014, leads to the following static checker warning: kernel/cgroup.c:5310 cgroup_release_agent() warn: 'mutex:&cgroup_mutex' is sometimes locked here and sometimes unlocked. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-20 12:23:35 -04:00
Zefan Li	a25eb52e81	cgroup: remove CGRP_RELEASABLE flag We call put_css_set() after setting CGRP_RELEASABLE flag in cgroup_task_migrate(), but in other places we call it without setting the flag. I don't see the necessity of this flag. Moreover once the flag is set, it will never be cleared, unless writing to the notify_on_release control file, so it can be quite confusing if we look at the output of debug.releasable. # mount -t cgroup -o debug xxx /cgroup # mkdir /cgroup/child # cat /cgroup/child/debug.releasable 0 <-- shows 0 though the cgroup is empty # echo $$ > /cgroup/child/tasks # cat /cgroup/child/debug.releasable 0 # echo $$ > /cgroup/tasks && echo $$ > /cgroup/child/tasks # cat /proc/child/debug.releasable 1 <-- shows 1 though the cgroup is not empty This patch removes the flag, and now debug.releasable shows if the cgroup is empty or not. Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-19 09:29:32 -04:00
Zefan Li	006f4ac497	cgroup: simplify proc_cgroup_show() Use the ONE macro instead of REG, and we can simplify proc_cgroup_show(). Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-18 13:27:23 -04:00
Zefan Li	971ff49355	cgroup: use a per-cgroup work for release agent Instead of using a global work to schedule release agent on removable cgroups, we change to use a per-cgroup work to do this, which makes the code much simpler. v2: use a dedicated work instead of reusing css->destroy_work. (Tejun) Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-18 13:14:22 -04:00
Zefan Li	eb4aec84d6	cgroup: fix unbalanced locking cgroup_pidlist_start() holds cgrp->pidlist_mutex and then calls pidlist_array_load(), and cgroup_pidlist_stop() releases the mutex. It is wrong that we release the mutex in the failure path in pidlist_array_load(), because cgroup_pidlist_stop() will be called no matter if cgroup_pidlist_start() returns errno or not. Fixes: `4bac00d16a` Cc: <stable@vger.kernel.org> # 3.14+ Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Cong Wang <xiyou.wangcong@gmail.com>	2014-09-18 12:32:52 -04:00
Li Zefan	0c8fc2c121	cgroup: remove bogus comments We never grab cgroup mutex in fork and exit paths no matter whether notify_on_release is set or not. Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-18 06:34:16 +09:00
Li Zefan	244bb9a633	cgroup: remove redundant code in cgroup_rmdir() We no longer clear kn->priv in cgroup_rmdir(), so we don't need to get an extra refcnt. Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-18 06:34:15 +09:00
Li Zefan	6213daab25	cgroup: remove some useless forward declarations Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-18 06:34:15 +09:00
Tejun Heo	9253b279f4	Merge branch 'for-3.17-fixes' of ra.kernel.org:/pub/scm/linux/kernel/git/tj/cgroup into for-3.18 Pull to receive `a4189487da` ("cgroup: delay the clearing of cgrp->kn->priv") for the scheduled clean up patches. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-18 06:29:05 +09:00
Tejun Heo	a34375ef9e	percpu-refcount: add @gfp to percpu_ref_init() Percpu allocator now supports allocation mask. Add @gfp to percpu_ref_init() so that !GFP_KERNEL allocation masks can be used with percpu_refs too. This patch doesn't make any functional difference. v2: blk-mq conversion was missing. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <koverstreet@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Jens Axboe <axboe@kernel.dk>	2014-09-08 09:51:30 +09:00
Li Zefan	aa32362f01	cgroup: check cgroup liveliness before unbreaking kernfs When cgroup_kn_lock_live() is called through some kernfs operation and another thread is calling cgroup_rmdir(), we'll trigger the warning in cgroup_get(). ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1228 at kernel/cgroup.c:1034 cgroup_get+0x89/0xa0() ... Call Trace: [<c16ee73d>] dump_stack+0x41/0x52 [<c10468ef>] warn_slowpath_common+0x7f/0xa0 [<c104692d>] warn_slowpath_null+0x1d/0x20 [<c10bb999>] cgroup_get+0x89/0xa0 [<c10bbe58>] cgroup_kn_lock_live+0x28/0x70 [<c10be3c1>] __cgroup_procs_write.isra.26+0x51/0x230 [<c10be5b2>] cgroup_tasks_write+0x12/0x20 [<c10bb7b0>] cgroup_file_write+0x40/0x130 [<c11aee71>] kernfs_fop_write+0xd1/0x160 [<c1148e58>] vfs_write+0x98/0x1e0 [<c114934d>] SyS_write+0x4d/0xa0 [<c16f656b>] sysenter_do_call+0x12/0x12 ---[ end trace 6f2e0c38c2108a74 ]--- Fix this by calling css_tryget() instead of cgroup_get(). v2: - move cgroup_tryget() right below cgroup_get() definition. (Tejun) Cc: <stable@vger.kernel.org> # 3.15+ Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-05 01:36:19 +09:00
Li Zefan	a4189487da	cgroup: delay the clearing of cgrp->kn->priv Run these two scripts concurrently: for ((; ;)) { mkdir /cgroup/sub rmdir /cgroup/sub } for ((; ;)) { echo $$ > /cgroup/sub/cgroup.procs echo $$ > /cgroup/cgroup.procs } A kernel bug will be triggered: BUG: unable to handle kernel NULL pointer dereference at 00000038 IP: [<c10bbd69>] cgroup_put+0x9/0x80 ... Call Trace: [<c10bbe19>] cgroup_kn_unlock+0x39/0x50 [<c10bbe91>] cgroup_kn_lock_live+0x61/0x70 [<c10be3c1>] __cgroup_procs_write.isra.26+0x51/0x230 [<c10be5b2>] cgroup_tasks_write+0x12/0x20 [<c10bb7b0>] cgroup_file_write+0x40/0x130 [<c11aee71>] kernfs_fop_write+0xd1/0x160 [<c1148e58>] vfs_write+0x98/0x1e0 [<c114934d>] SyS_write+0x4d/0xa0 [<c16f656b>] sysenter_do_call+0x12/0x12 We clear cgrp->kn->priv in the end of cgroup_rmdir(), but another concurrent thread can access kn->priv after the clearing. We should move the clearing to css_release_work_fn(). At that time no one is holding reference to the cgroup and no one can gain a new reference to access it. v2: - move RCU_INIT_POINTER() into the else block. (Tejun) - remove the cgroup_parent() check. (Tejun) - update the comment in css_tryget_online_from_dir(). Cc: <stable@vger.kernel.org> # 3.15+ Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-09-05 01:36:18 +09:00
Dongsheng Yang	251f8c0364	cgroup: fix a typo in comment. There is no function named cgroup_enable_task_cg_links(). Instead, the correct function name in this comment should be cgroup_enabled_task_cg_lists(). Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-08-25 10:49:29 -04:00
Vivek Goyal	fa8137be6b	cgroup: Display legacy cgroup files on default hierarchy Kernel command line parameter cgroup__DEVEL__legacy_files_on_dfl forces legacy cgroup files to show up on default hierarhcy if susbsystem does not have any files defined for default hierarchy. But this seems to be working only if legacy files are defined in ss->legacy_cftypes. If one adds some cftypes later using cgroup_add_legacy_cftypes(), these files don't show up on default hierarchy. Update the function accordingly so that the dynamically added legacy files also show up in the default hierarchy if the target subsystem is also using the base legacy files for the default hierarchy. tj: Patch description and comment updates. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-08-22 13:20:40 -04:00
Alban Crequy	71b1fb5c44	cgroup: reject cgroup names with '\n' /proc/<pid>/cgroup contains one cgroup path on each line. If cgroup names are allowed to contain "\n", applications cannot parse /proc/<pid>/cgroup safely. Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2014-08-18 10:18:57 -04:00
Linus Torvalds	47dfe4037e	Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Mostly changes to get the v2 interface ready. The core features are mostly ready now and I think it's reasonable to expect to drop the devel mask in one or two devel cycles at least for a subset of controllers. - cgroup added a controller dependency mechanism so that block cgroup can depend on memory cgroup. This will be used to finally support IO provisioning on the writeback traffic, which is currently being implemented. - The v2 interface now uses a separate table so that the interface files for the new interface are explicitly declared in one place. Each controller will explicitly review and add the files for the new interface. - cpuset is getting ready for the hierarchical behavior which is in the similar style with other controllers so that an ancestor's configuration change doesn't change the descendants' configurations irreversibly and processes aren't silently migrated when a CPU or node goes down. All the changes are to the new interface and no behavior changed for the multiple hierarchies" * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits) cpuset: fix the WARN_ON() in update_nodemasks_hier() cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core cgroup: distinguish the default and legacy hierarchies when handling cftypes cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes cgroup: split cgroup_base_files[] into cgroup_{dfl\|legacy}_base_files[] cpuset: export effective masks to userspace cpuset: allow writing offlined masks to cpuset.cpus/mems cpuset: enable onlined cpu/node in effective masks cpuset: refactor cpuset_hotplug_update_tasks() cpuset: make cs->{cpus, mems}_allowed as user-configured masks cpuset: apply cs->effective_{cpus,mems} cpuset: initialize top_cpuset's configured masks at mount cpuset: use effective cpumask to build sched domains cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty cpuset: update cs->effective_{cpus, mems} when config changes cpuset: update cpuset->effective_{cpus,mems} at hotplug cpuset: add cs->effective_cpus and cs->effective_mems cgroup: clean up sane_behavior handling ...	2014-08-04 10:11:28 -07:00
Linus Torvalds	f2a84170ed	Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: - Major reorganization of percpu header files which I think makes things a lot more readable and logical than before. - percpu-refcount is updated so that it requires explicit destruction and can be reinitialized if necessary. This was pulled into the block tree to replace the custom percpu refcnting implemented in blk-mq. - In the process, percpu and percpu-refcount got cleaned up a bit * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits) percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero() percpu-refcount: require percpu_ref to be exited explicitly percpu-refcount: use unsigned long for pcpu_count pointer percpu-refcount: add helpers for ->percpu_count accesses percpu-refcount: one bit is enough for REF_STATUS percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc() workqueue: stronger test in process_one_work() workqueue: clear POOL_DISASSOCIATED in rebind_workers() percpu: Use ALIGN macro instead of hand coding alignment calculation percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations percpu: preffity percpu header files percpu: use raw_cpu_() to define __this_cpu_() percpu: reorder macros in percpu header files percpu: move {raw\|this}_cpu_() definitions to include/linux/percpu-defs.h percpu: move generic {raw\|this}_cpu__N() definitions to include/asm-generic/percpu.h percpu: only allow sized arch overrides for {raw\|this}_cpu_*() ops percpu: reorganize include/linux/percpu-defs.h percpu: move accessors from include/linux/percpu.h to percpu-defs.h percpu: include/asm-generic/percpu.h should contain only arch-overridable parts percpu: introduce arch_raw_cpu_ptr() ...	2014-08-04 10:09:27 -07:00
Tejun Heo	5de4fa13c4	cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test cgrp_dfl_root_inhibit_ss_mask determines which subsystems are not supported on the default hierarchy and is currently initialized statically and just includes the debug subsystem. Now that there's cgroup_subsys->dfl_files, we can easily tell which subsystems support the default hierarchy or not. Let's initialize cgrp_dfl_root_inhibit_ss_mask by testing whether cgroup_subsys->dfl_files is NULL. After all, subsystems with NULL ->dfl_files aren't useable on the default hierarchy anyway. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-07-15 11:05:10 -04:00
Tejun Heo	05ebb6e60f	cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core cgroup now distinguishes cftypes for the default and legacy hierarchies more explicitly by using separate arrays and CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE should be and are used only inside cgroup core proper. Let's make it clear that the flags are internal by prefixing them with double underscores. CFTYPE_INSANE is renamed to __CFTYPE_NOT_ON_DFL for consistency. The two flags are also collected and assigned bits >= 16 so that they aren't mixed with the published flags. v2: Convert the extra ones in cgroup_exit_cftypes() which are added by revision to the previous patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-07-15 11:05:10 -04:00
Tejun Heo	a8ddc8215e	cgroup: distinguish the default and legacy hierarchies when handling cftypes Until now, cftype arrays carried files for both the default and legacy hierarchies and the files which needed to be used on only one of them were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This gets confusing very quickly and we may end up exposing interface files to the default hierarchy without thinking it through. This patch makes cgroup core provide separate sets of interfaces for cftype handling so that the cftypes for the default and legacy hierarchies are clearly distinguished. The previous two patches renamed the existing ones so that they clearly indicate that they're for the legacy hierarchies. This patch adds the interface for the default hierarchy and apply them selectively depending on the hierarchy type. * cftypes added through cgroup_subsys->dfl_cftypes and cgroup_add_dfl_cftypes() only show up on the default hierarchy. * cftypes added through cgroup_subsys->legacy_cftypes and cgroup_add_legacy_cftypes() only show up on the legacy hierarchies. * cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the same array for the cases where the interface files are identical on both types of hierarchies. * This makes all the existing subsystem interface files legacy-only by default and all subsystems will have no interface file created when enabled on the default hierarchy. Each subsystem should explicitly review and compose the interface for the default hierarchy. * A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which makes subsystems which haven't decided the interface files for the default hierarchy to present the legacy files on the default hierarchy so that its behavior on the default hierarchy can be tested. As the awkward name suggests, this is for development only. * memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole array isn't used on the default hierarchy. The flag is removed. v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl. v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2014-07-15 11:05:10 -04:00
Tejun Heo	2cf669a58d	cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() Currently, cftypes added by cgroup_add_cftypes() are used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype addition functions and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch adds cgroup_add_legacy_cftypes() which currently is a simple wrapper around cgroup_add_cftypes() and replaces all cgroup_add_cftypes() usages with it. While at it, this patch drops a completely spurious return from __hugetlb_cgroup_file_init(). This patch doesn't introduce any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2014-07-15 11:05:09 -04:00
Tejun Heo	5577964e64	cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes Currently, cgroup_subsys->base_cftypes is used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype arrays and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch renames cgroup_subsys->base_cftypes to cgroup_subsys->legacy_cftypes. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2014-07-15 11:05:09 -04:00
Tejun Heo	a14c6874be	cgroup: split cgroup_base_files[] into cgroup_{dfl\|legacy}_base_files[] Currently cgroup_base_files[] contains the cgroup core interface files for both legacy and default hierarchies with each file tagged with CFTYPE_INSANE and CFTYPE_ONLY_ON_DFL. This is difficult to read. Let's separate it out to two separate tables, cgroup_dfl_base_files[] and cgroup_legacy_base_files[], and use the appropriate one in cgroup_mkdir() depending on the hierarchy type. This makes tagging each file unnecessary. This patch doesn't introduce any behavior changes. v2: cgroup_dfl_base_files[] was missing the termination entry triggering WARN in cgroup_init_cftypes() for 0day kernel testing robot. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jet Chen <jet.chen@intel.com>	2014-07-15 11:05:09 -04:00
Tejun Heo	7b9a6ba56e	cgroup: clean up sane_behavior handling After the previous patch to remove sane_behavior support from non-default hierarchies, CGRP_ROOT_SANE_BEHAVIOR is used only to indicate the default hierarchy while parsing mount options. This patch makes the following cleanups around it. * Don't show it in the mount option. Eventually the default hierarchy will be assigned a different filesystem type. * As sane_behavior is no longer effective on non-default hierarchies and the default hierarchy doesn't accept any mount options, parse_cgroupfs_options() can consider sane_behavior mount option as indicating the default hierarchy and fail if any other options are specified with it. While at it, remove one of the double blank lines in the function. * cgroup_mount() can now simply test CGRP_ROOT_SANE_BEHAVIOR to tell whether to mount the default hierarchy or not. * As CGROUP_ROOT_SANE_BEHAVIOR's only role now is indicating whether to select the default hierarchy or not during mount, it doesn't need to be set in the default hierarchy itself. cgroup_init_early() updated accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-07-09 10:08:08 -04:00
Tejun Heo	aa6ec29bee	cgroup: remove sane_behavior support on non-default hierarchies sane_behavior has been used as a development vehicle for the default unified hierarchy. Now that the default hierarchy is in place, the flag became redundant and confusing as its usage is allowed on all hierarchies. There are gonna be either the default hierarchy or legacy ones. Let's make that clear by removing sane_behavior support on non-default hierarchies. This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of cgroup_on_dfl() with sane_behavior specific part dropped. On the default and legacy hierarchies w/o sane_behavior, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz>	2014-07-09 10:08:08 -04:00
Tejun Heo	c1d5d42efd	cgroup: make interface file "cgroup.sane_behavior" legacy-only "cgroup.sane_behavior" is added to help distinguishing whether sane_behavior is in effect or not. We now have the default hierarchy where the flag is always in effect and are planning to remove supporting sane behavior on the legacy hierarchies making this file on the default hierarchy rather pointless. Let's make it legacy only and thus always zero. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-07-09 10:08:08 -04:00
Tejun Heo	7450e90bbb	cgroup: remove CGRP_ROOT_OPTION_MASK cgroup_root->flags only contains CGRP_ROOT_* flags and there's no reason to mask the flags. Remove CGRP_ROOT_OPTION_MASK. This doesn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-07-09 10:08:07 -04:00
Tejun Heo	af0ba6789c	cgroup: implement cgroup_subsys->depends_on Currently, the blkio subsystem attributes all of writeback IOs to the root. One of the issues is that there's no way to tell who originated a writeback IO from block layer. Those IOs are usually issued asynchronously from a task which didn't have anything to do with actually generating the dirty pages. The memory subsystem, when enabled, already keeps track of the ownership of each dirty page and it's desirable for blkio to piggyback instead of adding its own per-page tag. blkio piggybacking on memory is an implementation detail which preferably should be handled automatically without requiring explicit userland action. To achieve that, this patch implements cgroup_subsys->depends_on which contains the mask of subsystems which should be enabled together when the subsystem is enabled. The previous patches already implemented the support for enabled but invisible subsystems and cgroup_subsys->depends_on can be easily implemented by updating cgroup_refresh_child_subsys_mask() so that it calculates cgroup->child_subsys_mask considering cgroup_subsys->depends_on of the explicitly enabled subsystems. Documentation/cgroups/unified-hierarchy.txt is updated to explain that subsystems may not become immediately available after being unused from userland and that dependency could be a factor in it. As subsystems may already keep residual references, this doesn't significantly change how subsystem rebinding can be used. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>	2014-07-08 18:02:57 -04:00
Tejun Heo	b4536f0cab	cgroup: implement cgroup_subsys->css_reset() cgroup is implementing support for subsystem dependency which would require a way to enable a subsystem even when it's not directly configured through "cgroup.subtree_control". The previous patches added support for explicitly and implicitly enabled subsystems and showing/hiding their interface files. An explicitly enabled subsystem may become implicitly enabled if it's turned off through "cgroup.subtree_control" but there are subsystems depending on it. In such cases, the subsystem, as it's turned off when seen from userland, shouldn't enforce any resource control. Also, the subsystem may be explicitly turned on later again and its interface files should be as close to the intial state as possible. This patch adds cgroup_subsys->css_reset() which is invoked when a css is hidden. The callback should disable resource control and reset the state to the vanilla state. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>	2014-07-08 18:02:57 -04:00
Tejun Heo	f63070d350	cgroup: make interface files visible iff enabled on cgroup->subtree_control cgroup is implementing support for subsystem dependency which would require a way to enable a subsystem even when it's not directly configured through "cgroup.subtree_control". The preceding patch distinguished cgroup->subtree_control and ->child_subsys_mask where the former is the subsystems explicitly configured by the userland and the latter is all enabled subsystems currently is equal to the former but will include subsystems implicitly enabled through dependency. Subsystems which are enabled due to dependency shouldn't be visible to userland. This patch updates cgroup_subtree_control_write() and create_css() such that interface files are not created for implicitly enabled subsytems. * @visible paramter is added to create_css(). Interface files are created only when true. * If an already implicitly enabled subsystem is turned on through "cgroup.subtree_control", the existing css should be used. css draining is skipped. * cgroup_subtree_control_write() computes the new target cgroup->child_subsys_mask and create/kill or show/hide csses accordingly. As the two subsystem masks are still kept identical, this patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>	2014-07-08 18:02:57 -04:00
Tejun Heo	667c249171	cgroup: introduce cgroup->subtree_control cgroup is implementing support for subsystem dependency which would require a way to enable a subsystem even when it's not directly configured through "cgroup.subtree_control". Previously, cgroup->child_subsys_mask directly reflected "cgroup.subtree_control" and the enabled subsystems in the child cgroups. This patch adds cgroup->subtree_control which "cgroup.subtree_control" operates on. cgroup->child_subsys_mask is now calculated from cgroup->subtree_control by cgroup_refresh_child_subsys_mask(), which sets it identical to cgroup->subtree_control for now. This will allow using cgroup->child_subsys_mask for all the enabled subsystems including the implicit ones and ->subtree_control for tracking the explicitly requested ones. This patch keeps the two masks identical and doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>	2014-07-08 18:02:56 -04:00
Tejun Heo	c29adf24e0	cgroup: reorganize cgroup_subtree_control_write() Make the following two reorganizations to cgroup_subtree_control_write(). These are to prepare for future changes and shouldn't cause any functional difference. * Move availability above css offlining wait. * Move cgrp->child_subsys_mask update above new css creation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org>	2014-07-08 18:02:56 -04:00
Li Zefan	3a32bd72d7	cgroup: fix a race between cgroup_mount() and cgroup_kill_sb() We've converted cgroup to kernfs so cgroup won't be intertwined with vfs objects and locking, but there are dark areas. Run two instances of this script concurrently: for ((; ;)) { mount -t cgroup -o cpuacct xxx /cgroup umount /cgroup } After a while, I saw two mount processes were stuck at retrying, because they were waiting for a subsystem to become free, but the root associated with this subsystem never got freed. This can happen, if thread A is in the process of killing superblock but hasn't called percpu_ref_kill(), and at this time thread B is mounting the same cgroup root and finds the root in the root list and performs percpu_ref_try_get(). To fix this, we try to increase both the refcnt of the superblock and the percpu refcnt of cgroup root. v2: - we should try to get both the superblock refcnt and cgroup_root refcnt, because cgroup_root may have no superblock assosiated with it. - adjust/add comments. tj: Updated comments. Renamed @sb to @pinned_sb. Cc: <stable@vger.kernel.org> # 3.15 Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-06-30 10:16:26 -04:00
Li Zefan	970317aa48	cgroup: fix mount failure in a corner case # cat test.sh #! /bin/bash mount -t cgroup -o cpu xxx /cgroup umount /cgroup mount -t cgroup -o cpu,cpuacct xxx /cgroup umount /cgroup # ./test.sh mount: xxx already mounted or /cgroup busy mount: according to mtab, xxx is already mounted on /cgroup It's because the cgroupfs_root of the first mount was under destruction asynchronously. Fix this by delaying and then retrying mount for this case. v3: - put the refcnt immediately after getting it. (Tejun) v2: - use percpu_ref_tryget_live() rather that introducing percpu_ref_alive(). (Tejun) - adjust comment. tj: Updated the comment a bit. Cc: <stable@vger.kernel.org> # 3.15 Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-06-30 10:16:25 -04:00
Tejun Heo	9a1049da9b	percpu-refcount: require percpu_ref to be exited explicitly Currently, a percpu_ref undoes percpu_ref_init() automatically by freeing the allocated percpu area when the percpu_ref is killed. While seemingly convenient, this has the following niggles. * It's impossible to re-init a released reference counter without going through re-allocation. * In the similar vein, it's impossible to initialize a percpu_ref count with static percpu variables. * We need and have an explicit destructor anyway for failure paths - percpu_ref_cancel_init(). This patch removes the automatic percpu counter freeing in percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a generic destructor now named percpu_ref_exit(). percpu_ref_destroy() is considered but it gets confusing with percpu_ref_kill() while "exit" clearly indicates that it's the counterpart of percpu_ref_init(). All percpu_ref_cancel_init() users are updated to invoke percpu_ref_exit() instead and explicit percpu_ref_exit() calls are added to the destruction path of all percpu_ref users. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Li Zefan <lizefan@huawei.com>	2014-06-28 08:10:14 -04:00
Li Zefan	99bae5f941	cgroup: fix broken css_has_online_children() After running: # mount -t cgroup cpu xxx /cgroup && mkdir /cgroup/sub && \ rmdir /cgroup/sub && umount /cgroup I found the cgroup root still existed: # cat /proc/cgroups #subsys_name hierarchy num_cgroups enabled cpuset 0 1 1 cpu 1 1 1 ... It turned out css_has_online_children() is broken. Signed-off-by: Li Zefan <lizefan@huawei.com> Sigend-off-by: Tejun Heo <tj@kernel.org>	2014-06-17 18:52:53 -04:00
Linus Torvalds	14208b0ec5	Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on cgroup side. Heavy restructuring including locking simplification took place to improve the code base and enable implementation of the unified hierarchy, which currently exists behind a __DEVEL__ mount option. The core support is mostly complete but individual controllers need further work. To explain the design and rationales of the the unified hierarchy Documentation/cgroups/unified-hierarchy.txt is added. Another notable change is css (cgroup_subsys_state - what each controller uses to identify and interact with a cgroup) iteration update. This is part of continuing updates on css object lifetime and visibility. cgroup started with reference count draining on removal way back and is now reaching a point where csses behave and are iterated like normal refcnted objects albeit with some complexities to allow distinguishing the state where they're being deleted. The css iteration update isn't taken advantage of yet but is planned to be used to simplify memcg significantly" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits) cgroup: disallow disabled controllers on the default hierarchy cgroup: don't destroy the default root cgroup: disallow debug controller on the default hierarchy cgroup: clean up MAINTAINERS entries cgroup: implement css_tryget() device_cgroup: use css_has_online_children() instead of has_children() cgroup: convert cgroup_has_live_children() into css_has_online_children() cgroup: use CSS_ONLINE instead of CGRP_DEAD cgroup: iterate cgroup_subsys_states directly cgroup: introduce CSS_RELEASED and reduce css iteration fallback window cgroup: move cgroup->serial_nr into cgroup_subsys_state cgroup: link all cgroup_subsys_states in their sibling lists cgroup: move cgroup->sibling and ->children into cgroup_subsys_state cgroup: remove cgroup->parent device_cgroup: remove direct access to cgroup->children memcg: update memcg_has_children() to use css_next_child() memcg: remove tasks/children test from mem_cgroup_force_empty() cgroup: remove css_parent() cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css cgroup: use cgroup->self.refcnt for cgroup refcnting ...	2014-06-09 15:03:33 -07:00
Li Zefan	c731ae1d0f	cgroup: disallow disabled controllers on the default hierarchy After booting with cgroup_disable=memory, I still saw memcg files in the default hierarchy, and I can write to them, though it won't take effect. # dmesg ... Disabling memory control group subsystem ... # mount -t cgroup -o __DEVEL__sane_behavior xxx /cgroup # ls /cgroup ... memory.failcnt memory.move_charge_at_immigrate memory.force_empty memory.numa_stat memory.limit_in_bytes memory.oom_control ... # cat /cgroup/memory.usage_in_bytes 0 tj: Minor comment update. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-06-05 09:52:51 -04:00
Li Zefan	1f779fb28a	cgroup: don't destroy the default root The default root is allocated and initialized at boot phase, so we shouldn't destroy the default root when it's umounted, otherwise it will lead to disaster. Just try mount and then umount the default root, and the kernel will crash immediately. v2: - No need to check for CSS_NO_REF in cgroup_get/put(). (Tejun) - Better call cgroup_put() for the default root in kill_sb(). (Tejun) - Add a comment. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-06-04 21:21:51 -04:00
Jianyu Zhan	c9482a5bdc	kernfs: move the last knowledge of sysfs out from kernfs There is still one residue of sysfs remaining: the sb_magic SYSFS_MAGIC. However this should be kernfs user specific, so this patch moves it out. Kerrnfs user should specify their magic number while mouting. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-03 08:11:18 -07:00
Tejun Heo	5533e01144	cgroup: disallow debug controller on the default hierarchy The debug controller, as its name suggests, exposes cgroup core internals to userland to aid debugging. Unfortunately, except for the name, there's no provision to prevent its usage in production configurations and the controller is widely enabled and mounted leaking internal details to userland. Like most other debug information, the information exposed by debug isn't interesting even for debugging itself once the related parts are working reliably. This controller has no reason for existing. This patch implements cgrp_dfl_root_inhibit_ss_mask which can suppress specific subsystems on the default hierarchy and adds the debug subsystem to it so that it can be gradually deprecated as usages move towards the unified hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-19 16:37:06 -04:00
Tejun Heo	f3d4650015	cgroup: convert cgroup_has_live_children() into css_has_online_children() Now that cgroup liveliness and css onliness are the same state, convert cgroup_has_live_children() into css_has_online_children() so that it can be used for actual csses too. The function now uses css_for_each_child() for iteration and is published. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:52 -04:00
Tejun Heo	184faf3232	cgroup: use CSS_ONLINE instead of CGRP_DEAD Use CSS_ONLINE on the self css to indicate whether a cgroup has been killed instead of CGRP_DEAD. This will allow re-using css online test for cgroup liveliness test. This doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:51 -04:00
Tejun Heo	c2931b70a3	cgroup: iterate cgroup_subsys_states directly Currently, css_next_child() is implemented as finding the next child cgroup which has the css enabled, which used to be the only way to do it as only cgroups participated in sibling lists and thus could be iteratd. This works as long as what's required during iteration is not missing online csses; however, it turns out that there are use cases where offlined but not yet released csses need to be iterated. This is difficult to implement through cgroup iteration the unified hierarchy as there may be multiple dying csses for the same subsystem associated with single cgroup. After the recent changes, the cgroup self and regular csses behave identically in how they're linked and unlinked from the sibling lists including assertion of CSS_RELEASED and css_next_child() can simply switch to iterating csses directly. This both simplifies the logic and ensures that all visible non-released csses are included in the iteration whether there are multiple dying csses for a subsystem or not. As all other iterators depend on css_next_child() for sibling iteration, this changes behaviors of all css iterators. Add and update explanations on the css states which are included in traversal to all iterators. As css iteration could always contain offlined csses, this shouldn't break any of the current users and new usages which need iteration of all on and offline csses can make use of the new semantics. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-05-16 13:22:51 -04:00
Tejun Heo	de3f034182	cgroup: introduce CSS_RELEASED and reduce css iteration fallback window css iterations allow the caller to drop RCU read lock. As long as the caller keeps the current position accessible, it can simply re-grab RCU read lock later and continue iteration. This is achieved by using CGRP_DEAD to detect whether the current positions next pointer is safe to dereference and if not re-iterate from the beginning to the next position using ->serial_nr. CGRP_DEAD is used as the marker to invalidate the next pointer and the only requirement is that the marker is set before the next sibling starts its RCU grace period. Because CGRP_DEAD is set at the end of cgroup_destroy_locked() but the cgroup is unlinked when the reference count reaches zero, we currently have a rather large window where this fallback re-iteration logic can be triggered. This patch introduces CSS_RELEASED which is set when a css is unlinked from its sibling list. This still keeps the re-iteration logic working while drastically reducing the window of its activation. While at it, rewrite the comment in css_next_child() to reflect the new flag and better explain the synchronization. This will also enable iterating csses directly instead of through cgroups. v2: CSS_RELEASED now assigned to 1 << 2 as 1 << 0 is used by CSS_NO_REF. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:49 -04:00
Tejun Heo	0cb51d71c1	cgroup: move cgroup->serial_nr into cgroup_subsys_state We're moving towards using cgroup_subsys_states as the fundamental structural blocks. All csses including the cgroup->self and actual ones now form trees through css->children and ->sibling which follow the same rules as what cgroup->children and ->sibling followed. This patch moves cgroup->serial_nr which is used to implement css iteration into css. Note that all csses, regardless of their types, allocate their serial numbers from the same monotonically increasing counter. This doesn't affect the ordering needed by css iteration or cause any other material behavior changes. This will be used to update css iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:49 -04:00
Tejun Heo	1fed1b2e36	cgroup: link all cgroup_subsys_states in their sibling lists Currently, while all csses have ->children and ->sibling, only the self csses of cgroups make use of them. This patch makes all other csses to link themselves on the sibling lists too. This will be used to update css iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:49 -04:00
Tejun Heo	d5c419b68e	cgroup: move cgroup->sibling and ->children into cgroup_subsys_state We're moving towards using cgroup_subsys_states as the fundamental structural blocks. Let's move cgroup->sibling and ->children into cgroup_subsys_state. This is pure move without functional change and only cgroup->self's fields are actually used. Other csses will make use of the fields later. While at it, update init_and_link_css() so that it zeroes the whole css before initializing it and remove explicit zeroing of ->flags. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:48 -04:00
Tejun Heo	d51f39b05c	cgroup: remove cgroup->parent cgroup->parent is redundant as cgroup->self.parent can also be used to determine the parent cgroup and we're moving towards using cgroup_subsys_states as the fundamental structural blocks. This patch introduces cgroup_parent() which follows cgroup->self.parent and removes cgroup->parent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-16 13:22:48 -04:00
Tejun Heo	5c9d535b89	cgroup: remove css_parent() cgroup in general is moving towards using cgroup_subsys_state as the fundamental structural component and css_parent() was introduced to convert from using cgroup->parent to css->parent. It was quite some time ago and we're moving forward with making css more prominent. This patch drops the trivial wrapper css_parent() and let the users dereference css->parent. While at it, explicitly mark fields of css which are public and immutable. v2: New usage from device_cgroup.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-05-16 13:22:48 -04:00
Tejun Heo	3b514d24e2	cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css `9395a45004` ("cgroup: enable refcnting for root csses") enabled reference counting for root csses (cgroup_subsys_states) so that cgroup's self csses can be used to manage the lifetime of the containing cgroups. Unfortunately, this change was incorrect. During early init, cgrp_dfl_root self css refcnt is used. percpu_ref can't initialized during early init and its initialization is deferred till cgroup_init() time. This means that cpu was using percpu_ref which wasn't properly initialized. Due to the way percpu variables are laid out on x86, this didn't blow up immediately on x86 but ended up incrementing and decrementing the percpu variable at offset zero, whatever it may be; however, on other archs, this caused fault and early boot failure. As cgroup self csses for root cgroups of non-dfl hierarchies need working refcounting, we can't revert `9395a45004`. This patch adds CSS_NO_REF which explicitly inhibits reference counting on the css and sets it on all normal (non-self) csses and cgroup_dfl_root self css. v2: cgrp_dfl_root.self is the offending one. Set the flag on it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Warren <swarren@nvidia.com> Tested-by: Stephen Warren <swarren@nvidia.com> Fixes: `9395a45004` ("cgroup: enable refcnting for root csses")	2014-05-16 13:22:47 -04:00
Tejun Heo	9d755d33f0	cgroup: use cgroup->self.refcnt for cgroup refcnting Currently cgroup implements refcnting separately using atomic_t cgroup->refcnt. The destruction paths of cgroup and css are rather complex and bear a lot of similiarities including the use of RCU and bouncing to a work item. This patch makes cgroup use the refcnt of self css for refcnting instead of using its own. This makes cgroup refcnting use css's percpu refcnt and share the destruction mechanism. * css_release_work_fn() and css_free_work_fn() are updated to handle both csses and cgroups. This is a bit messy but should do until we can make cgroup->self a full css, which currently can't be done thanks to multiple hierarchies. * cgroup_destroy_locked() now performs percpu_ref_kill(&cgrp->self.refcnt) instead of cgroup_put(cgrp). * Negative refcnt sanity check in cgroup_get() is no longer necessary as percpu_ref already handles it. * Similarly, as a cgroup which hasn't been killed will never be released regardless of its refcnt value and percpu_ref has sanity check on kill, cgroup_is_dead() sanity check in cgroup_put() is no longer necessary. * As whether a refcnt reached zero or not can only be decided after the reference count is killed, cgroup_root->cgrp's refcnting can no longer be used to decide whether to kill the root or not. Let's make cgroup_kill_sb() explicitly initiate destruction if the root doesn't have any children. This makes sense anyway as unmounted cgroup hierarchy without any children should be destroyed. While this is a bit messy, this will allow pushing more bookkeeping towards cgroup->self and thus handling cgroups and csses in more uniform way. In the very long term, it should be possible to introduce a base subsystem and convert the self css to a proper one making things whole lot simpler and unified. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:02 -04:00
Tejun Heo	9395a45004	cgroup: enable refcnting for root csses Currently, css_get(), css_tryget() and css_tryget_online() are noops for root csses as an optimization; however, we're planning to use css refcnts to track of cgroup lifetime too and root cgroups also need to be reference counted. Since css has been converted to percpu_refcnt, the overhead of refcnting is miniscule and this optimization isn't too meaningful anymore. Furthermore, controllers which optimize the root cgroup often never even invoke these functions in their hot paths. This patch enables refcnting for root csses too. This makes CSS_ROOT flag unused and removes it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:02 -04:00
Tejun Heo	25e15d8350	cgroup: bounce css release through css->destroy_work css release is planned to do more and would require process context. Bounce it through css->destroy_work. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:02 -04:00
Tejun Heo	249f3468a2	cgroup: remove cgroup_destory_css_killed() cgroup_destroy_css_killed() is cgroup destruction stage which happens after all csses are offlined. After the recent updates, it no longer does anything other than putting the base reference. This patch removes the function and makes cgroup_destroy_locked() put the base ref at the end isntead. This also makes cgroup->nr_css unnecessary. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:01 -04:00
Tejun Heo	4e4e284723	cgroup: move cgroup->sibling unlinking to cgroup_put() Move cgroup->sibling unlinking from cgroup_destroy_css_killed() to cgroup_put(). This is later but still before the RCU grace period, so it doesn't break css_next_child() although there now is a larger window in which a dead cgroup is visible during css iteration. As css iteration always could have included offline csses, this doesn't affect correctness; however, it does make css_next_child() fall back to reiterting mode more often. This also makes cgroup_put() directly take cgroup_mutex, which limits where it can be called from. These are not immediately problematic and will be dealt with later. This change enables simplification of cgroup destruction path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:01 -04:00
Tejun Heo	9e4173e1f2	cgroup: move check_for_release(parent) call to the end of cgroup_destroy_locked() Currently, check_for_release() on the parent of a destroyed cgroup is invoked from cgroup_destroy_css_killed(). This is because this is where the destroyed cgroup can be removed from the parent's children list. check_for_release() tests the emptiness of the list directly, so invoking it before removing the cgroup from the list makes it think that the parent still has children even when it no longer does. This patch updates check_for_release() to use cgroup_has_live_children() instead of directly testing ->children emptiness and moves check_for_release(parent) earlier to the end of cgroup_destroy_locked(). As cgroup_has_live_children() ignores cgroups marked DEAD, check_for_release() functions correctly as long as it's called after asserting DEAD. This makes release notification slightly more timely and more importantly enables further simplification of cgroup destruction path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:01 -04:00
Tejun Heo	cbc125efad	cgroup: separate out cgroup_has_live_children() from cgroup_destroy_locked() We're expecting another user. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:01 -04:00
Tejun Heo	9d800df12d	cgroup: rename cgroup->dummy_css to ->self and move it to the top cgroup->dummy_css is used as the placeholder css when performing css oriended operations on the cgroup. We're gonna shift more cgroup management to this css. Let's rename it to ->self and move it to the top. This is pure rename and field relocation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:00 -04:00
Tejun Heo	a015edd26e	cgroup: use restart_syscall() for mount retries cgroup_mount() uses dumb delay-and-retry logic to wait for cgroup_root which is being destroyed. The retry currently loops inside cgroup_mount() proper. This patch makes it return with restart_syscall() instead so that retry travels out to userland boundary. This slightly simplifies the logic and more importantly makes the retry logic behave better when the wait for some reason becomes lengthy or infinite by allowing the operation to be suspended or terminated from userland. v2: The original patch forgot to free memory allocated for @opts. Fixed. Caught by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-14 09:15:00 -04:00
Tejun Heo	8353da1f91	cgroup: remove cgroup_tree_mutex cgroup_tree_mutex was introduced to work around the circular dependency between cgroup_mutex and kernfs active protection - some kernfs file and directory operations needed cgroup_mutex putting cgroup_mutex under active protection but cgroup also needs to be able to access cgroup hierarchies and cftypes to determine which kernfs_nodes need to be removed. cgroup_tree_mutex nested above both cgroup_mutex and kernfs active protection and used to protect the hierarchy and cftypes. While this worked, it added a lot of double lockings and was generally cumbersome. kernfs provides a mechanism to opt out of active protection and cgroup was already using it for removal and subtree_control. There's no reason to mix both methods of avoiding circular locking dependency and the preceding cgroup_kn_lock_live() changes applied it to all relevant cgroup kernfs operations making it unnecessary to nest cgroup_mutex under kernfs active protection. The previous patch reversed the original lock ordering and put cgroup_mutex above kernfs active protection. After these changes, all cgroup_tree_mutex usages are now accompanied by cgroup_mutex making the former completely redundant. This patch removes cgroup_tree_mutex and all its usages. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:23 -04:00
Tejun Heo	01f6474ce0	cgroup: nest kernfs active protection under cgroup_mutex After the recent cgroup_kn_lock_live() changes, cgroup_mutex is no longer nested below kernfs active protection. The two don't have any relationship now. This patch nests kernfs active protection under cgroup_mutex. All cftype operations now require both cgroup_tree_mutex and cgroup_mutex, temporary cgroup_mutex releases over kernfs operations are removed, and cgroup_add/rm_cftypes() grab both mutexes. This makes cgroup_tree_mutex redundant, which will be removed by the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:23 -04:00
Tejun Heo	e76ecaeef6	cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods Make __cgroup_procs_write() and cgroup_release_agent_write() use cgroup_kn_lock_live() and cgroup_kn_unlock() instead of cgroup_lock_live_group(). This puts the operations under both cgroup_tree_mutex and cgroup_mutex protection without circular dependency from kernfs active protection. Also, this means that cgroup_mutex is no longer nested below kernfs active protection. There is no longer any place where the two locks interact. This leaves cgroup_lock_live_group() without any user. Removed. This will help simplifying cgroup locking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:23 -04:00
Tejun Heo	a9746d8da7	cgroup: factor out cgroup_kn_lock_live() and cgroup_kn_unlock() cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write() share the logic to break active protection so that they can grab cgroup_tree_mutex which nests above active protection and/or remove self. Factor out this logic into cgroup_kn_lock_live() and cgroup_kn_unlock(). This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:22 -04:00
Tejun Heo	cfc79d5bec	cgroup: move cgroup->kn->priv clearing to cgroup_rmdir() The ->priv field of a cgroup directory kernfs_node points back to the cgroup. This field is RCU cleared in cgroup_destroy_locked() for non-kernfs accesses from css_tryget_from_dir() and cgroupstats_build(). As these are only applicable to cgroups which finished creation successfully and fully initialized cgroups are always removed by cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir(). This will help simplifying cgroup locking and shouldn't introduce any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:22 -04:00
Tejun Heo	ddab2b6e0e	cgroup: grab cgroup_mutex earlier in cgroup_subtree_control_write() Move cgroup_lock_live_group() invocation upwards to right below cgroup_tree_mutex in cgroup_subtree_control_write(). This is to help the planned locking simplification. This doesn't make any userland-visible behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:22 -04:00
Tejun Heo	b3bfd983ca	cgroup: collapse cgroup_create() into croup_mkdir() cgroup_mkdir() is the sole user of cgroup_create(). Let's collapse the latter into the former. This will help simplifying locking. While at it, remove now stale comment about inode locking. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:22 -04:00
Tejun Heo	ba0f4d7615	cgroup: reorganize cgroup_create() Reorganize cgroup_create() so that all paths share unlock out path. * All err_* labels are renamed to out_* as they're now shared by both success and failure paths. * @err renamed to @ret for the similar reason as above and so that it's more consistent with other functions. * cgroup memory allocation moved after locking so that freeing failed cgroup happens before unlocking. While this moves more code inside critical section, memory allocations inside cgroup locking are already pretty common and this is unlikely to make any noticeable difference. * While at it, replace a stray @parent->root dereference with @root. This reorganization will help simplifying locking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:19:22 -04:00
Tejun Heo	b7fc5ad235	cgroup: remove cgroup->control_kn Now that cgroup_subtree_control_write() has access to the associated kernfs_open_file and thus the kernfs_node, there's no need to cache it in cgroup->control_kn on creation. Remove cgroup->control_kn and use @of->kn directly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:16:22 -04:00
Tejun Heo	acbef755f4	cgroup: convert "tasks" and "cgroup.procs" handle to use cftype->write() cgroup_tasks_write() and cgroup_procs_write() are currently using cftype->write_u64(). This patch converts them to use cftype->write() instead. This allows access to the associated kernfs_open_file which will be necessary to implement the planned kernfs active protection manipulation for these files. This shifts buffer parsing to attach_task_by_pid() and makes it return @nbytes on success. Let's rename it to __cgroup_procs_write() to clearly indicate that this is a write handler implementation. This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:16:22 -04:00
Tejun Heo	6770c64e5c	cgroup: replace cftype->trigger() with cftype->write() cftype->trigger() is pointless. It's trivial to ignore the input buffer from a regular ->write() operation. Convert all ->trigger() users to ->write() and remove ->trigger(). This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz>	2014-05-13 12:16:21 -04:00
Tejun Heo	451af504df	cgroup: replace cftype->write_string() with cftype->write() Convert all cftype->write_string() users to the new cftype->write() which maps directly to kernfs write operation and has full access to kernfs and cgroup contexts. The conversions are mostly mechanical. * @css and @cft are accessed using of_css() and of_cft() accessors respectively instead of being specified as arguments. * Should return @nbytes on success instead of 0. * @buf is not trimmed automatically. Trim if necessary. Note that blkcg and netprio don't need this as the parsers already handle whitespaces. cftype->write_string() has no user left after the conversions and removed. While at it, remove unnecessary local variable @p in cgroup_subtree_control_write() and stale comment about CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c. This patch doesn't introduce any visible behavior changes. v2: netprio was missing from conversion. Converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net>	2014-05-13 12:16:21 -04:00
Tejun Heo	b41686401e	cgroup: implement cftype->write() During the recent conversion to kernfs, cftype's seq_file operations are updated so that they are directly mapped to kernfs operations and thus can fully access the associated kernfs and cgroup contexts; however, write path hasn't seen similar updates and none of the existing write operations has access to, for example, the associated kernfs_open_file. Let's introduce a new operation cftype->write() which maps directly to the kernfs write operation and has access to all the arguments and contexts. This will replace ->write_string() and ->trigger() and ease manipulation of kernfs active protection from cgroup file operations. Two accessors - of_cft() and of_css() - are introduced to enable accessing the associated cgroup context from cftype->write() which only takes kernfs_open_file for the context information. The accessors for seq_file operations - seq_cft() and seq_css() - are rewritten to wrap the of_ accessors. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:16:21 -04:00
Tejun Heo	ec903c0c85	cgroup: rename css_tryget() to css_tryget_online() Unlike the more usual refcnting, what css_tryget() provides is the distinction between online and offline csses instead of protection against upping a refcnt which already reached zero. cgroup is planning to provide actual tryget which fails if the refcnt already reached zero. Let's rename the existing trygets so that they clearly indicate that they're onliness. I thought about keeping the existing names as-are and introducing new names for the planned actual tryget; however, given that each controller participates in the synchronization of the online state, it seems worthwhile to make it explicit that these functions are about on/offline state. Rename css_tryget() to css_tryget_online() and css_tryget_from_dir() to css_tryget_online_from_dir(). This is pure rename. v2: cgroup_freezer grew new usages of css_tryget(). Update accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>	2014-05-13 12:11:01 -04:00
Tejun Heo	46cfeb043b	cgroup: use release_agent_path_lock in cgroup_release_agent_show() release_path is now protected by release_agent_path_lock to allow accessing it without grabbing cgroup_mutex; however, cgroup_release_agent_show() was still grabbing cgroup_mutex. Let's convert it to release_agent_path_lock so that we don't have to worry about this one for the planned locking updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:11:00 -04:00
Tejun Heo	7d331fa985	cgroup: use restart_syscall() for retries after offline waits in cgroup_subtree_control_write() After waiting for a child to finish offline, cgroup_subtree_control_write() jumps up to retry from after the input parsing and active protection breaking. This retry makes the scheduled locking update - removal of cgroup_tree_mutex - more difficult. Let's simplify it by returning with restart_syscall() for retries. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:11:00 -04:00
Tejun Heo	d37167ab7b	cgroup: update and fix parsing of "cgroup.subtree_control" I was confused that strsep() was equivalent to strtok_r() in skipping over consecutive delimiters. strsep() just splits at the first occurrence of one of the delimiters which makes the parsing very inflexible, which makes allowing multiple whitespace chars as delimters kinda moot. Let's just be consistently strict and require list of tokens separated by spaces. This is what Documentation/cgroups/unified-hierarchy.txt describes too. Also, parsing may access beyond the end of the string if the string ends with spaces or is zero-length. Make sure it skips zero-length tokens. Note that this also ensures that the parser doesn't puke on multiple consecutive spaces. v2: Add zero-length token skipping. v3: Added missing space after "==". Spotted by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:10:59 -04:00
Tejun Heo	0ab7a60dea	cgroup: css_release() shouldn't clear cgroup->subsys[] `c1a71504e9` ("cgroup: don't recycle cgroup id until all csses' have been destroyed") made cgroup ID persist until a cgroup is released and add cgroup->subsys[] clearing to css_release() so that css_from_id() doesn't return a css which has already been released which happens before cgroup release; however, the right change here was updating offline_css() to clear cgroup->subsys[] which was done by `e329780310` ("cgroup: cgroup->subsys[] should be cleared after the css is offlined") instead of clearing it from css_release(). We're now clearing cgroup->subsys[] twice. This is okay for traditional hierarchies as a css's lifetime is the same as its cgroup's; however, this confuses unified hierarchy and turning on and off a controller repeatedly using "cgroup.subtree_control" can lead to an oops like the following which happens because cgroup->subsys[] is incorrectly cleared asynchronously by css_release(). BUG: unable to handle kernel NULL pointer dereference at 00000000000000 08 IP: [<ffffffff81130c11>] kill_css+0x21/0x1c0 PGD 1170d067 PUD f0ab067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000 RIP: 0010:[<ffffffff81130c11>] [<ffffffff81130c11>] kill_css+0x21/0x1c0 RSP: 0018:ffff88000e199dc8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98 RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000 R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001 R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00 FS: 00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0 Stack: ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80 ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48 ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002 Call Trace: [<ffffffff8113537f>] cgroup_subtree_control_write+0x85f/0xa00 [<ffffffff8112fd18>] cgroup_file_write+0x38/0x1d0 [<ffffffff8126fc97>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f2ae6>] vfs_write+0xb6/0x1c0 [<ffffffff811f35ad>] SyS_write+0x4d/0xc0 [<ffffffff81d0acd2>] system_call_fastpath+0x16/0x1b Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 <48> 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff RIP [<ffffffff81130c11>] kill_css+0x21/0x1c0 RSP <ffff88000e199dc8> CR2: 0000000000000008 ---[ end trace e7aae1f877c4e1b4 ]--- Remove the unnecessary cgroup->subsys[] clearing from css_release(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:10:59 -04:00
Tejun Heo	54504e977c	cgroup: cgroup_idr_lock should be bh cgroup_idr_remove() can be invoked from bh leading to lockdep detecting possible AA deadlock (IN_BH/ON_BH). Make the lock bh-safe. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:10:59 -04:00
Tejun Heo	0cee8b7786	cgroup: fix offlining child waiting in cgroup_subtree_control_write() cgroup_subtree_control_write() waits for offline to complete child-by-child before enabling a controller; however, it has a couple bugs. * It doesn't initialize the wait_queue_t. This can lead to infinite hang on the following schedule() among other things. * It forgets to pin the child before releasing cgroup_tree_mutex and performing schedule(). The child may already be gone by the time it wakes up and invokes finish_wait(). Pin the child being waited on. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 12:10:59 -04:00
Tejun Heo	f21a4f7594	Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.16 Pull to receive `e37a06f109` ("cgroup: fix the retry path of cgroup_mount()") to avoid unnecessary conflicts with planned cgroup_tree_mutex removal and also to be able to remove the temp fix added by `36c38fb714` ("blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()") afterwards. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-13 11:30:04 -04:00
Tejun Heo	5024ae29cd	cgroup: introduce task_css_is_root() Determining the css of a task usually requires RCU read lock as that's the only thing which keeps the returned css accessible till its reference is acquired; however, testing whether a task belongs to the root can be performed without dereferencing the returned css by comparing the returned pointer against the root one in init_css_set[] which never changes. Implement task_css_is_root() which can be invoked in any context. This will be used by the scheduled cgroup_freezer change. v2: cgroup no longer supports modular controllers. No need to export init_css_set. Pointed out by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-13 11:26:27 -04:00
Fabian Frederick	60106946ca	kernel/cgroup.c: fix 2 kernel-doc warnings Fix typo and variable name. tj: Updated @cgrp argument description in cgroup_destroy_css_killed() Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-05-05 14:33:04 -04:00
Tejun Heo	15a4c835e4	cgroup, memcg: implement css->id and convert css_from_id() to use it Until now, cgroup->id has been used to identify all the associated csses and css_from_id() takes cgroup ID and returns the matching css by looking up the cgroup and then dereferencing the css associated with it; however, now that the lifetimes of cgroup and css are separate, this is incorrect and breaks on the unified hierarchy when a controller is disabled and enabled back again before the previous instance is released. This patch adds css->id which is a subsystem-unique ID and converts css_from_id() to look up by the new css->id instead. memcg is the only user of css_from_id() and also converted to use css->id instead. For traditional hierarchies, this shouldn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jianyu Zhan <nasa4836@gmail.com> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:14 -04:00
Tejun Heo	ddfcadab35	cgroup: update init_css() into init_and_link_css() init_css() takes the cgroup the new css belongs to as an argument and initializes the new css's ->cgroup and ->parent pointers but doesn't acquire the matching reference counts. After the previous patch, create_css() puts init_css() and reference acquisition right next to each other. Let's move reference acquistion into init_css() and rename the function to init_and_link_css(). This makes sense and is easier to follow. This makes the root csses to hold a reference on cgrp_dfl_root.cgrp, which is harmless. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:14 -04:00
Tejun Heo	a2bed8209a	cgroup: use RCU free in create_css() failure path Currently, when create_css() fails in the middle, the half-initialized css is freed by invoking cgroup_subsys->css_free() directly. This patch updates the function so that it invokes RCU free path instead. As the RCU free path puts the parent css and owning cgroup, their references are now acquired right after a new css is successfully allocated. This doesn't make any visible difference now but is to enable implementing css->id and RCU protected lookup by such IDs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:14 -04:00
Tejun Heo	6fa4918d03	cgroup: protect cgroup_root->cgroup_idr with a spinlock Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which ends up requiring cgroup_put() to be invoked under sleepable context. This is okay for now but is an unusual requirement and we'll soon add css->id which will have the same problem but won't be able to simply grab cgroup_mutex as removal will have to happen from css_release() which can't sleep. Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers which protects the idr operations with the lock and use them for cgroup_root->cgroup_idr. cgroup_put() no longer needs to grab cgroup_mutex and css_from_id() is updated to always require RCU read lock instead of either RCU read lock or cgroup_mutex, which doesn't affect the existing users. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:13 -04:00
Tejun Heo	7d699ddb2b	cgroup, memcg: allocate cgroup ID from 1 Currently, cgroup->id is allocated from 0, which is always assigned to the root cgroup; unfortunately, memcg wants to use ID 0 to indicate invalid IDs and ends up incrementing all IDs by one. It's reasonable to reserve 0 for special purposes. This patch updates cgroup core so that ID 0 is not used and the root cgroups get ID 1. The ID incrementing is removed form memcg. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:13 -04:00
Tejun Heo	69dfa00ccb	cgroup: make flags and subsys_masks unsigned int There's no reason to use atomic bitops for cgroup_subsys_state->flags, cgroup_root->flags and various subsys_masks. This patch updates those to use bitwise and/or operations instead and converts them form unsigned long to unsigned int. This makes the fields occupy (marginally) smaller space and makes it clear that they don't require atomicity. This patch doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-05-04 15:09:13 -04:00
Joe Perches	ed3d261b53	cgroup: Use more current logging style Use pr_fmt and remove embedded prefixes. Realign modified multi-line statements to open parenthesis. Convert embedded function name to "%s: ", __func__ Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-25 18:28:03 -04:00
Jianyu Zhan	a2a1f9eaf9	cgroup: replace pr_warning with preferred pr_warn As suggested by scripts/checkpatch.pl, substitude all pr_warning() with pr_warn(). No functional change. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-25 18:28:03 -04:00
Jianyu Zhan	f8719ccf7b	cgroup: remove orphaned cgroup_pidlist_seq_operations `6612f05b88` ("cgroup: unify pidlist and other file handling") has removed the only user of cgroup_pidlist_seq_operations : cgroup_pidlist_open(). This patch removes it. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-25 18:28:03 -04:00
Jianyu Zhan	2f0edc04e7	cgroup: clean up obsolete comment for parse_cgroupfs_options() `1d5be6b287` ("cgroup: move module ref handling into rebind_subsystems()") makes parse_cgroupfs_options() no longer takes refcounts on subsystems. And unified hierachy makes parse_cgroupfs_options not need to call with cgroup_mutex held to protect the cgroup_subsys[]. So this patch removes BUG_ON() and the comment. As the comment doesn't contain useful information afterwards, the whole comment is removed. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-25 18:28:03 -04:00
Tejun Heo	842b597ee0	cgroup: implement cgroup.populated for the default hierarchy cgroup users often need a way to determine when a cgroup's subhierarchy becomes empty so that it can be cleaned up. cgroup currently provides release_agent for it; unfortunately, this mechanism is riddled with issues. * It delivers events by forking and execing a userland binary specified as the release_agent. This is a long deprecated method of notification delivery. It's extremely heavy, slow and cumbersome to integrate with larger infrastructure. * There is single monitoring point at the root. There's no way to delegate management of a subtree. * The event isn't recursive. It triggers when a cgroup doesn't have any tasks or child cgroups. Events for internal nodes trigger only after all children are removed. This again makes it impossible to delegate management of a subtree. * Events are filtered from the kernel side. "notify_on_release" file is used to subscribe to or suppress release event. This is unnecessarily complicated and probably done this way because event delivery itself was expensive. This patch implements interface file "cgroup.populated" which can be used to monitor whether the cgroup's subhierarchy has tasks in it or not. Its value is 0 if there is no task in the cgroup and its descendants; otherwise, 1, and kernfs_notify() notificaiton is triggers when the value changes, which can be monitored through poll and [di]notify. This is a lot ligther and simpler and trivially allows delegating management of subhierarchy - subhierarchy monitoring can block further propgation simply by putting itself or another process in the root of the subhierarchy and monitor events that it's interested in from there without interfering with monitoring higher in the tree. v2: Patch description updated as per Serge. v3: "cgroup.subtree_populated" renamed to "cgroup.populated". The subtree_ prefix was a bit confusing because "cgroup.subtree_control" uses it to denote the tree rooted at the cgroup sans the cgroup itself while the populated state includes the cgroup itself. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Lennart Poettering <lennart@poettering.net>	2014-04-25 18:28:02 -04:00
Tejun Heo	f8f22e53a2	cgroup: implement dynamic subtree controller enable/disable on the default hierarchy cgroup is switching away from multiple hierarchies and will use one unified default hierarchy where controllers can be dynamically enabled and disabled per subtree. The default hierarchy will serve as the unified hierarchy to which all controllers are attached and a css on the default hierarchy would need to also serve the tasks of descendant cgroups which don't have the controller enabled - ie. the tree may be collapsed from leaf towards root when viewed from specific controllers. This has been implemented through effective css in the previous patches. This patch finally implements dynamic subtree controller enable/disable on the default hierarchy via a new knob - "cgroup.subtree_control" which controls which controllers are enabled on the child cgroups. Let's assume a hierarchy like the following. root - A - B - C \ D root's "cgroup.subtree_control" determines which controllers are enabled on A. A's on B. B's on C and D. This coincides with the fact that controllers on the immediate sub-level are used to distribute the resources of the parent. In fact, it's natural to assume that resource control knobs of a child belong to its parent. Enabling a controller in "cgroup.subtree_control" declares that distribution of the respective resources of the cgroup will be controlled. Note that this means that controller enable states are shared among siblings. The default hierarchy has an extra restriction - only cgroups which don't contain any task may have controllers enabled in "cgroup.subtree_control". Combined with the other properties of the default hierarchy, this guarantees that, from the view point of controllers, tasks are only on the leaf cgroups. In other words, only leaf csses may contain tasks. This rules out situations where child cgroups compete against internal tasks of the parent, which is a competition between two different types of entities without any clear way to determine resource distribution between the two. Different controllers handle it differently and all the implemented behaviors are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this structural constraints imposed from cgroup core removes the burden from controller implementations and enables showing one consistent behavior across all controllers. When a controller is enabled or disabled, css associations for the controller in the subtrees of each child should be updated. After enabling, the whole subtree of a child should point to the new css of the child. After disabling, the whole subtree of a child should point to the cgroup's css. This is implemented by first updating cgroup states such that cgroup_e_css() result points to the appropriate css and then invoking cgroup_update_dfl_csses() which migrates all tasks in the affected subtrees to the self cgroup on the default hierarchy. * When read, "cgroup.subtree_control" lists all the currently enabled controllers on the children of the cgroup. * White-space separated list of controller names prefixed with either '+' or '-' can be written to "cgroup.subtree_control". The ones prefixed with '+' are enabled on the controller and '-' disabled. * A controller can be enabled iff the parent's "cgroup.subtree_control" enables it and disabled iff no child's "cgroup.subtree_control" has it enabled. * If a cgroup has tasks, no controller can be enabled via "cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has some controllers enabled, tasks can't be migrated into the cgroup. * All controllers which aren't bound on other hierarchies are automatically associated with the root cgroup of the default hierarchy. All the controllers which are bound to the default hierarchy are listed in the read-only file "cgroup.controllers" in the root directory. * "cgroup.controllers" in all non-root cgroups is read-only file whose content is equal to that of "cgroup.subtree_control" of the parent. This indicates which controllers can be used in the cgroup's "cgroup.subtree_control". This is still experimental and there are some holes, one of which is that ->can_attach() failure during cgroup_update_dfl_csses() may leave the cgroups in an undefined state. The issues will be addressed by future patches. v2: Non-root cgroups now also have "cgroup.controllers". Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:16 -04:00
Tejun Heo	f817de9851	cgroup: prepare migration path for unified hierarchy Unified hierarchy implementation would require re-migrating tasks onto the same cgroup on the default hierarchy to reflect updated effective csses. Update cgroup_migrate_prepare_dst() so that it accepts NULL as the destination cgrp. When NULL is specified, the destination is considered to be the cgroup on the default hierarchy associated with each css_set. After this change, the identity check in cgroup_migrate_add_src() isn't sufficient for noop detection as the associated csses may change without any cgroup association changing. The only way to tell whether a migration is noop or not is testing whether the source and destination csets are identical. The noop check in cgroup_migrate_add_src() is removed and cset identity test is added to cgroup_migreate_prepare_dst(). If it's detected that source and destination csets are identical, the cset is removed removed from @preloaded_csets and all the migration nodes are cleared which makes cgroup_migrate() ignore the cset. Also, make the function append the destination css_sets to @preloaded_list so that destination css_sets always come after source css_sets. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:16 -04:00
Tejun Heo	7fd8c565d8	cgroup: update subsystem rebind restrictions Because the default root couldn't have any non-root csses attached to it, rebinding away from it was always allowed; however, the default hierarchy will soon host the unified hierarchy and have non-root csses so the rebind restrictions need to be updated accordingly. Instead of special casing rebinding from the default hierarchy and then checking whether the source hierarchy has children cgroups, which implies non-root csses for !dfl hierarchies, simply check whether the source hierarchy has non-root csses for the subsystem using css_next_child(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:16 -04:00
Tejun Heo	6803c00628	cgroup: add css_set->dfl_cgrp To implement the unified hierarchy behavior, we'll need to be able to determine the associated cgroup on the default hierarchy from css_set. Let's add css_set->dfl_cgrp so that it can be accessed conveniently and efficiently. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:16 -04:00
Tejun Heo	bd53d617b3	cgroup: allow cgroup creation and suppress automatic css creation in the unified hierarchy Now that effective css handling has been added and iterators updated accordingly, it's safe to allow cgroup creation in the default hierarchy. Unblock cgroup creation in the default hierarchy. As the default hierarchy will implement explicit enabling and disabling of controllers on each cgroup, suppress automatic css enabling on cgroup creation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:16 -04:00
Tejun Heo	e329780310	cgroup: cgroup->subsys[] should be cleared after the css is offlined After a css finishes offlining, offline_css() mistakenly performs RCU_INIT_POINTER(css->cgroup->subsys[ss->id], css) which just sets the cgroup->subsys[] pointer to the current value. The intention was to clear it after offline is complete, not reassign the same value. Update it to assign NULL instead of the current value. This makes cgroup_css() to return NULL once offline is complete. All the existing users of the function either can handle NULL return already or guarantee that the css doesn't get offlined. While this is a bugfix, as css lifetime is currently tied to the cgroup it belongs to, this bug doesn't cause any actual problems. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:15 -04:00
Tejun Heo	3ebb2b6ef3	cgroup: teach css_task_iter about effective csses Currently, css_task_iter iterates tasks associated with a css by visiting each css_set associated with the owning cgroup and walking tasks of each of them. This works fine for !unified hierarchies as each cgroup has its own css for each associated subsystem on the hierarchy; however, on the planned unified hierarchy, a cgroup may not have csses associated and its tasks would be considered associated with the matching css of the nearest ancestor which has the subsystem enabled. This means that on the default unified hierarchy, just walking all tasks associated with a cgroup isn't enough to walk all tasks which are associated with the specified css. If any of its children doesn't have the matching css enabled, task iteration should also include all tasks from the subtree. We already added cgroup->e_csets[] to list all css_sets effectively associated with a given css and walk css_sets on that list instead to achieve such iteration. This patch updates css_task_iter iteration such that it walks css_sets on cgroup->e_csets[] instead of cgroup->cset_links if iteration is requested on an non-dummy css. Thanks to the previous iteration update, this change can be achieved with the addition of css_task_iter->ss and minimal updates to css_advance_task_iter() and css_task_iter_start(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:15 -04:00
Tejun Heo	0f0a2b4fa6	cgroup: reorganize css_task_iter This patch reorganizes css_task_iter so that adding effective css support is easier. * s/->cset_link/->cset_pos/ and s/->task/->task_pos/ for consistency * ->origin_css is used to determine whether the iteration reached the last css_set. Replace it with explicit ->cset_head so that css_advance_task_iter() doesn't have to know the termination condition directly. * css_task_iter_next() currently assumes that it's walking list of cgrp_cset_link and reaches into the current cset through the current link to determine the termination conditions for task walking. As this won't always be true for effective css walking, add ->tasks_head and ->mg_tasks_head and use them to control task walking so that css_task_iter_next() doesn't have to know how css_sets are being walked. This patch doesn't make any behavior changes. The iteration logic stays unchanged after the patch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:15 -04:00
Tejun Heo	3b281afbc3	cgroup: make css_next_child() skip missing csses css_next_child() walks the children of the specified css. It does this by finding the next cgroup and then returning the requested css. On the default unified hierarchy, a cgroup may not have a css associated with it even if the hierarchy has the subsystem enabled. This patch updates css_next_child() so that it skips children without the requested css associated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:15 -04:00
Tejun Heo	2d8f243a5e	cgroup: implement cgroup->e_csets[] On the default unified hierarchy, a cgroup may be associated with csses of its ancestors, which means that a css of a given cgroup may be associated with css_sets of descendant cgroups. This means that we can't walk all tasks associated with a css by iterating the css_sets associated with the cgroup as there are css_sets which are pointing to the css but linked on the descendants. This patch adds per-subsystem list heads cgroup->e_csets[]. Any css_set which is pointing to a css is linked to css->cgroup->e_csets[$SUBSYS_ID] through css_set->e_cset_node[$SUBSYS_ID]. The lists are protected by css_set_rwsem and will allow us to walk all css_sets associated with a given css so that we can find out all associated tasks. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:15 -04:00
Tejun Heo	aec3dfcb2e	cgroup: introduce effective cgroup_subsys_state In the planned default unified hierarchy, controllers may get dynamically attached to and detached from a cgroup and a cgroup may not have csses for all the controllers associated with the hierarchy. When a cgroup doesn't have its own css for a given controller, the css of the nearest ancestor with the controller enabled will be used, which is called the effective css. This patch introduces cgroup_e_css() and for_each_e_css() to access the effective csses and convert compare_css_sets(), find_existing_css_set() and cgroup_migrate() to use the effective csses so that they can handle cgroups with partial csses correctly. This means that for two css_sets to be considered identical, they should have both matching csses and cgroups. compare_css_sets() already compares both, not for correctness but for optimization. As this now becomes a matter of correctness, update the comments accordingly. For all !default hierarchies, cgroup_e_css() always equals cgroup_css(), so this patch doesn't change behavior. While at it, fix incorrect locking comment for for_each_css(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:14 -04:00
Tejun Heo	f392e51cd6	cgroup: update cgroup->subsys_mask to ->child_subsys_mask and restore cgroup_root->subsys_mask `944196278d` ("cgroup: move ->subsys_mask from cgroupfs_root to cgroup") moved ->subsys_mask from cgroup_root to cgroup to prepare for the unified hierarhcy; however, it turns out that carrying the subsys_mask of the children in the parent, instead of itself, is a lot more natural. This patch restores cgroup_root->subsys_mask and morphs cgroup->subsys_mask into cgroup->child_subsys_mask. * Uses of root->cgrp.subsys_mask are restored to root->subsys_mask. * Remove automatic setting and clearing of cgrp->subsys_mask and instead just inherit ->child_subsys_mask from the parent during cgroup creation. Note that this doesn't affect any current behaviors. * Undo __kill_css() separation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:14 -04:00
Tejun Heo	ea8fd3b47f	cgroup: cgroup_apply_cftypes() shouldn't skip the default hierarhcy cgroup_apply_cftypes() skip creating or removing files if the subsystem is attached to the default hierarchy, which led to missing files in the root of the default hierarchy. Skipping made sense when the default hierarchy was dummy; however, now that the default hierarchy is full functional and planned to be used as the unified hierarchy, it shouldn't be skipped over. Reported-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-04-23 11:13:14 -04:00
Li Zefan	e37a06f109	cgroup: fix the retry path of cgroup_mount() If we hit the retry path, we'll call parse_cgroupfs_options() again, but the string we pass to it has been modified by the previous call to this function. This bug can be observed by: # mount -t cgroup -o name=foo,cpuset xxx /mnt && umount /mnt && \ mount -t cgroup -o name=foo,cpuset xxx /mnt mount: wrong fs type, bad option, bad superblock on xxx, missing codepage or helper program, or other error ... The second mount passed "name=foo,cpuset" to the parser, and then it hit the retry path and call the parser again, but this time the string passed to the parser is "name=foo". To fix this, we avoid calling parse_cgroupfs_options() again in this case. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-17 11:18:06 -04:00
Tejun Heo	49957f8e2a	cgroup: newly created dirs and files should be owned by the creator While converting cgroup to kernfs, `2bd59d48eb` ("cgroup: convert to kernfs") accidentally dropped the logic which makes newly created cgroup dirs and files owned by the current uid / gid. This broke cases where cgroup subtree management is delegated to !root as the sub manager wouldn't be able to create more than single level of hierarchy or put tasks into child cgroups it created. Among other things, this breaks user session management in systemd and one of the symptoms was 90s hang during shutdown. User session systemd running as the user creates a sub-service to initiate shutdown and tries to put kill(1) into it but fails because cgroup.procs is owned by root. This leads to 90s hang during shutdown. Implement cgroup_kn_set_ugid() which sets a kn's uid and gid to those of the caller and use it from file and dir creation paths. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-04-07 16:44:47 -04:00
Li Zefan	c6b3d5bcd6	cgroup: fix top cgroup refcnt leak As mount() and kill_sb() is not a one-to-one match, If we mount the same cgroupfs in serveral mount points, and then umount all of them, kill_sb() will be called only once. Try: # mount -t cgroup -o cpuacct xxx /cgroup # mount -t cgroup -o cpuacct xxx /cgroup2 # cat /proc/cgroups \| grep cpuacct cpuacct 2 1 1 # umount /cgroup # umount /cgroup2 # cat /proc/cgroups \| grep cpuacct cpuacct 2 1 1 You'll see cgroupfs will never be freed. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-04-04 08:22:27 -04:00
Linus Torvalds	32d01dc7be	Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot updates for cgroup: - The biggest one is cgroup's conversion to kernfs. cgroup took after the long abandoned vfs-entangled sysfs implementation and made it even more convoluted over time. cgroup's internal objects were fused with vfs objects which also brought in vfs locking and object lifetime rules. Naturally, there are places where vfs rules don't fit and nasty hacks, such as credential switching or lock dance interleaving inode mutex and cgroup_mutex with object serial number comparison thrown in to decide whether the operation is actually necessary, needed to be employed. After conversion to kernfs, internal object lifetime and locking rules are mostly isolated from vfs interactions allowing shedding of several nasty hacks and overall simplification. This will also allow implmentation of operations which may affect multiple cgroups which weren't possible before as it would have required nesting i_mutexes. - Various simplifications including dropping of module support, easier cgroup name/path handling, simplified cgroup file type handling and task_cg_lists optimization. - Prepatory changes for the planned unified hierarchy, which is still a patchset away from being actually operational. The dummy hierarchy is updated to serve as the default unified hierarchy. Controllers which aren't claimed by other hierarchies are associated with it, which BTW was what the dummy hierarchy was for anyway. - Various fixes from Li and others. This pull request includes some patches to add missing slab.h to various subsystems. This was triggered xattr.h include removal from cgroup.h. cgroup.h indirectly got included a lot of files which brought in xattr.h which brought in slab.h. There are several merge commits - one to pull in kernfs updates necessary for converting cgroup (already in upstream through driver-core), others for interfering changes in the fixes branch" * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits) cgroup: remove useless argument from cgroup_exit() cgroup: fix spurious lockdep warning in cgroup_exit() cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c cgroup: break kernfs active_ref protection in cgroup directory operations cgroup: fix cgroup_taskset walking order cgroup: implement CFTYPE_ONLY_ON_DFL cgroup: make cgrp_dfl_root mountable cgroup: drop const from @buffer of cftype->write_string() cgroup: rename cgroup_dummy_root and related names cgroup: move ->subsys_mask from cgroupfs_root to cgroup cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding cgroup: remove NULL checks from [pr_cont_]cgroup_{name\|path}() cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root cgroup: reorganize cgroup bootstrapping cgroup: relocate setting of CGRP_DEAD cpuset: use rcu_read_lock() to protect task_cs() cgroup_freezer: document freezer_fork() subtleties cgroup: update cgroup_transfer_tasks() to either succeed or fail cgroup: drop task_lock() protection around task->cgroups cgroup: update how a newly forked task gets associated with css_set ...	2014-04-03 13:05:42 -07:00
Li Zefan	1ec41830e0	cgroup: remove useless argument from cgroup_exit() Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-03-29 09:15:54 -04:00
Li Zefan	e8604cb436	cgroup: fix spurious lockdep warning in cgroup_exit() cgroup_exit() is called in fork and exit path. If it's called in the failure path during fork, PF_EXITING isn't set, and then lockdep will complain. Fix this by removing cgroup_exit() in that failure path. cgroup_fork() does nothing that needs cleanup. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-03-29 09:15:53 -04:00
Monam Agarwal	01a9714061	cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL) The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. And in the case of the NULL pointer, there is no structure to initialize. So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL) Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-03-24 08:48:02 -04:00
Tejun Heo	e1b2dc176f	cgroup: break kernfs active_ref protection in cgroup directory operations cgroup_tree_mutex should nest above the kernfs active_ref protection; however, cgroup_create() and cgroup_rename() were grabbing cgroup_tree_mutex while under kernfs active_ref protection. This has actualy possibility to lead to deadlocks in case these operations race against cgroup_rmdir() which invokes kernfs_remove() on directory kernfs_node while holding cgroup_tree_mutex. Neither cgroup_create() or cgroup_rename() requires active_ref protection. The former already has enough synchronization through cgroup_lock_live_group() and the latter doesn't care, so this can be fixed by updating both functions to break all active_ref protections before grabbing cgroup_tree_mutex. While this patch fixes the immediate issue, it probably needs further work in the long term - kernfs directories should enable lockdep annotations and maybe the better way to handle this is marking directory nodes as not needing active_ref protection rather than breaking it in each operation. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-03-20 11:10:15 -04:00
Tejun Heo	1b9aba49ea	cgroup: fix cgroup_taskset walking order cgroup_taskset is used to track and iterate target tasks while migrating a task or process and should guarantee that the first task iterated is the task group leader if a process is being migrated. `b3dc094e93` ("cgroup: use css_set->mg_tasks to track target tasks during migration") replaced flex array cgroup_taskset->tc_array with css_set->mg_tasks list to remove process size limit and dynamic allocation during migration; unfortunately, it incorrectly used list operations which don't preserve order breaking the guarantee that cgroup_taskset_first() returns the leader for a process target. Fix it by using order preserving list operations. Note that as multiple src_csets may map to a single dst_cset, the iteration order may change across cgroup_task_migrate(); however, the leader is still guaranteed to be the first entry. The switch to list_splice_tail_init() at the end of cgroup_migrate() isn't strictly necessary. Let's still do it for consistency. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-03-19 17:43:21 -04:00
Tejun Heo	8cbbf2c972	cgroup: implement CFTYPE_ONLY_ON_DFL This cftype flag makes the file only appear on the default hierarchy. This will later be used for cgroup.controllers file. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:55 -04:00
Tejun Heo	a2dd424750	cgroup: make cgrp_dfl_root mountable cgrp_dfl_root will be used as the default unified hierarchy. This patch makes cgrp_dfl_root mountable by making the following changes. * cgroup_init_early() now initializes cgrp_dfl_root w/ CGRP_ROOT_SANE_BEHAVIOR. The default hierarchy is always sane. * parse_cgroupfs_options() and cgroup_mount() are updated such that cgrp_dfl_root is mounted if sane_behavior is specified w/o any subsystems. * rebind_subsystems() now populates the root directory of cgrp_dfl_root. Note that the function still guarantees success of rebinding subsystems to cgrp_dfl_root. If populating fails while rebinding to cgrp_dfl_root, it whines but ignores the error. * For backward compatibility, the default hierarchy shows up in /proc/$PID/cgroup only after it's explicitly mounted so that userland which doesn't make use of it doesn't see any change. * "current_css_set_cg_links" file of debug cgroup now treats the default hierarchy the same as other hierarchies. This is visible to userland. Given that it's for debug controller, this should be fine. * While at it, implement cgroup_on_dfl() which tests whether a give cgroup is on the default hierarchy or not. The above changes make cgrp_dfl_root mostly equivalent to other controllers but the actual unified hierarchy behaviors are not implemented yet. Let's plug child cgroup creation in cgrp_dfl_root from create_cgroup() for now. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:55 -04:00
Tejun Heo	4d3bb511b5	cgroup: drop const from @buffer of cftype->write_string() cftype->write_string() just passes on the writeable buffer from kernfs and there's no reason to add const restriction on the buffer. The only thing const achieves is unnecessarily complicating parsing of the buffer. Drop const from @buffer. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Daniel Borkmann <dborkman@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2014-03-19 10:23:54 -04:00
Tejun Heo	3dd06ffa9d	cgroup: rename cgroup_dummy_root and related names The dummy root will be repurposed to serve as the default unified hierarchy. Let's rename things in preparation. * s/cgroup_dummy_root/cgrp_dfl_root/ * s/cgroupfs_root/cgroup_root/ as we don't do fs part directly anymore * s/cgroup_root->top_cgroup/cgroup_root->cgrp/ for brevity This is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:54 -04:00
Tejun Heo	944196278d	cgroup: move ->subsys_mask from cgroupfs_root to cgroup cgroupfs_root->subsys_mask represents the controllers attached to the hierarchy. This patch moves the field to cgroup. Subsystem initialization and rebinding updates the top cgroup's subsys_mask. For !root cgroups, the subsys_mask bits are set from create_css() and cleared from kill_css(), which effectively means that all cgroups will have the same subsys_mask as the top cgroup. While this doesn't make any difference now, this will help implementation of the default unified hierarchy where !root cgroups may have subsets of the top_cgroup's subsys_mask. While at it, __kill_css() is split out of kill_css(). The former doesn't care about the subsys_mask while the latter becomes noop if the controller is already killed and clears the matching bit if not before proceeding to killing the css. This will be used later by the default unified hierarchy implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:54 -04:00
Tejun Heo	5df3603229	cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding Currently, while rebinding, cgroup_dummy_root serves as the anchor point. In addition to the target root, rebind_subsystems() takes @added_mask and @removed_mask. The subsystems specified in the former are expected to be on the dummy root and then moved to the target root. The ones in the latter are moved from non-dummy root to dummy. Now that the dummy root is a fully functional one and we're planning to use it for the default unified hierarchy, this level of distinction between dummy and non-dummy roots is quite awkward. This patch updates rebind_subsystems() to take the target root and one subsystem mask and move the specified subsystmes to the target root which may or may not be the dummy root. IOW, unbinding now becomes moving the subsystems to the dummy root and binding to non-dummy root. This makes the dummy root mostly equivalent to other hierarchies in terms of the mechanism of moving subsystems around; however, we still retain all the semantical restrictions so that this patch doesn't introduce any visible behavior differences. Another noteworthy detail is that rebind_subsystems() guarantees that moving a subsystem to the dummy root never fails so that valid unmounting attempts always succeed. This unifies binding and unbinding of subsystems. The invocation points of ->bind() were inconsistent between the two and now moved after whole rebinding is complete. This doesn't break the current users and generally makes more sense. All rebind_subsystems() users are converted accordingly. Note that cgroup_remount() now makes two calls to rebind_subsystems() to bind and then unbind the requested subsystems. This will allow repurposing of the dummy hierarchy as the default unified hierarchy and shouldn't make any userland visible behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:54 -04:00
Tejun Heo	985ed67014	cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root cgroup_dummy_root is used to host controllers which aren't attached to any other hierarchy. The root is minimally set up during kernfs bootstrap and didn't go through full hierarchy initialization. We're planning to use cgroup_dummy_root for the default unified hierarchy and thus want it to be fully functional. Replace the special initialization, which was collected into cgroup_init() by the previous patch, with an invocation of cgroup_setup_root(). This simplifies the init path and makes cgroup_dummy_root a full hierarchy with its own kernfs_root and all. As this puts the dummy hierarchy on the cgroup_roots list, rename for_each_active_root() to for_each_root() and update its users to skip the dummy root for now. This patch doesn't cause any userland visible behavior changes at this point. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:53 -04:00
Tejun Heo	172a2c0685	cgroup: reorganize cgroup bootstrapping * Fields of init_css_set and css_set_count are now set using initializer instead of programmatically from cgroup_init_early(). * init_cgroup_root() now also takes @opts and performs the optional part of initialization too. The leftover part of cgroup_root_from_opts() is collapsed into its only caller - cgroup_mount(). * Initialization of cgroup_root_count and linking of init_css_set are moved from cgroup_init_early() to to cgroup_init(). None of the early_init users depends on init_css_set being linked. * Subsystem initializations are moved after dummy hierarchy init and init_css_set linking. These changes reorganize the bootstrap logic so that the dummy hierarchy can share the usual hierarchy init path and be made more normal. These changes don't make noticeable behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:53 -04:00
Tejun Heo	5d77381fd8	cgroup: relocate setting of CGRP_DEAD In cgroup_destroy_locked(), move setting of CGRP_DEAD above invocations of kill_css(). This doesn't make any visible behavior difference now but will be used to inhibit manipulating controller enable states of a dying cgroup on the unified hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-03-19 10:23:53 -04:00
Li Zefan	3eb59ec64f	cgroup: fix a failure path in create_css() If online_css() fails, we should remove cgroup files belonging to css->ss. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-03-18 17:15:36 -04:00
Tejun Heo	952aaa1254	cgroup: update cgroup_transfer_tasks() to either succeed or fail cgroup_transfer_tasks() can currently fail in the middle due to memory allocation failure. When that happens, the function just aborts and returns error code and there's no way to tell how many actually got migrated at the point of failure and or to revert the partial migration. Update it to use cgroup_migrate{_add_src\|prepare_dst\|migrate\|finish}() so that the function either succeeds or fails as a whole as long as ->can_attach() doesn't fail. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:03 -05:00
Tejun Heo	0e1d768f1b	cgroup: drop task_lock() protection around task->cgroups For optimization, task_lock() is additionally used to protect task->cgroups. The optimization is pretty dubious as either css_set_rwsem is grabbed anyway or PF_EXITING already protects task->cgroups. It adds only overhead and confusion at this point. Let's drop task_[un]lock() and update comments accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:03 -05:00
Tejun Heo	eaf797abc5	cgroup: update how a newly forked task gets associated with css_set When a new process is forked, cgroup_fork() associates it with the css_set of its parent but doesn't link it into it. After the new process is linked to tasklist, cgroup_post_fork() does the linking. This is problematic for cgroup_transfer_tasks() as there's no way to tell whether there are tasks which are pointing to a css_set but not linked yet. It is impossible to implement an operation which transfer all tasks of a cgroup to another and the current cgroup_transfer_tasks() can easily be tricked into leaving a newly forked process behind if it gets called between cgroup_fork() and cgroup_post_fork(). Let's make association with a css_set and linking atomic by moving it to cgroup_post_fork(). cgroup_fork() sets child->cgroups to init_css_set as a placeholder and cgroup_post_fork() is updated to perform both the association with the parent's cgroup and linking there. This means that a newly created task will point to init_css_set without holding a ref to it much like what it does on the exit path. Empty cg_list is used to indicate that the task isn't holding a ref to the associated css_set. This fixes an actual bug with cgroup_transfer_tasks(); however, I'm not marking it for -stable. The whole thing is broken in multiple other ways which require invasive updates to fix and I don't think it's worthwhile to bother with backporting this particular one. Fortunately, the only user is cpuset and these bugs don't crash the machine. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:03 -05:00
Tejun Heo	1958d2d53d	cgroup: split process / task migration into four steps Currently, process / task migration is a single operation which may fail depending on memory pressure or the involved controllers' ->can_attach() callbacks. One problem with this approach is migration of multiple targets. It's impossible to tell whether a given target will be successfully migrated beforehand and cgroup core can't keep track of enough states to roll back after intermediate failure. This is already an issue with cgroup_transfer_tasks(). Also, we're gonna need multiple target migration for unified hierarchy. This patch splits migration into four stages - cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(), cgroup_migrate() and cgroup_migrate_finish(), where cgroup_migrate_prepare_dst() performs all the operations which may fail due to allocation failure without actually migrating the target. The four separate stages mean that, disregarding ->can_attach() failures, the success or failure of multi target migration can be determined before performing any actual migration. If preparations of all targets succeed, the whole thing will succeed. If not, the whole operation can fail without any side-effect. Since the previous patch to use css_set->mg_tasks to keep track of migration targets, the only thing which may need memory allocation during migration is the target css_sets. cgroup_migrate_prepare() pins all source and target css_sets and link them up. Note that this can be performed without holding threadgroup_lock even if the target is a process. As long as cgroup_mutex is held, no new css_set can be put into play. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:03 -05:00
Tejun Heo	ceb6a081f6	cgroup: separate out cset_group_from_root() from task_cgroup_from_root() This will be used by the planned migration path update. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:02 -05:00
Tejun Heo	b3dc094e93	cgroup: use css_set->mg_tasks to track target tasks during migration Currently, while migrating tasks from one cgroup to another, cgroup_attach_task() builds a flex array of all target tasks; unfortunately, this has a couple issues. * Flex array has size limit. On 64bit, struct task_and_cgroup is 24bytes making the flex element limit around 87k. It is a high number but not impossible to hit. This means that the current cgroup implementation can't migrate a process with more than 87k threads. * Process migration involves memory allocation whose size is dependent on the number of threads the process has. This means that cgroup core can't guarantee success or failure of multi-process migrations as memory allocation failure can happen in the middle. This is in part because cgroup can't grab threadgroup locks of multiple processes at the same time, so when there are multiple processes to migrate, it is imposible to tell how many tasks are to be migrated beforehand. Note that this already affects cgroup_transfer_tasks(). cgroup currently cannot guarantee atomic success or failure of the operation. It may fail in the middle and after such failure cgroup doesn't have enough information to roll back properly. It just aborts with some tasks migrated and others not. To resolve the situation, this patch updates the migration path to use task->cg_list to track target tasks. The previous patch already added css_set->mg_tasks and updated iterations in non-migration paths to include them during task migration. This patch updates migration path to actually make use of it. Instead of putting onto a flex_array, each target task is moved from its css_set->tasks list to css_set->mg_tasks and the migration path keeps trace of all the source css_sets and the associated cgroups. Once all source css_sets are determined, the destination css_set for each is determined, linked to the matching source css_set and put on a separate list. To iterate the target tasks, migration path just needs to iterat through either the source or target css_sets, depending on whether migration has been committed or not, and the tasks on their ->mg_tasks lists. cgroup_taskset is updated to contain the list_heads for source and target css_sets and the iteration cursor. cgroup_taskset_*() are accordingly updated to walk through css_sets and their ->mg_tasks. This resolves the above listed issues with moderate additional complexity. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:01 -05:00
Tejun Heo	c75611282c	cgroup: add css_set->mg_tasks Currently, while migrating tasks from one cgroup to another, cgroup_attach_task() builds a flex array of all target tasks; unfortunately, this has a couple issues. * Flex array has size limit. On 64bit, struct task_and_cgroup is 24bytes making the flex element limit around 87k. It is a high number but not impossible to hit. This means that the current cgroup implementation can't migrate a process with more than 87k threads. * Process migration involves memory allocation whose size is dependent on the number of threads the process has. This means that cgroup core can't guarantee success or failure of multi-process migrations as memory allocation failure can happen in the middle. This is in part because cgroup can't grab threadgroup locks of multiple processes at the same time, so when there are multiple processes to migrate, it is imposible to tell how many tasks are to be migrated beforehand. Note that this already affects cgroup_transfer_tasks(). cgroup currently cannot guarantee atomic success or failure of the operation. It may fail in the middle and after such failure cgroup doesn't have enough information to roll back properly. It just aborts with some tasks migrated and others not. To resolve the situation, we're going to use task->cg_list during migration too. Instead of building a separate array, target tasks will be linked into a dedicated migration list_head on the owning css_set. Tasks on the migration list are treated the same as tasks on the usual tasks list; however, being on a separate list allows cgroup migration code path to keep track of the target tasks by simply keeping the list of css_sets with tasks being migrated, making unpredictable dynamic allocation unnecessary. In prepartion of such migration path update, this patch introduces css_set->mg_tasks list and updates css_set task iterations so that they walk both css_set->tasks and ->mg_tasks. Note that ->mg_tasks isn't used yet. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-25 10:04:01 -05:00
Tejun Heo	f153ad11bc	Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15 Pull in for-3.14-fixes to receive `532de3fc72` ("cgroup: update cgroup_enable_task_cg_lists() to grab siglock") which conflicts with `afeb0f9fd4` ("cgroup: relocate cgroup_enable_task_cg_lists()") and the following cg_lists updates. This is likely to cause further conflicts down the line too, so let's merge it early. As cgroup_enable_task_cg_lists() is relocated in for-3.15, this merge causes conflict in the original position. It's resolved by applying siglock changes to the updated version in the new location. Conflicts: kernel/cgroup.c Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-25 09:56:49 -05:00
Tejun Heo	532de3fc72	cgroup: update cgroup_enable_task_cg_lists() to grab siglock Currently, there's nothing preventing cgroup_enable_task_cg_lists() from missing set PF_EXITING and race against cgroup_exit(). Depending on the timing, cgroup_exit() may finish with the task still linked on css_set leading to list corruption. Fix it by grabbing siglock in cgroup_enable_task_cg_lists() so that PF_EXITING is guaranteed to be visible. This whole on-demand cg_list optimization is extremely fragile and has ample possibility to lead to bugs which can cause things like once-a-year oops during boot. I'm wondering whether the better approach would be just adding "cgroup_disable=all" handling which disables the whole cgroup rather than tempting fate with this on-demand craziness. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org	2014-02-18 18:23:18 -05:00
Li Zefan	dc5736ed7a	cgroup: add a validation check to cgroup_add_cftyps() Fengguang reported this bug: BUG: unable to handle kernel NULL pointer dereference at 0000003c IP: [<cc90b4ad>] cgroup_cfts_commit+0x27/0x1c1 ... Call Trace: [<cc9d1129>] ? kmem_cache_alloc_trace+0x33f/0x3b7 [<cc90c6fc>] cgroup_add_cftypes+0x8f/0xca [<cd78b646>] cgroup_init+0x6a/0x26a [<cd764d7d>] start_kernel+0x4d7/0x57a [<cd7642ef>] i386_start_kernel+0x92/0x96 This happens in a corner case. If CGROUP_SCHED=y but CFS_BANDWIDTH=n && FAIR_GROUP_SCHED=n && RT_GROUP_SCHED=n, we have: cpu_files[] = { { } /* terminate */ } When we pass cpu_files to cgroup_apply_cftypes(), as cpu_files[0].ss is NULL, we'll access NULL pointer. The bug was introduced by commit `de00ffa56e` ("cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes()"). Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-18 18:20:09 -05:00
Li Zefan	6534fd6c15	cgroup: fix memory leak in cgroup_mount() We should free the memory allocated in parse_cgroupfs_options() before calling this function again. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-14 10:52:40 -05:00
Li Zefan	bad3466034	cgroup: fix locking in cgroupstats_build() css_set_lock has been converted to css_set_rwsem, and rwsem can't nest inside rcu_read_lock. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-14 10:52:39 -05:00
Fengguang Wu	430af8ad9d	cgroup: fix coccinelle warnings kernel/cgroup.c:2256:1-3: WARNING: PTR_RET can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: coccinelle/api/ptr_ret.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-13 16:42:45 -05:00
Tejun Heo	8541fecc04	cgroup: unexport functions With module support gone, a lot of functions no longer need to be exported. Unexport them. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:43 -05:00
Tejun Heo	9db8de3722	cgroup: cosmetic updates to cgroup_attach_task() cgroup_attach_task() is planned to go through restructuring. Let's tidy it up a bit in preparation. * Update cgroup_attach_task() to receive the target task argument in @leader instead of @tsk. * Rename @tsk to @task. * Rename @retval to @ret. This is purely cosmetic. v2: get_nr_threads() was using uninitialized @task instead of @leader. Fixed. Reported by Dan Carpenter. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Dan Carpenter <dan.carpenter@oracle.com>	2014-02-13 06:58:43 -05:00
Tejun Heo	bc668c7519	cgroup: remove cgroup_taskset_cur_css() and cgroup_taskset_size() The two functions don't have any users left. Remove them along with cgroup_taskset->cur_cgrp. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:43 -05:00
Tejun Heo	cb0f1fe9ba	cgroup: move css_set_rwsem locking outside of cgroup_task_migrate() Instead of repeatedly locking and unlocking css_set_rwsem inside cgroup_task_migrate(), update cgroup_attach_task() to grab it outside of the loop and update cgroup_task_migrate() to use put_css_set_locked(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:41 -05:00
Tejun Heo	89c5509b0d	cgroup: separate out put_css_set_locked() and remove put_css_set_taskexit() put_css_set() is performed in two steps - it first tries to put without grabbing css_set_rwsem if such put wouldn't make the count zero. If that fails, it puts after write-locking css_set_rwsem. This patch separates out the second phase into put_css_set_locked() which should be called with css_set_rwsem locked. Also, put_css_set_taskexit() is droped and put_css_set() is made to take @taskexit. There are only a handful users of these functions. No point in providing different variants. put_css_locked() will be used by later changes. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:40 -05:00
Tejun Heo	889ed9ceaa	cgroup: remove css_scan_tasks() css_scan_tasks() doesn't have any user left. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:40 -05:00
Tejun Heo	96d365e0b8	cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem Currently there are two ways to walk tasks of a cgroup - css_task_iter_start/next/end() and css_scan_tasks(). The latter builds on the former but allows blocking while iterating. Unfortunately, the way css_scan_tasks() is implemented is rather nasty, it uses a priority heap of pointers to extract some number of tasks in task creation order and loops over them invoking the callback and repeats that until it reaches the end. It requires either preallocated heap or may fail under memory pressure, while unlikely to be problematic, the complexity is O(N^2), and in general just nasty. We're gonna convert all css_scan_users() to css_task_iter_start/next/end() and remove css_scan_users(). As css_scan_tasks() users may block, let's convert css_set_lock to a rwsem so that tasks can block during css_task_iter_*() is in progress. While this does increase the chance of possible deadlock scenarios, given the current usage, the probability is relatively low, and even if that happens, the right thing to do is updating the iteration in the similar way to css iterators so that it can handle blocking. Most conversions are trivial; however, task_cgroup_path() now expects to be called with css_set_rwsem locked instead of locking itself. This is because the function is called with RCU read lock held and rwsem locking should nest outside RCU read lock. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:40 -05:00
Tejun Heo	e406d1cfff	cgroup: reimplement cgroup_transfer_tasks() without using css_scan_tasks() Reimplement cgroup_transfer_tasks() so that it repeatedly fetches the first task in the cgroup and then tranfers it. This achieves the same result without using css_scan_tasks() which is scheduled to be removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:39 -05:00
Tejun Heo	07bc356ed2	cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count() cgroup_task_count() read-locks css_set_lock and walks all tasks to count them and then returns the result. The only thing all the users want is determining whether the cgroup is empty or not. This patch implements cgroup_has_tasks() which tests whether cgroup->cset_links is empty, replaces all cgroup_task_count() usages and unexports it. Note that the test isn't synchronized. This is the same as before. The test has always been racy. This will help planned css_set locking update. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2014-02-13 06:58:39 -05:00
Tejun Heo	afeb0f9fd4	cgroup: relocate cgroup_enable_task_cg_lists() Move it above so that prototype isn't necessary. Let's also move the definition of use_task_css_set_links next to it. This is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:39 -05:00
Tejun Heo	56fde9e01d	cgroup: enable task_cg_lists on the first cgroup mount Tasks are not linked on their css_sets until cgroup task iteration is actually used. This is to avoid incurring overhead on the fork and exit paths for systems which have cgroup compiled in but don't use it. This lazy binding also affects the task migration path. It has to be careful so that it doesn't link tasks to css_sets when task_cg_lists linking is not enabled yet. Unfortunately, this conditional linking in the migration path interferes with planned migration updates. This patch moves the lazy binding a bit earlier, to the first cgroup mount. It's a clear indication that cgroup is being used on the system and task_cg_lists linking is highly likely to be enabled soon anyway through "tasks" and "cgroup.procs" files. This allows cgroup_task_migrate() to always link @tsk->cg_list. Note that it may still race with cgroup_post_fork() but who wins that race is inconsequential. While at it, make use_task_css_set_links a bool, add sanity checks in cgroup_enable_task_cg_lists() and css_task_iter_start(), and update the former so that it's guaranteed and assumes to run only once. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:38 -05:00
Tejun Heo	3558557305	cgroup: drop CGRP_ROOT_SUBSYS_BOUND Before kernfs conversion, due to the way super_block lookup works, cgroup roots were created and made visible before being fully initialized. This in turn required a special flag to mark that the root hasn't been fully initialized so that the destruction path can tell fully bound ones from half initialized. That flag is CGRP_ROOT_SUBSYS_BOUND and no longer necessary after the kernfs conversion as the lookup and creation of new root are atomic w.r.t. cgroup_mutex. This patch removes the flag and passes the requests subsystem mask to cgroup_setup_root() so that it can set the respective mask bits as subsystems are bound. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:38 -05:00
Tejun Heo	d3ba07c3aa	cgroup: disallow xattr, release_agent and name if sane_behavior Disallow more mount options if sane_behavior. Note that xattr used to generate warning. While at it, simplify option check in cgroup_mount() and update sane_behavior comment in cgroup.h. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-13 06:58:38 -05:00
Tejun Heo	1a11533fbd	Revert "cgroup: use an ordered workqueue for cgroup destruction" This reverts commit `ab3f5faa62`. Explanation from Hugh: It's because more thorough testing, by others here, found that it wasn't always solving the problem: so I asked Tejun privately to hold off from sending it in, until we'd worked out why not. Most of our testing being on a v3,11-based kernel, it was perfectly possible that the problem was merely our own e.g. missing Tejun's `8a2b753844` ("workqueue: fix ordered workqueues in NUMA setups"). But that turned out not to be enough to fix it either. Then Filipe pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched() before we ever get to put the offline on to the workqueue: by the time we get to the workqueue, the ordering has already been lost. So, thanks for the Acks, but I'm afraid that this ordered workqueue solution is just not good enough: we should simply forget that patch and provide a different answer." Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com>	2014-02-12 19:08:28 -05:00
Tejun Heo	776f02fa4e	cgroup: remove cgroupfs_root->refcnt Currently, cgroupfs_root and its ->top_cgroup are separated reference counted and the latter's is ignored. There's no reason to do this separately. This patch removes cgroupfs_root->refcnt and destroys cgroupfs_root when the top_cgroup is released. * cgroup_put() updated to ignore cgroup_is_dead() test for top cgroups. cgroup_free_fn() updated to handle root destruction when releasing a top cgroup. * As root destruction is now bounced through cgroup destruction, it is asynchronous. Update cgroup_mount() so that it waits for pending release which is currently implemented using msleep(). Converting this to proper wait_queue isn't hard but likely unnecessary. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:50 -05:00
Tejun Heo	3c9c825b8b	cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it atomic_t root->number_of_cgroups is currently an integer protected with cgroup_mutex. Except for sanity checks and proc reporting, the only place it's used is to check whether the root has any child during remount; however, this is a bit flawed as the counter is not decremented when the cgroup is unlinked but when it's released, meaning that there could be an extended period where all cgroups are removed but remount is still not allowed because some internal objects are lingering. While not perfect either, it'd be better to use emptiness test on root->top_cgroup.children. This patch updates cgroup_remount() to test top_cgroup's children instead, which makes number_of_cgroups only actual usage statistics printing in proc implemented in proc_cgroupstats_show(). Let's shorten its name and make it an atomic_t so that we don't have to worry about its synchronization. It's purely auxiliary at this point. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:50 -05:00
Tejun Heo	e61734c55c	cgroup: remove cgroup->name cgroup->name handling became quite complicated over time involving dedicated struct cgroup_name for RCU protection. Now that cgroup is on kernfs, we can drop all of it and simply use kernfs_name/path() and friends. Replace cgroup->name and all related code with kernfs name/path constructs. * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top of kernfs counterparts, which involves semantic changes. pr_cont_cgroup_name() and pr_cont_cgroup_path() added. * cgroup->name handling dropped from cgroup_rename(). * All users of cgroup_name/path() updated to the new semantics. Users which were formatting the string just to printk them are converted to use pr_cont_cgroup_name/path() instead, which simplifies things quite a bit. As cgroup_name() no longer requires RCU read lock around it, RCU lockings which were protecting only cgroup_name() are removed. v2: Comment above oom_info_lock updated as suggested by Michal. v3: dummy_top doesn't have a kn associated and pr_cont_cgroup_name/path() ended up calling the matching kernfs functions with NULL kn leading to oops. Test for NULL kn and print "/" if so. This issue was reported by Fengguang Wu. v4: Rebased on top of `0ab02ca8f8` ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2014-02-12 09:29:50 -05:00
Tejun Heo	6f30558f37	cgroup: make cgroup hold onto its kernfs_node cgroup currently releases its kernfs_node when it gets removed. While not buggy, this makes cgroup->kn access rules complicated than necessary and leads to things like get/put protection around kernfs_remove() in cgroup_destroy_locked(). In addition, we want to use kernfs_name/path() and friends but also want to be able to determine a cgroup's name between removal and release. This patch makes cgroup hold onto its kernfs_node until freed so that cgroup->kn is always accessible. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:50 -05:00
Tejun Heo	21a2d3430b	cgroup: simplify dynamic cftype addition and removal Dynamic cftype addition and removal using cgroup_add/rm_cftypes() respectively has been quite hairy due to vfs i_mutex. As i_mutex nests outside cgroup_mutex, cgroup_mutex has to be released and regrabbed on each iteration through the hierarchy complicating the process. Now that i_mutex is no longer in play, it can be simplified. * Just holding cgroup_tree_mutex is enough. No need to meddle with cgroup_mutex. * No reason to play the unlock - relock - check serial_nr dancing. Everything can be atomically while holding cgroup_tree_mutex. * cgroup_cfts_prepare() is replaced with direct locking of cgroup_tree_mutex. * cgroup_cfts_commit() no longer fiddles with locking. It just applies the cftypes change to the existing cgroups in the hierarchy. Renamed to cgroup_cfts_apply(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:49 -05:00
Tejun Heo	0adb070426	cgroup: remove cftype_set cftype_set was added primarily to allow registering the same cftype array more than once for different subsystems. Nobody uses or needs such thing and it's already broken because each cftype has ->ss pointer which is initialized during registration. Let's add list_head ->node to cftype and use the first cftype entry in the array to link them instead of allocating separate cftype_set. While at it, trigger WARN if cft seems previously initialized during registration. This simplifies cftype handling a bit. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:48 -05:00
Tejun Heo	80b1358699	cgroup: relocate cgroup_rm_cftypes() cftype handling is about to be revamped. Relocate cgroup_rm_cftypes() above cgroup_add_cftypes() in preparation. This is pure relocation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:48 -05:00
Tejun Heo	86bf4b6875	cgroup: warn if "xattr" is specified with "sane_behavior" Mount option "xattr" is no longer necessary as it's enabled by default on kernfs. Warn if "xattr" is specified with "sane_behavior" so that the option can be removed in the future. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-12 09:29:48 -05:00
Tejun Heo	2bd59d48eb	cgroup: convert to kernfs cgroup filesystem code was derived from the original sysfs implementation which was heavily intertwined with vfs objects and locking with the goal of re-using the existing vfs infrastructure. That experiment turned out rather disastrous and sysfs switched, a long time ago, to distributed filesystem model where a separate representation is maintained which is queried by vfs. Unfortunately, cgroup stuck with the failed experiment all these years and accumulated even more problems over time. Locking and object lifetime management being entangled with vfs is probably the most egregious. vfs is never designed to be misused like this and cgroup ends up jumping through various convoluted dancing to make things work. Even then, operations across multiple cgroups can't be done safely as it'll deadlock with rename locking. Recently, kernfs is separated out from sysfs so that it can be used by users other than sysfs. This patch converts cgroup to use kernfs, which will bring the following benefits. * Separation from vfs internals. Locking and object lifetime management is contained in cgroup proper making things a lot simpler. This removes significant amount of locking convolutions, hairy object lifetime rules and the restriction on multi-cgroup operations. * Can drop a lot of code to implement filesystem interface as most are provided by kernfs. * Proper "severing" semantics, which allows controllers to not worry about lingering file accesses after offline. While the preceding patches did as much as possible to make the transition less painful, large part of the conversion has to be one discrete step making this patch rather large. The rest of the commit message lists notable changes in different areas. Overall ------- * vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn, cgroupfs_root->sb w/ ->kf_root. * All dentry accessors are removed. Helpers to map from kernfs constructs are added. * All vfs plumbing around dentry, inode and bdi removed. * cgroup_mount() now directly looks for matching root and then proceeds to create a new one if not found. Synchronization and object lifetime ----------------------------------- * vfs inode locking removed. Among other things, this removes the need for the convolution in cgroup_cfts_commit(). Future patches will further simplify it. * vfs refcnting replaced with cgroup internal ones. cgroup->refcnt, cgroupfs_root->refcnt added. cgroup_put_root() now directly puts root->refcnt and when it reaches zero proceeds to destroy it thus merging cgroup_put_root() and the former cgroup_kill_sb(). Simliarly, cgroup_put() now directly schedules cgroup_free_rcu() when refcnt reaches zero. * Unlike before, kernfs objects don't hold onto cgroup objects. When cgroup destroys a kernfs node, all existing operations are drained and the association is broken immediately. The same for cgroupfs_roots and mounts. * All operations which come through kernfs guarantee that the associated cgroup is and stays valid for the duration of operation; however, there are two paths which need to find out the associated cgroup from dentry without going through kernfs - css_tryget_from_dir() and cgroupstats_build(). For these two, kernfs_node->priv is RCU managed so that they can dereference it under RCU read lock. File and directory handling --------------------------- * File and directory operations converted to kernfs_ops and kernfs_syscall_ops. * xattrs is implicitly supported by kernfs. No need to worry about it from cgroup. This means that "xattr" mount option is no longer necessary. A future patch will add a deprecated warning message when sane_behavior. * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a private copy of one of the kernfs_ops to set its atomic_write_len. cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated to handle it. * cftype->lockdep_key added so that kernfs lockdep annotation can be per cftype. * Inidividual file entries and open states are now managed by kernfs. No need to worry about them from cgroup. cfent, cgroup_open_file and their friends are removed. * kernfs_nodes are created deactivated and kernfs_activate() invocations added to places where creation of new nodes are committed. * cgroup_rmdir() uses kernfs_[un]break_active_protection() for self-removal. v2: - Li pointed out in an earlier patch that specifying "name=" during mount without subsystem specification should succeed if there's an existing hierarchy with a matching name although it should fail with -EINVAL if a new hierarchy should be created. Prior to the conversion, this used by handled by deferring failure from NULL return from cgroup_root_from_opts(), which was necessary because root was being created before checking for existing ones. Note that cgroup_root_from_opts() returned an ERR_PTR() value for error conditions which require immediate mount failure. As we now have separate search and creation steps, deferring failure from cgroup_root_from_opts() is no longer necessary. cgroup_root_from_opts() is updated to always return ERR_PTR() value on failure. - The logic to match existing roots is updated so that a mount attempt with a matching name but different subsys_mask are rejected. This was handled by a separate matching loop under the comment "Check for name clashes with existing mounts" but got lost during conversion. Merge the check into the main search loop. - Add __rcu __force casting in RCU_INIT_POINTER() in cgroup_destroy_locked() to avoid the sparse address space warning reported by kbuild test bot. Maybe we want an explicit interface to use kn->priv as RCU protected pointer? v3: Make CONFIG_CGROUPS select CONFIG_KERNFS. v4: Rebased on top of `0ab02ca8f8` ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: kbuild test robot fengguang.wu@intel.com>	2014-02-11 11:52:49 -05:00
Tejun Heo	f2e85d574e	cgroup: relocate functions in preparation of kernfs conversion Relocate cgroup_init/exit_root_id(), cgroup_free_root(), cgroup_kill_sb() and cgroup_file_name() in preparation of kernfs conversion. These are pure relocations to make kernfs conversion easier to follow. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:49 -05:00
Tejun Heo	59f5296b51	cgroup: misc preps for kernfs conversion * Un-inline seq_css(). After kernfs conversion, the function will need to dereference internal data structures. * Add cgroup_get/put_root() and replace direct super_block->s_active manipulatinos with them. These will be converted to kernfs_root refcnting. * Add cgroup_get/put() and replace dget/put() on cgrp->dentry with them. These will be converted to kernfs refcnting. * Update current_css_set_cg_links_read() to use cgroup_name() instead of reaching into the dentry name. The end result is the same. These changes don't make functional differences but will make transition to kernfs easier. v2: Rebased on top of `0ab02ca8f8` ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:49 -05:00
Tejun Heo	b166492406	cgroup: introduce cgroup_ino() mm/memory-failure.c::hwpoison_filter_task() has been reaching into cgroup to extract the associated ino to be used as a filtering criterion. This is an implementation detail which shouldn't be depended upon from outside cgroup proper and is about to change with the scheduled kernfs conversion. This patch introduces a proper interface to determine the associated ino, cgroup_ino(), and updates hwpoison_filter_task() to use it instead of reaching directly into cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wu Fengguang <fengguang.wu@intel.com>	2014-02-11 11:52:49 -05:00
Tejun Heo	2da440a26c	cgroup: introduce cgroup_init/exit_cftypes() Factor out cft->ss initialization into cgroup_init_cftypes() from cgroup_add_cftypes() and add cft->ss clearing to cgroup_rm_cftypes() through cgroup_exit_cftypes(). This doesn't make any meaningful difference now but the two new functions will be expanded during kernfs transition. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	5f46990787	cgroup: update the meaning of cftype->max_write_len cftype->max_write_len is used to extend the maximum size of writes. It's interpreted in such a way that the actual maximum size is one less than the specified value. The default size is defined by CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its value is decremented by 1 and then compared for equality with max size, which means that the actual default size is CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars. There's no point in having a limit that low. Update its definition so that it means the actual string length sans termination and anything below PAGE_SIZE-1 is treated as PAGE_SIZE-1. .max_write_len for "release_agent" is updated to PATH_MAX-1 and cgroup_release_agent_write() is updated so that the redundant strlen() check is removed and it uses strlcpy() instead of strcpy(). .max_write_len initializations in blk-throttle.c and cfq-iosched.c are no longer necessary and removed. The one in cpuset is kept unchanged as it's an approximated value to begin with. This will also make transition to kernfs smoother. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	de00ffa56e	cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes() Currently, cgroup_subsys->base_cftypes registration is different from dynamic cftypes registartion. Instead of going through cgroup_add_cftypes(), cgroup_init_subsys() invokes cgroup_init_cftsets() which makes use of cgroup_subsys->base_cftset which doesn't involve dynamic allocation. While avoiding dynamic allocation is somewhat nice, having two separate paths for cftypes registration is nasty, especially as we're planning to add more operations during cftypes registration. This patch drops cgroup_init_cftsets() and cgroup_subsys->base_cftset and registers base_cftypes using cgroup_add_cftypes(). This is done as a separate step in cgroup_init() instead of a part of cgroup_init_subsys(). This is because cgroup_init_subsys() can be called very early during boot when kmalloc() isn't available yet. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	8d7e6fb0a1	cgroup: update cgroup name handling Straightforward updates to cgroup name handling in preparation of kernfs conversion. * cgroup_alloc_name() is updated to take const char * isntead of dentry * for name source. * cgroup name formatting is separated out into cgroup_file_name(). While at it, buffer length protection is added. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	d427dfeb12	cgroup: factor out cgroup_setup_root() from cgroup_mount() Factor out new root initialization into cgroup_setup_root() from cgroup_mount(). This makes it easier to follow and will ease kernfs conversion. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	8e30e2b8ba	cgroup: restructure locking and error handling in cgroup_mount() cgroup is scheduled to be converted to kernfs. After conversion, cgroup_mount() won't use the sget() machinery for finding out existing super_blocks but instead would do that directly. It'll search the existing cgroupfs_roots for a matching one and create a new one iff a match doesn't exist. To ease such conversion, this patch restructures locking and error handling of the function. cgroup_tree_mutex and cgroup_mutex are grabbed from the get-go and held until return. For now, due to the way vfs locks nest outside cgroup mutexes, the two cgroup mutexes are temporarily dropped across sget() and inode mutex locking, which looks quite ridiculous; however, these will be removed through kernfs conversion and structuring the code this way makes the conversion less painful. The error goto labels are consolidated to two. This looks unwieldy now but the next patch will factor out creation of new root into a separate function with accompanying error handling and it'll look a lot better. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	4ac0601744	cgroup: release cgroup_mutex over file removals Now that cftypes and all tree modification operations are protected by cgroup_tree_mutex, we can drop cgroup_mutex while deleting files and directories. Drop cgroup_mutex over removals. This doesn't make any noticeable difference now but is to help kernfs conversion. In kernfs, removals are sync points which drain in-flight operations as those operations would grab cgroup_mutex, trying to delete under cgroup_mutex would deadlock. This can be resolved by just holding the outer cgroup_tree_mutex which nests outside both kernfs active reference and cgroup_mutex. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:47 -05:00
Tejun Heo	ace2bee813	cgroup: introduce cgroup_tree_mutex Currently cgroup uses combination of inode->i_mutex'es and cgroup_mutex for synchronization. With the scheduled kernfs conversion, i_mutex'es will be removed. Unfortunately, just using cgroup_mutex isn't possible. All kernfs file and syscall operations, most of which require grabbing cgroup_mutex, will be called with kernfs active ref held and, if we try to perform kernfs removals under cgroup_mutex, it can deadlock as kernfs_remove() tries to drain the target node. Let's introduce a new outer mutex, cgroup_tree_mutex, which protects stuff used during hierarchy changing operations - cftypes and all the operations which may affect the cgroupfs. It also covers css association and iteration. This allows cgroup_css(), for_each_css() and other css iterators to be called under cgroup_tree_mutex. The new mutex will nest above both kernfs's active ref protection and cgroup_mutex. By protecting tree modifications with a separate outer mutex, we can get rid of the forementioned deadlock condition. Actual file additions and removals now require cgroup_tree_mutex instead of cgroup_mutex. Currently, cgroup_tree_mutex is never used without cgroup_mutex; however, we'll soon add hierarchy modification sections which are only protected by cgroup_tree_mutex. In the future, we might want to make the locking more granular by better splitting the coverages of the two mutexes. For now, this should do. v2: Rebased on top of `0ab02ca8f8` ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:47 -05:00
Tejun Heo	5a17f543ed	cgroup: improve css_from_dir() into css_tryget_from_dir() css_from_dir() returns the matching css (cgroup_subsys_state) given a dentry and subsystem. The function doesn't pin the css before returning and requires the caller to be holding RCU read lock or cgroup_mutex and handling pinning on the caller side. Given that users of the function are likely to want to pin the returned css (both existing users do) and that getting and putting css's are very cheap, there's no reason for the interface to be tricky like this. Rename css_from_dir() to css_tryget_from_dir() and make it try to pin the found css and return it only if pinning succeeded. The callers are updated so that they no longer do RCU locking and pinning around the function and just use the returned css. This will also ease converting cgroup to kernfs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2014-02-11 11:52:47 -05:00
Tejun Heo	398f878789	Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15 Pull for-3.14-fixes to receive `0ab02ca8f8` ("cgroup: protect modifications to cgroup_idr with cgroup_mutex") prior to kernfs conversion series to avoid non-trivial conflicts. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-11 11:02:59 -05:00
Li Zefan	0ab02ca8f8	cgroup: protect modifications to cgroup_idr with cgroup_mutex Setup cgroupfs like this: # mount -t cgroup -o cpuacct xxx /cgroup # mkdir /cgroup/sub1 # mkdir /cgroup/sub2 Then run these two commands: # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /mnt/sub1/tmp; } & # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /mnt/sub2/tmp; } & After seconds you may see this warning: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0() idr_remove called for id=6 which is not allocated. ... Call Trace: [<ffffffff8156063c>] dump_stack+0x7a/0x96 [<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0 [<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50 [<ffffffff81300aa7>] sub_remove+0x87/0x1b0 [<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0 [<ffffffff81300bf5>] idr_remove+0x25/0xd0 [<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0 [<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0 [<ffffffff8107cdbc>] process_one_work+0x26c/0x550 [<ffffffff8107eefe>] worker_thread+0x12e/0x3b0 [<ffffffff81085f96>] kthread+0xe6/0xf0 [<ffffffff81570bac>] ret_from_fork+0x7c/0xb0 ---[ end trace 2d1577ec10cf80d0 ]--- It's because allocating/removing cgroup ID is not properly synchronized. The bug was introduced when we converted cgroup_ida to cgroup_idr. While synchronization is already done inside ida_simple_{get,remove}(), users are responsible for concurrent calls to idr_{alloc,remove}(). tj: Refreshed on top of `b58c89986a` ("cgroup: fix error return from cgroup_create()"). Fixes: `4e96ee8e98` ("cgroup: convert cgroup_ida to cgroup_idr") Cc: <stable@vger.kernel.org> #3.12+ Reported-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-11 10:38:30 -05:00
Tejun Heo	1a698a4aba	Merge branch 'for-3.14-fixes' into for-3.15 Pending kernfs conversion depends on fixes in for-3.14-fixes. Pull it into for-3.15. Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-08 10:37:14 -05:00
Tejun Heo	3417ae1f5f	cgroup: remove cgroup_root_mutex cgroup_root_mutex was added to avoid deadlock involving namespace_sem via cgroup_show_options(). It added a lot of overhead for the small purpose of it and, because it's nested under cgroup_mutex, it has very limited usefulness. The previous patch made cgroup_show_options() not use cgroup_root_mutex, so nobody needs it anymore. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-08 10:37:01 -05:00
Tejun Heo	69e943b7d3	cgroup: update locking in cgroup_show_options() cgroup_show_options() grabs cgroup_root_mutex to protect the options changing while printing; however, holding root_mutex or not doesn't really make much difference for the function. subsys_mask can be atomically tested and most of the options aren't allowed to change anyway once mounted. The only field which needs synchronization is ->release_agent_path. This patch introduces a dedicated spinlock to synchronize accesses to the field and drops cgroup_root_mutex locking from cgroup_show_options(). The next patch will remove cgroup_root_mutex. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-08 10:36:58 -05:00
Tejun Heo	aec25020f5	cgroup: rename cgroup_subsys->subsys_id to ->id It's no longer referenced outside cgroup core, so renaming is easy. Let's rename it for consistency & brevity. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-08 10:36:58 -05:00
Tejun Heo	073219e995	cgroup: clean up cgroup_subsys names and initialization cgroup_subsys is a bit messier than it needs to be. * The name of a subsys can be different from its internal identifier defined in cgroup_subsys.h. Most subsystems use the matching name but three - cpu, memory and perf_event - use different ones. * cgroup_subsys_id enums are postfixed with _subsys_id and each cgroup_subsys is postfixed with _subsys. cgroup.h is widely included throughout various subsystems, it doesn't and shouldn't have claim on such generic names which don't have any qualifier indicating that they belong to cgroup. * cgroup_subsys->subsys_id should always equal the matching cgroup_subsys_id enum; however, we require each controller to initialize it and then BUG if they don't match, which is a bit silly. This patch cleans up cgroup_subsys names and initialization by doing the followings. * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each cgroup_subsys with _cgrp_subsys. * With the above, renaming subsys identifiers to match the userland visible names doesn't cause any naming conflicts. All non-matching identifiers are renamed to match the official names. cpu_cgroup -> cpu mem_cgroup -> memory perf -> perf_event * controllers no longer need to initialize ->subsys_id and ->name. They're generated in cgroup core and set automatically during boot. * Redundant cgroup_subsys declarations removed. * While updating BUG_ON()s in cgroup_init_early(), convert them to WARN()s. BUGging that early during boot is stupid - the kernel can't print anything, even through serial console and the trap handler doesn't even link stack frame properly for back-tracing. This patch doesn't introduce any behavior changes. v2: Rebased on top of `fe1217c4f3` ("net: net_cls: move cgroupfs classid handling into core"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Ingo Molnar <mingo@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Thomas Graf <tgraf@suug.ch>	2014-02-08 10:36:58 -05:00
Tejun Heo	3ed80a62bf	cgroup: drop module support With module supported dropped from net_prio, no controller is using cgroup module support. None of actual resource controllers can be built as a module and we aren't gonna add new controllers which don't control resources. This patch drops module support from cgroup. * cgroup_[un]load_subsys() and cgroup_subsys->module removed. * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(), cgroup_subsys.h now uses IS_ENABLED() directly. * enum cgroup_subsys_id now exactly matches the list of enabled controllers as ordered in cgroup_subsys.h. * cgroup_subsys[] is now a contiguously occupied array. Size specification is no longer necessary and dropped. * for_each_builtin_subsys() is removed and for_each_subsys() is updated to not require any locking. * module ref handling is removed from rebind_subsystems(). * Module related comments dropped. v2: Rebased on top of `fe1217c4f3` ("net: net_cls: move cgroupfs classid handling into core"). v3: Added {} around the if (need_forkexit_callback) block in cgroup_post_fork() for readability as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-08 10:36:58 -05:00
Tejun Heo	48573a8933	cgroup: fix locking in cgroup_cfts_commit() cgroup_cfts_commit() walks the cgroup hierarchy that the target subsystem is attached to and tries to apply the file changes. Due to the convolution with inode locking, it can't keep cgroup_mutex locked while iterating. It currently holds only RCU read lock around the actual iteration and then pins the found cgroup using dget(). Unfortunately, this is incorrect. Although the iteration does check cgroup_is_dead() before invoking dget(), there's nothing which prevents the dentry from going away inbetween. Note that this is different from the usual css iterations where css_tryget() is used to pin the css - css_tryget() tests whether the css can be pinned and fails if not. The problem can be solved by simply holding cgroup_mutex instead of RCU read lock around the iteration, which actually reduces LOC. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org	2014-02-08 10:26:34 -05:00
Tejun Heo	b58c89986a	cgroup: fix error return from cgroup_create() cgroup_create() was returning 0 after allocation failures. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org	2014-02-08 10:26:33 -05:00
Tejun Heo	eb46bf8969	cgroup: fix error return value in cgroup_mount() When cgroup_mount() fails to allocate an id for the root, it didn't set ret before jumping to unlock_drop ending up returning 0 after a failure. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org	2014-02-08 10:26:33 -05:00
Hugh Dickins	ab3f5faa62	cgroup: use an ordered workqueue for cgroup destruction Sometimes the cleanup after memcg hierarchy testing gets stuck in mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0. There may turn out to be several causes, but a major cause is this: the workitem to offline parent can get run before workitem to offline child; parent's mem_cgroup_reparent_charges() circles around waiting for the child's pages to be reparented to its lrus, but it's holding cgroup_mutex which prevents the child from reaching its mem_cgroup_reparent_charges(). Just use an ordered workqueue for cgroup_destroy_wq. tj: Committing as the temporary fix until the reverse dependency can be removed from memcg. Comment updated accordingly. Fixes: `e5fca243ab` ("cgroup: use a dedicated workqueue for cgroup destruction") Suggested-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Tejun Heo <tj@kernel.org>	2014-02-07 10:21:12 -05:00
Linus Torvalds	f075e0f699	Merge branch 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "The bulk of changes are cleanups and preparations for the upcoming kernfs conversion. - cgroup_event mechanism which is and will be used only by memcg is moved to memcg. - pidlist handling is updated so that it can be served by seq_file. Also, the list is not sorted if sane_behavior. cgroup documentation explicitly states that the file is not sorted but it has been for quite some time. - All cgroup file handling now happens on top of seq_file. This is to prepare for kernfs conversion. In addition, all operations are restructured so that they map 1-1 to kernfs operations. - Other cleanups and low-pri fixes" * 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits) cgroup: trivial style updates cgroup: remove stray references to css_id doc: cgroups: Fix typo in doc/cgroups cgroup: fix fail path in cgroup_load_subsys() cgroup: fix missing unlock on error in cgroup_load_subsys() cgroup: remove for_each_root_subsys() cgroup: implement for_each_css() cgroup: factor out cgroup_subsys_state creation into create_css() cgroup: combine css handling loops in cgroup_create() cgroup: reorder operations in cgroup_create() cgroup: make for_each_subsys() useable under cgroup_root_mutex cgroup: css iterations and css_from_dir() are safe under cgroup_mutex cgroup: unify pidlist and other file handling cgroup: replace cftype->read_seq_string() with cftype->seq_show() cgroup: attach cgroup_open_file to all cgroup files cgroup: generalize cgroup_pidlist_open_file cgroup: unify read path so that seq_file is always used cgroup: unify cgroup_write_X64() and cgroup_write_string() cgroup: remove cftype->read(), ->read_map() and ->write() hugetlb_cgroup: convert away from cftype->read() ...	2014-01-21 17:51:34 -08:00
SeongJae Park	dd4b0a4676	cgroup: trivial style updates * Place newline before function opening brace in cgroup_kill_sb(). * Insert space before assignment in attach_task_by_pid() tj: merged two patches into one. Signed-off-by: SeongJae Park <sj38.park@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-01-18 08:56:11 -05:00
Li Zefan	c1a71504e9	cgroup: don't recycle cgroup id until all csses' have been destroyed Hugh reported this bug: > CONFIG_MEMCG_SWAP is broken in 3.13-rc. Try something like this: > > mkdir -p /tmp/tmpfs /tmp/memcg > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs > mount -t cgroup -o memory memcg /tmp/memcg > mkdir /tmp/memcg/old > echo 512M >/tmp/memcg/old/memory.limit_in_bytes > echo $$ >/tmp/memcg/old/tasks > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null > echo $$ >/tmp/memcg/tasks > rmdir /tmp/memcg/old > sleep 1 # let rmdir work complete > mkdir /tmp/memcg/new > umount /tmp/tmpfs > dmesg \| grep WARNING > rmdir /tmp/memcg/new > umount /tmp/memcg > > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91 > res_counter_uncharge_locked+0x1f/0x2f() > > Breakage comes from `34c00c319c` ("memcg: convert to use cgroup id"). > > The lifetime of a cgroup id is different from the lifetime of the > css id it replaced: memsw's css_get()s do nothing to hold on to the > old cgroup id, it soon gets recycled to a new cgroup, which then > mysteriously inherits the old's swap, without any charge for it. Instead of removing cgroup id right after all the csses have been offlined, we should do that after csses have been destroyed. To make sure an invalid css pointer won't be returned after the css is destroyed, make sure css_from_id() returns NULL in this case. tj: Updated comment to note planned changes for cgrp->id. Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-12-17 08:11:52 -05:00
Vladimir Davydov	10bf2f7e7d	cgroup: fix fail path in cgroup_load_subsys() Calling cgroup_unload_subsys() from cgroup_load_subsys() after online_css() failure will result in a NULL ptr dereference on attempt to offline_css(), because online_css() only assigns css to cgroup on success. Let's fix that by skipping calls to offline_css() and css_free() in cgroup_unload_subsys() if there is no css, and freeing css in cgroup_load_subsys() on online_css() failure. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-12-13 15:46:49 -05:00
Wei Yongjun	0be8669dd5	cgroup: fix missing unlock on error in cgroup_load_subsys() Add the missing unlock before return from function cgroup_load_subsys() in the error handling case. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-12-12 10:45:36 -05:00
Tejun Heo	b85d20404c	cgroup: remove for_each_root_subsys() After the previous patch which introduced for_each_css(), for_each_root_subsys() only has two users left. This patch replaces it with for_each_subsys() + explicit subsys_mask testing and remove for_each_root_subsys() along with cgroupfs_root->subsys_list handling. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2013-12-06 15:11:57 -05:00

... 2 3 4 5 6 ...

884 Commits