In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs. This patch renames
cgroup_pidlist_open_file to cgroup_open_file and updates it so that it
only contains a field to identify the specific file, ->cfe, and an
opaque ->priv pointer. When cgroup is converted to kernfs, this will
be replaced by kernfs_open_file which contains about the same
information.
As whether the file is "cgroup.procs" or "tasks" should now be
determined from cgroup_open_file->cfe, the cftype->private for the two
files now carry the file type and cgroup_pidlist_start() reads the
type through cfe->type->private. This makes the distinction between
cgroup_tasks_open() and cgroup_procs_open() unnecessary.
cgroup_pidlist_open() is now directly used as the open method.
This patch doesn't make any behavior changes.
v2: Refreshed on top of the updated "cgroup: introduce struct
cgroup_pidlist_open_file".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
With the recent removal of cftype->read() and ->read_map(), only three
operations are remaining, ->read_u64(), ->read_s64() and
->read_seq_string(). Currently, the first two are handled directly
while the last is handled through seq_file.
It is trivial to serve the first two through the seq_file path too.
This patch restructures read path so that all operations are served
through cgroup_seqfile_show(). This makes all cgroup files seq_file -
single_open/release() are now used by default,
cgroup_seqfile_operations is dropped, and cgroup_file_operations uses
seq_read() for read.
This simplifies the code and makes the read path easy to convert to
use kernfs.
Note that, while cgroup_file_operations uses seq_read() for read, it
still uses generic_file_llseek() for seeking instead of seq_lseek().
This is different from cgroup_seqfile_operations but shouldn't break
anything and brings the seeking behavior aligned with kernfs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_write_X64() and cgroup_write_string() both implement about the
same buffering logic. Unify the two into cgroup_file_write() which
always allocates dynamic buffer for simplicity and uses kstrto*()
instead of simple_strto*().
This patch doesn't make any visible behavior changes except for
possibly different error value from kstrsto*().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
In preparation of conversion to kernfs, cgroup file handling is being
consolidated so that it can be easily mapped to the seq_file based
interface of kernfs.
After recent updates, ->read() and ->read_map() don't have any user
left and ->write() never had any user. Remove them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
For some reason, tasks and cgroup.procs guarantee that the result is
sorted. This is the only reason this whole pidlist logic is necessary
instead of just iterating through sorted member tasks. We can't do
anything about the existing interface but at least ensure that such
expectation doesn't exist for the new interface so that pidlist logic
may be removed in the distant future.
This patch scrambles the sort order if sane_behavior so that the
output is usually not sorted in the new interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
After the recent changes, pidlist ref is held only between
cgroup_pidlist_start() and cgroup_pidlist_stop() during which
cgroup->pidlist_mutex is also held. IOW, the reference count is
redundant now. While in use, it's always one and pidlist_mutex is
held - holding the mutex has exactly the same protection.
This patch collapses destroy_dwork queueing into cgroup_pidlist_stop()
so that pidlist_mutex is not released inbetween and drops
pidlist->use_count.
This patch shouldn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, pidlists are reference counted from file open and release
methods. This means that holding onto an open file may waste memory
and reads may return data which is very stale. Both aren't critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have large delay between open and reads.
cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show. This can be achieved by loading pidlist on demand from start
and release with time delay from stop, so that consecutive reads don't
end up reloading the pidlist on each iteration. This would remove the
need for hooking into open and release while also avoiding issues with
holding onto pidlist for too long.
The previous patches implemented delayed release and restructured
pidlist handling so that pidlists can be loaded and released from
seq_file start / stop. This patch actually moves pidlist load to
start and release to stop.
This means that pidlist is pinned only between start and stop and may
go away between two consecutive read calls if the two calls are apart
by more than CGROUP_PIDLIST_DESTROY_DELAY. cgroup_pidlist_start()
thus can't re-use the stored cgroup_pid_list_open_file->pidlist
directly. During start, it's only used as a hint indicating whether
this is the first start after open or not and pidlist is always looked
up or created.
pidlist_mutex locking and reference counting are moved out of
pidlist_array_load() so that pidlist_array_load() can perform lookup
and creation atomically. While this enlarges the area covered by
pidlist_mutex, given how the lock is used, it's highly unlikely to be
noticeable.
v2: Refreshed on top of the updated "cgroup: introduce struct
cgroup_pidlist_open_file".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_pidlist locking is needlessly complicated. It has outer
cgroup->pidlist_mutex to protect the list of pidlists associated with
a cgroup and then each pidlist has rwsem to synchronize updates and
reads. Given that the only read access is from seq_file operations
which are always invoked back-to-back, the rwsem is a giant overkill.
All it does is adding unnecessary complexity.
This patch removes cgroup_pidlist->rwsem and protects all accesses to
pidlists belonging to a cgroup with cgroup->pidlist_mutex.
pidlist->rwsem locking is removed if it's nested inside
cgroup->pidlist_mutex; otherwise, it's replaced with
cgroup->pidlist_mutex locking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Rename cgroup_pidlist_find() to cgroup_pidlist_find_create() and
separate out finding proper to cgroup_pidlist_find(). Also, move
locking to the caller.
This patch is preparation for pidlist restructure and doesn't
introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
For pidlist files, seq_file->private pointed to the loaded
cgroup_pidlist; however, pidlist loading is planned to be moved to
cgroup_pidlist_start() for kernfs conversion and seq_file->private
needs to carry more information from open to allow that.
This patch introduces struct cgroup_pidlist_open_file which contains
type, cgrp and pidlist and updates pidlist seq_file->private to point
to it using seq_open_private() and seq_release_private(). Note that
this eventually will be replaced by kernfs_open_file.
While this patch makes more information available to seq_file
operations, they don't use it yet and this patch doesn't introduce any
behavior changes except for allocation of the extra private struct.
v2: use __seq_open_private() instead of seq_open_private() for brevity
as suggested by Li.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, pidlists are reference counted from file open and release
methods. This means that holding onto an open file may waste memory
and reads may return data which is very stale. Both aren't critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have large delay between open and reads.
cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show. This can be achieved by loading pidlist on demand from start
and release with time delay from stop, so that consecutive reads don't
end up reloading the pidlist on each iteration. This would remove the
need for hooking into open and release while also avoiding issues with
holding onto pidlist for too long.
This patch implements delayed release of pidlist. As pidlists could
be lingering on cgroup removal waiting for the timer to expire, cgroup
free path needs to queue the destruction work item immediately and
flush. As those work items are self-destroying, each work item can't
be flushed directly. A new workqueue - cgroup_pidlist_destroy_wq - is
added to serve as flush domain.
Note that this patch just adds delayed release on top of the current
implementation and doesn't change where pidlist is loaded and
released. Following patches will make those changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Now that pidlist files don't use cftype->release(), it doesn't have
any user left. Remove it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, cgroup_pidlist_open() skips seq_open() and pidlist loading
if the file is opened write-only, which is a sensible optimization as
pidlist loading can be costly and there often are occasions where
tasks or cgroup.procs is opened write-only. However, pidlist init and
release are planned to be moved to cgroup_pidlist_start/stop()
respectively which would make this optimization unnecessary.
This patch removes the optimization and always fully initializes
pidlist files regardless of open mode. This will help moving pidlist
handling to start/stop by unifying rw paths and removes the need for
specifying cftype->release() in addition to .release in
cgroup_pidlist_operations as file->f_op is now always overridden. As
pidlist files were the only user of cftype->release(), the next patch
will remove the method.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Pull to receive e605b36575 ("cgroup: fix cgroup_subsys_state leak
for seq_files") as for-3.14 is scheduled to have a lot of changes
which depend on it.
Signed-off-by: Tejun Heo <tj@kernel.org>
If a cgroup file implements either read_map() or read_seq_string(),
such file is served using seq_file by overriding file->f_op to
cgroup_seqfile_operations, which also overrides the release method to
single_release() from cgroup_file_release().
Because cgroup_file_open() didn't use to acquire any resources, this
used to be fine, but since f7d58818ba ("cgroup: pin
cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
pins the css (cgroup_subsys_state) which is put by
cgroup_file_release(). The patch forgot to update the release path
for seq_files and each open/release cycle leaks a css reference.
Fix it by updating cgroup_file_release() to also handle seq_files and
using it for seq_file release path too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v3.12
Merge v3.12 based patch series to move cgroup_event implementation to
memcg into for-3.14. The following two commits cause a conflict in
kernel/cgroup.c
2ff2a7d03b ("cgroup: kill css_id")
79bd9814e5 ("cgroup, memcg: move cgroup_event implementation to memcg")
Each patch removes a struct definition from kernel/cgroup.c. As the
two are adjacent, they cause a context conflict. Easily resolved by
removing both structs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that cgroup_event is made memcg specific, the temporarily exported
functions are no longer necessary. Unexport cgroup_css() and remove
__file_cft() which doesn't have any user left.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
cgroup_event is being moved from cgroup core to memcg and the
implementation is already moved by the previous patch. This patch
moves the data fields and callbacks.
* cgroup->event_list[_lock] are moved to mem_cgroup.
* cftype->[un]register_event() are moved to cgroup_event. This makes
it impossible for individual cftype definitions to specify their
event callbacks. This is worked around by simply hard-coding
filename to event callback mapping in cgroup_write_event_control().
This is awkward and inflexible, which is actually desirable given
that we don't want to grow more usages of this feature.
* eventfd_ctx declaration is removed from cgroup.h, which makes
vmpressure.h miss eventfd_ctx declaration. Include eventfd.h from
vmpressure.h.
v2: Use file name from dentry instead of cftype. This will allow
removing all cftype handling in the function.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
cgroup_event is way over-designed and tries to build a generic
flexible event mechanism into cgroup - fully customizable event
specification for each user of the interface. This is utterly
unnecessary and overboard especially in the light of the planned
unified hierarchy as there's gonna be single agent. Simply generating
events at fixed points, or if that's too restrictive, configureable
cadence or single set of configureable points should be enough.
Thankfully, memcg is the only user and gets to keep it. Replacing it
with something simpler on sane_behavior is strongly recommended.
This patch moves cgroup_event and "cgroup.event_control"
implementation to mm/memcontrol.c. Clearing of events on cgroup
destruction is moved from cgroup_destroy_locked() to
mem_cgroup_css_offline(), which shouldn't make any noticeable
difference.
cgroup_css() and __file_cft() are exported to enable the move;
however, this will soon be reverted once the event code is updated to
be memcg specific.
Note that "cgroup.event_control" will now exist only on the hierarchy
with memcg attached to it. While this change is visible to userland,
it is unlikely to be noticeable as the file has never been meaningful
outside memcg.
Aside from the above change, this is pure code relocation.
v2: Per Li Zefan's comments, init/Kconfig updated accordingly and
poll.h inclusion moved from cgroup.c to memcontrol.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Since be44562613 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue. css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex. As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu(). If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue. This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose. As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a
separate core_initcall(). In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
Rename simple_delete_dentry() to always_delete_dentry() and export it.
Export simple_dentry_operations, while we are at it, and get rid of
their duplicates
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull cgroup changes from Tejun Heo:
"Not too much activity this time around. css_id is finally killed and
a minor update to device_cgroup"
* 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
device_cgroup: remove can_attach
cgroup: kill css_id
memcg: stop using css id
memcg: fail to create cgroup if the cgroup id is too big
memcg: convert to use cgroup id
memcg: convert to use cgroup_is_descendant()
Pull cgroup fixes from Tejun Heo:
"Two late fixes for cgroup.
One fixes descendant walk introduced during this rc1 cycle. The other
fixes a post 3.9 bug during task attach which can lead to hang. Both
fixes are critical and the fixes are relatively straight-forward"
* 'for-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix to break the while loop in cgroup_attach_task() correctly
cgroup: fix cgroup post-order descendant walk of empty subtree
Both Anjana and Eunki reported a stall in the while_each_thread loop
in cgroup_attach_task().
It's because, when we attach a single thread to a cgroup, if the cgroup
is exiting or is already in that cgroup, we won't break the loop.
If the task is already in the cgroup, the bug can lead to another thread
being attached to the cgroup unexpectedly:
# echo 5207 > tasks
# cat tasks
5207
# echo 5207 > tasks
# cat tasks
5207
5215
What's worse, if the task to be attached isn't the leader of the thread
group, we might never exit the loop, hence cpu stall. Thanks for Oleg's
analysis.
This bug was introduced by commit 081aa458c3
("cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()")
[ lizf: - fixed the first continue, pointed out by Oleg,
- rewrote changelog. ]
Cc: <stable@vger.kernel.org> # 3.9+
Reported-by: Eunki Kim <eunki_kim@samsung.com>
Reported-by: Anjana V Kumar <anjanavk12@gmail.com>
Signed-off-by: Anjana V Kumar <anjanavk12@gmail.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The only user of css_id was memcg, and it has been convered to use
cgroup->id, so kill css_id.
Signed-off-by: Li Zefan <lizefan@huwei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
bd8815a6d8 ("cgroup: make css_for_each_descendant() and friends
include the origin css in the iteration") updated cgroup descendant
iterators to include the origin css; unfortuantely, it forgot to drop
special case handling in css_next_descendant_post() for empty subtree
leading to failure to visit the origin css without any child.
Fix it by dropping the special case handling and always returning the
leftmost descendant on the first iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
kernel/cgroup.c is the only place in the tree that relies on eventfd.h
pulling file.h; move that include there. Switch from eventfd_fget()/fput()
to fdget()/fdput(), while we are at it - eventfd_ctx_fileget() will fail
on non-eventfd descriptors just fine, no need to do that check twice...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull cgroup updates from Tejun Heo:
"A lot of activities on the cgroup front. Most changes aren't visible
to userland at all at this point and are laying foundation for the
planned unified hierarchy.
- The biggest change is decoupling the lifetime management of css
(cgroup_subsys_state) from that of cgroup's. Because controllers
(cpu, memory, block and so on) will need to be dynamically enabled
and disabled, css which is the association point between a cgroup
and a controller may come and go dynamically across the lifetime of
a cgroup. Till now, css's were created when the associated cgroup
was created and stayed till the cgroup got destroyed.
Assumptions around this tight coupling permeated through cgroup
core and controllers. These assumptions are gradually removed,
which consists bulk of patches, and css destruction path is
completely decoupled from cgroup destruction path. Note that
decoupling of creation path is relatively easy on top of these
changes and the patchset is pending for the next window.
- cgroup has its own event mechanism cgroup.event_control, which is
only used by memcg. It is overly complex trying to achieve high
flexibility whose benefits seem dubious at best. Going forward,
new events will simply generate file modified event and the
existing mechanism is being made specific to memcg. This pull
request contains prepatory patches for such change.
- Various fixes and cleanups"
Fixed up conflict in kernel/cgroup.c as per Tejun.
* 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
cgroup: fix cgroup_css() invocation in css_from_id()
cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
cgroup: implement CFTYPE_NO_PREFIX
cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
cgroup: fix cgroup_write_event_control()
cgroup: fix subsystem file accesses on the root cgroup
cgroup: change cgroup_from_id() to css_from_id()
cgroup: use css_get() in cgroup_create() to check CSS_ROOT
cpuset: remove an unncessary forward declaration
cgroup: RCU protect each cgroup_subsys_state release
cgroup: move subsys file removal to kill_css()
cgroup: factor out kill_css()
cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
cgroup: replace cgroup->css_kill_cnt with ->nr_css
cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
cgroup: move cgroup->subsys[] assignment to online_css()
cgroup: reorganize css init / exit paths
cgroup: add __rcu modifier to cgroup->subsys[]
...
On 3.11-rc we are seeing cgroup directories left behind when they should
have been removed. Here's a trivial reproducer:
cd /sys/fs/cgroup/memory
mkdir parent parent/child; rmdir parent/child parent
rmdir: failed to remove `parent': Device or resource busy
It's because cgroup_destroy_locked() (step 1 of destruction) leaves
cgroup on parent's children list, letting cgroup_offline_fn() (step 2 of
destruction) remove it; but step 2 is run by work queue, which may not
yet have removed the children when parent destruction checks the list.
Fix that by checking through a non-empty list of children: if every one
of them has already been marked CGRP_DEAD, then it's safe to proceed:
those children are invisible to userspace, and should not obstruct rmdir.
(I didn't see any reason to keep the cgrp->children checks under the
unrelated css_set_lock, so moved them out.)
tj: Flattened nested ifs a bit and updated comment so that it's
correct on both for-3.11-fixes and for-3.12.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
ca8bdcaff0 ("cgroup: make cgroup_css() take cgroup_subsys * instead
and allow NULL subsys") missed one conversion in css_from_id(), which
was newly added. As css_from_id() doesn't have any user yet, this
doesn't break anything other than generating a build warning.
Convert it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
cgroup_event will be moved to its only user - memcg. Replace
__d_cgrp() usage with css_from_dir(), which is already exported. This
also simplifies the code a bit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Currently, each registered cgroup_event holds an extra reference to
the cgroup. This is a bit weird as events are subsystem specific and
will also be incorrect in the planned unified hierarchy as css
(cgroup_subsys_state) may come and go dynamically across the lifetime
of a cgroup. Holding onto cgroup won't prevent the target css from
going away.
Update cgroup_event to hold onto the css the traget file belongs to
instead of cgroup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
When cgroup files are created, cgroup core automatically prepends the
name of the subsystem as prefix. This patch adds CFTYPE_NO_ which
disables the automatic prefix. This is to work around historical
baggages and shouldn't be used for new files.
This will be used to move "cgroup.event_control" from cgroup core to
memcg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Glauber Costa <glommer@gmail.com>
cgroup_css() is no longer used in hot paths. Make it take struct
cgroup_subsys * and allow the users to specify NULL subsys to obtain
the dummy_css. This removes open-coded NULL subsystem testing in a
couple users and generally simplifies the code.
After this patch, css_from_dir() also allows NULL @ss and returns the
matching dummy_css. This behavior change doesn't affect its only user
- perf.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
cgroup_css_from_dir() will grow another user. In preparation, make
the following changes.
* All css functions are prefixed with just "css_", rename it to
css_from_dir().
* Take dentry * instead of file * as dentry is what ultimately
identifies a cgroup and file may not always be available. Note that
the function now checkes whether @dentry->d_inode is NULL as the
caller now may specify a negative dentry.
* Make it take cgroup_subsys * instead of integer subsys_id. This
simplifies the function and allows specifying no subsystem for
cgroup->dummy_css.
* Make return section a bit less verbose.
This patch doesn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
81eeaf0411 ("cgroup: make cftype->[un]register_event() deal with
cgroup_subsys_state inst ead of cgroup") updated the cftype event
methods to take @css (cgroup_subsys_state) instead of @cgroup;
however, it incorrectly used @css passed to
cgroup_write_event_control(), which the dummy_css for the cgroup as
the file is a cgroup core file. This leads to oops on event
registration.
Fix it by using the css matching the event target file. Note that
cgroup_write_event_control() now disallows cgroup core files from
being event sources. This is for simplicity and doesn't matter as
cgroup_event will be moved and made specific to memcg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
105347ba5 ("cgroup: make cgroup_file_open() rcu_read_lock() around
cgroup_css() and add cfent->css") added cfent->css to cache the
associted cgroup_subsys_state across file operations.
A cfent is associated with single css throughout its lifetime and the
origimal commit initialized the cache pointer during cgroup_add_file()
and verified that it matches the actual one in cgroup_file_open().
While this works fine for !root cgroups, it's broken for root cgroups
as files in a root cgroup are created before the css's are associated
with the cgroup and thus cgroup_css() call in cgroup_add_file()
returns NULL associating all cfents in the root cgroup with NULL css.
This makes cgroup_file_open() trigger WARN and fail with -ENODEV for
all !core subsystem files in the root cgroups.
There's no reason to initialize cfent->css separately from
cgroup_add_file(). As the association never changes,
cgroup_file_open() can set it unconditionally every time and
containing the logic in cgroup_file_open() makes more sense anyway as
the only reason it's necessary is file->private_data being already
occupied.
Fix it by setting cfent->css unconditionally from cgroup_file_open().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Now we want cgroup core to always provide the css to use to the
subsystems, so change this API to css_from_id().
Uninline css_from_id(), because it's getting bigger and cgroup_css()
has been unexported.
While at it, remove the #ifdef, and shuffle the order of the args.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup. To enable
such usages, css destruction is being decoupled from cgroup
destruction. Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup. When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup. After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path. css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period. css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup. To enable
such usages, css destruction is being decoupled from cgroup
destruction. This patch moves subsys file removal from
cgroup_destroy_locked() to kill_css().
While this changes the order of destruction operations, the changes
shouldn't be noticeable to cgroup subsystems or userland.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Factor out css ref killing from cgroup_destroy_locked() into
kill_css(). We're gonna add more to the path and the factored out
function will eventually be called from other places too.
While at it, replace open coded percpu_ref_get() with css_get() for
consistency. This shouldn't cause any functional difference as the
function is not used for root cgroups.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, css (cgroup_subsys_state) lifetime is tied to that of the
associated cgroup. css's are created when the associated cgroup is
created and destroyed when it gets destroyed. Also, individual css's
aren't RCU protected but the whole cgroup is. With the planned
unified hierarchy, css's will need to be dynamically created and
destroyed within the lifetime of a cgroup.
To enable such usages, this patch decouples css destruction from
cgroup destruction - offline_css() invocation and the final css_put()
are moved from cgroup_destroy_css_killed() to css_killed_work_fn().
Now each css is individually offlined and put as its reference count
is killed instead of waiting for all css's attached to the cgroup to
finish refcnt killing and then proceeding to offlining and putting
them together.
While this changes the order of destruction operations, the changes
shouldn't be noticeable to cgroup subsystems or userland.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, css (cgroup_subsys_state) lifetime is tied to that of the
associated cgroup. With the planned unified hierarchy, css's will be
dynamically created and destroyed within the lifetime of a cgroup. To
enable such usages, css's will be individually RCU protected instead
of being tied to the cgroup.
cgroup->css_kill_cnt is used during cgroup destruction to wait for css
reference count disable; however, this model doesn't work once css's
lifetimes are managed separately from cgroup's. This patch replaces
it with cgroup->nr_css which is an cgroup_mutex protected integer
counting the number of attached css's. The count is incremented from
online_css() and decremented after refcnt kill is confirmed. If the
count reaches zero and the cgroup is marked dead, the second stage of
cgroup destruction is kicked off. If a cgroup doesn't have any css
attached at the time of rmdir, cgroup_destroy_locked() now invokes the
second stage directly as no css kill confirmation would happen.
cgroup_offline_fn() - the second step of cgroup destruction - is
renamed to cgroup_destroy_css_killed() and now expects to be called
with cgroup_mutex held.
While this patch changes how css destruction is punted to work items,
it shouldn't change any visible behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
css (cgroup_subsys_state) offlining, which requires process context,
will be moved to ref kill confirmation. In preparation, bounce
css_killed handling through css->destroy_work.
css_ref_killed_fn() is renamed to css_killed_ref_fn() so that it's
consistent with the new css_killed_work_fn().
This patch adds an additional work item bouncing but doesn't change
the actual logic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, css (cgroup_subsys_state) lifetime is tied to that of the
associated cgroup. With the planned unified hierarchy, css's will be
dynamically created and destroyed within the lifetime of a cgroup. To
enable such usages, css's will be individually RCU protected instead
of being tied to the cgroup.
In preparation, this patch moves cgroup->subsys[] assignment from
init_css() to online_css(). As this means that a newly initialized
css should be remembered separately and that cgroup_css() returns NULL
between init and online, cgroup_create() is updated so that it stores
newly created css's in a local array css_ar[] and
cgroup_init/load_subsys() are updated to use local variable @css
instead of using cgroup_css(). This change also slightly simplifies
error path of cgroup_create().
While this patch changes when cgroup->subsys[] is initialized, this
change isn't visible to subsystems or userland.
v2: This patch wasn't updated accordingly after the previous "cgroup:
reorganize css init / exit paths" was updated leading to missing a
css_ar[] conversion in cgroup_create() and thus boot failure. Fix
it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
css (cgroup_subsys_state) lifetime management is about to be
restructured. In prepartion, make the following mostly trivial
changes.
* init_cgroup_css() is renamed to init_css() so that it's consistent
with other css handling functions.
* alloc_css_id(), online_css() and offline_css() updated to take @css
instead of cgroups and subsys IDs.
This patch doesn't make any functional changes.
v2: v1 merged two for_each_root_subsys() loops in cgroup_create() but
Li Zefan pointed out that it breaks error path. Dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
For the planned unified hierarchy, each css (cgroup_subsys_state) will
be RCU protected so that it can be created and destroyed individually
while allowing RCU accesses. Previous changes ensured that all
cgroup->subsys[] accesses use the cgroup_css() accessor. This patch
adds __rcu modifier to cgroup->subsys[], add matching RCU dereference
in cgroup_css() and convert all assignments to either
rcu_assign_pointer() or RCU_INIT_POINTER().
This change prepares for the actual RCUfication of css's and doesn't
introduce any visible behavior change. The conversion is verified
with sparse and all accesses are properly RCU annotated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
For the planned unified hierarchy, each css (cgroup_subsys_state) will
be RCU protected so that it can be created and destroyed individually
while allowing RCU accesses, and cgroup_css() will soon require either
holding cgroup_mutex or RCU read lock.
This patch updates cgroup_file_open() such that it acquires the
associated css under rcu_read_lock(). While cgroup_file_css() usages
in other file operations are safe due to the reference from open,
cgroup_css() wouldn't know that and will still trigger warnings. It'd
be cleanest to store the acquired css in file->prvidate_data for
further file operations but that's already used by seqfile. This
patch instead adds cfent->css to cache the associated css. Note that
while this field is initialized during cfe init, it should only be
considered valid while the file is open.
This patch doesn't change visible behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup->subsys[] will become RCU protected and thus all cgroup_css()
usages should either be under RCU read lock or cgroup_mutex. This
patch updates cgroup_css_from_dir() which returns the matching
cgroup_subsys_state given a directory file and subsys_id so that it
requires RCU read lock and updates its sole user
perf_cgroup_connect().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>