Commit Graph

887 Commits

Author SHA1 Message Date
Linus Torvalds 8ab293e3a1 Merge branch 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Three late fixes for cgroup: Two cpuset ones, one trivial and the
  other pretty obscure, and a cgroup core fix for a bug which impacts
  cgroup v2 namespace users"

* 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix invalid controller enable rejections with cgroup namespace
  cpuset: fix non static symbol warning
  cpuset: handle race between CPU hotplug and cpuset_hotplug_work
2016-09-27 16:43:11 -07:00
Tejun Heo 9157056da8 cgroup: fix invalid controller enable rejections with cgroup namespace
On the v2 hierarchy, "cgroup.subtree_control" rejects controller
enables if the cgroup has processes in it.  The enforcement of this
logic assumes that the cgroup wouldn't have any css_sets associated
with it if there are no tasks in the cgroup, which is no longer true
since a79a908fd2 ("cgroup: introduce cgroup namespaces").

When a cgroup namespace is created, it pins the css_set of the
creating task to use it as the root css_set of the namespace.  This
extra reference stays as long as the namespace is around and makes
"cgroup.subtree_control" think that the namespace root cgroup is not
empty even when it is and thus reject controller enables.

Fix it by making cgroup_subtree_control() walk and test emptiness of
each css_set instead of testing whether the list_head is empty.

While at it, update the comment of cgroup_task_count() to indicate
that the returned value may be higher than the number of tasks, which
has always been true due to temporary references and doesn't break
anything.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Evgeny Vereshchagin <evvers@ya.ru>
Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Cc: Aditya Kali <adityakali@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541
2016-09-23 16:55:49 -04:00
Johannes Weiner d979a39d72 cgroup: duplicate cgroup reference when cloning sockets
When a socket is cloned, the associated sock_cgroup_data is duplicated
but not its reference on the cgroup.  As a result, the cgroup reference
count will underflow when both sockets are destroyed later on.

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Linus Torvalds 574c7e2333 Merge branch 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull more cgroup updates from Tejun Heo:
 "I forgot to include the patches which got applied to for-4.7-fixes
  late during last cycle.

  Eric's three patches fix bugs introduced with the namespace support"

* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
  cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns
  cgroupns: Fix the locking in copy_cgroup_ns
2016-07-29 14:29:04 -07:00
Linus Torvalds 468fc7ed55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

 2) Make DSA binding more sane, from Andrew Lunn.

 3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

 4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

 5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface.  From Brenden Blanco and
    others.

 6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

 7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

 8) Simplify netlink conntrack entry layout, from Florian Westphal.

 9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

11) Support qdisc packet injection in pktgen, from John Fastabend.

12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

13) Add NV congestion control support to TCP, from Lawrence Brakmo.

14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

16) Support MPLS over IPV4, from Simon Horman.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
  xgene: Fix build warning with ACPI disabled.
  be2net: perform temperature query in adapter regardless of its interface state
  l2tp: Correctly return -EBADF from pppol2tp_getname.
  net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
  net: ipmr/ip6mr: update lastuse on entry change
  macsec: ensure rx_sa is set when validation is disabled
  tipc: dump monitor attributes
  tipc: add a function to get the bearer name
  tipc: get monitor threshold for the cluster
  tipc: make cluster size threshold for monitoring configurable
  tipc: introduce constants for tipc address validation
  net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
  MAINTAINERS: xgene: Add driver and documentation path
  Documentation: dtb: xgene: Add MDIO node
  dtb: xgene: Add MDIO node
  drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
  drivers: net: xgene: Use exported functions
  drivers: net: xgene: Enable MDIO driver
  drivers: net: xgene: Add backward compatibility
  drivers: net: phy: xgene: Add MDIO driver
  ...
2016-07-27 12:03:20 -07:00
Linus Torvalds b55b048718 Merge branch 'for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Nothing too exciting.

   - updates to the pids controller so that pid limit breaches can be
     noticed and monitored from userland.

   - cleanups and non-critical bug fixes"

* 'for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: remove duplicated include from cgroup.c
  cgroup: Use lld instead of ld when printing pids controller events_limit
  cgroup: Add pids controller event when fork fails because of pid limit
  cgroup: allow NULL return from ss->css_alloc()
  cgroup: remove unnecessary 0 check from css_from_id()
  cgroup: fix idr leak for the first cgroup root
2016-07-26 14:34:17 -07:00
Wei Yongjun 55094f5753 cgroup: remove duplicated include from cgroup.c
Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-07-19 14:28:04 -04:00
Eric W. Biederman 726a4994b0 cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
Unprivileged users can't use hierarchies if they create them as they do not
have privilieges to the root directory.

Which means the only thing a hiearchy created by an unprivileged user
is good for is expanding the number of cgroup links in every css_set,
which is a DOS attack.

We could allow hierarchies to be created in namespaces in the initial
user namespace.  Unfortunately there is only a single namespace for
the names of heirarchies, so that is likely to create more confusion
than not.

So do the simple thing and restrict hiearchy creation to the initial
cgroup namespace.

Cc: stable@vger.kernel.org
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-07-15 08:04:27 -04:00
Eric W. Biederman eedd0f4cbf cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns
In most code paths involving cgroup migration cgroup_threadgroup_rwsem
is taken.  There are two exceptions:

- remove_tasks_in_empty_cpuset calls cgroup_transfer_tasks
- vhost_attach_cgroups_work calls cgroup_attach_task_all

With cgroup_threadgroup_rwsem held it is guaranteed that cgroup_post_fork
and copy_cgroup_ns will reference the same css_set from the process calling
fork.

Without such an interlock there process after fork could reference one
css_set from it's new cgroup namespace and another css_set from
task->cgroups, which semantically is nonsensical.

Cc: stable@vger.kernel.org
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-07-15 07:56:38 -04:00
Eric W. Biederman 7bd8830875 cgroupns: Fix the locking in copy_cgroup_ns
If "clone(CLONE_NEWCGROUP...)" is called it results in a nice lockdep
valid splat.

In __cgroup_proc_write the lock ordering is:
     cgroup_mutex -- through cgroup_kn_lock_live
     cgroup_threadgroup_rwsem

In copy_process the guts of clone the lock ordering is:
     cgroup_threadgroup_rwsem -- through threadgroup_change_begin
     cgroup_mutex -- through copy_namespaces -- copy_cgroup_ns

lockdep reports some a different call chains for the first ordering of
cgroup_mutex and cgroup_threadgroup_rwsem but it is harder to trace.
This is most definitely deadlock potential under the right
circumstances.

Fix this by by skipping the cgroup_mutex and making the locking in
copy_cgroup_ns mirror the locking in cgroup_post_fork which also runs
during fork under the cgroup_threadgroup_rwsem.

Cc: stable@vger.kernel.org
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-07-15 07:56:32 -04:00
Martin KaFai Lau 1f3fe7ebf6 cgroup: Add cgroup_get_from_fd
Add a helper function to get a cgroup2 from a fd.  It will be
stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will
be introduced in the later patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:30:38 -04:00
Daniel Bristot de Oliveira 82d6489d0f cgroup: Disable IRQs while holding css_set_lock
While testing the deadline scheduler + cgroup setup I hit this
warning.

[  132.612935] ------------[ cut here ]------------
[  132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
[  132.612952] Modules linked in: (a ton of modules...)
[  132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
[  132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[  132.612982]  0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
[  132.612984]  0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
[  132.612985]  00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
[  132.612986] Call Trace:
[  132.612987]  <IRQ>  [<ffffffff813d229e>] dump_stack+0x63/0x85
[  132.612994]  [<ffffffff810a652b>] __warn+0xcb/0xf0
[  132.612997]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
[  132.612999]  [<ffffffff810a665d>] warn_slowpath_null+0x1d/0x20
[  132.613000]  [<ffffffff810aba5b>] __local_bh_enable_ip+0x6b/0x80
[  132.613008]  [<ffffffff817d6c8a>] _raw_write_unlock_bh+0x1a/0x20
[  132.613010]  [<ffffffff817d6c9e>] _raw_spin_unlock_bh+0xe/0x10
[  132.613015]  [<ffffffff811388ac>] put_css_set+0x5c/0x60
[  132.613016]  [<ffffffff8113dc7f>] cgroup_free+0x7f/0xa0
[  132.613017]  [<ffffffff810a3912>] __put_task_struct+0x42/0x140
[  132.613018]  [<ffffffff810e776a>] dl_task_timer+0xca/0x250
[  132.613027]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
[  132.613030]  [<ffffffff8111371e>] __hrtimer_run_queues+0xee/0x270
[  132.613031]  [<ffffffff81113ec8>] hrtimer_interrupt+0xa8/0x190
[  132.613034]  [<ffffffff81051a58>] local_apic_timer_interrupt+0x38/0x60
[  132.613035]  [<ffffffff817d9b0d>] smp_apic_timer_interrupt+0x3d/0x50
[  132.613037]  [<ffffffff817d7c5c>] apic_timer_interrupt+0x8c/0xa0
[  132.613038]  <EOI>  [<ffffffff81063466>] ? native_safe_halt+0x6/0x10
[  132.613043]  [<ffffffff81037a4e>] default_idle+0x1e/0xd0
[  132.613044]  [<ffffffff810381cf>] arch_cpu_idle+0xf/0x20
[  132.613046]  [<ffffffff810e8fda>] default_idle_call+0x2a/0x40
[  132.613047]  [<ffffffff810e92d7>] cpu_startup_entry+0x2e7/0x340
[  132.613048]  [<ffffffff81050235>] start_secondary+0x155/0x190
[  132.613049] ---[ end trace f91934d162ce9977 ]---

The warn is the spin_(lock|unlock)_bh(&css_set_lock) in the interrupt
context. Converting the spin_lock_bh to spin_lock_irq(save) to avoid
this problem - and other problems of sharing a spinlock with an
interrupt.

Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: cgroups@vger.kernel.org
Cc: stable@vger.kernel.org # 4.5+
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-06-23 17:23:12 -04:00
Tejun Heo e7e15b87f8 cgroup: allow NULL return from ss->css_alloc()
cgroup core expected css_alloc to return an ERR_PTR value on failure
and caused NULL deref if it returned NULL.  It's an easy mistake to
make from an alloc function and there's no ambiguity in what's being
indicated.  Update css_create() so that it interprets NULL return from
css_alloc as -ENOMEM.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-06-21 13:07:09 -04:00
Johannes Weiner d6ccc55e66 cgroup: remove unnecessary 0 check from css_from_id()
css_idr allocation starts at 1, so index 0 will never point to an
item. css_from_id() currently filters that before asking idr_find(),
but idr_find() would also just return NULL, so this is not needed.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-06-17 14:16:32 -04:00
Johannes Weiner 8c8a550218 cgroup: fix idr leak for the first cgroup root
The valid cgroup hierarchy ID range includes 0, so we can't filter for
positive numbers when freeing it, or it'll leak the first ID. No big
deal, just disruptive when reading the code.

The ID is freed during error handling and when the reference count
hits zero, so the double-free test is not necessary; remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-06-17 14:16:28 -04:00
Tejun Heo 8fa3b8d689 cgroup: set css->id to -1 during init
If percpu_ref initialization fails during css_create(), the free path
can end up trying to free css->id of zero.  As ID 0 is unused, it
doesn't cause a critical breakage but it does trigger a warning
message.  Fix it by setting css->id to -1 from init_and_link_css().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Wenwei Tao <ww.tao0320@gmail.com>
Fixes: 01e586598b ("cgroup: release css->id after css_free")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-06-16 17:59:35 -04:00
Wenwei Tao b00c52dae6 cgroup: remove redundant cleanup in css_create
When create css failed, before call css_free_rcu_fn, we remove the css
id and exit the percpu_ref, but we will do these again in
css_free_work_fn, so they are redundant.  Especially the css id, that
would cause problem if we remove it twice, since it may be assigned to
another css after the first remove.

tj: This was broken by two commits updating the free path without
    synchronizing the creation failure path.  This can be easily
    triggered by trying to create more than 64k memory cgroups.

Signed-off-by: Wenwei Tao <ww.tao0320@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Fixes: 9a1049da9b ("percpu-refcount: require percpu_ref to be exited explicitly")
Fixes: 01e586598b ("cgroup: release css->id after css_free")
Cc: stable@vger.kernel.org # v3.17+
2016-05-26 15:09:23 -04:00
Felipe Balbi 09be4c824e cgroup: fix compile warning
commit 4f41fc5962 ("cgroup, kernfs: make mountinfo
 show properly scoped path for cgroup namespaces")
 added the following compile warning:

kernel/cgroup.c: In function ‘cgroup_show_path’:
kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable]
  int len = 0, ret = 0;
               ^
fix it.

Fixes: 4f41fc5962 ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces")
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-12 11:05:27 -04:00
Serge E. Hallyn 4f41fc5962 cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:

When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.

Short explanation (courtesy of mkerrisk):

If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point.  Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.

Long version:

When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b.  The root dentry field of the mountinfo
entry will show '/a/b'.

 cat > /tmp/do1 << EOF
 mount -t cgroup -o freezer freezer /mnt
 grep freezer /proc/self/mountinfo
 EOF

 unshare -Gm  bash /tmp/do1
 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
 > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':

 grep freezer /proc/self/cgroup
 9:freezer:/

If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:

 mount --bind /sys/fs/cgroup/freezer/a/b /mnt
 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.

Example (by mkerrisk):

First, a little script to save some typing and verbiage:

echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
        awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:

2653
2653                         # Our shell
14254                        # cat(1)
        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer

Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):

Look at the state of the /proc files:

        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /sys/fs/cgroup/freezer

The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.

However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:

                                      # propagating to other mountns
        /proc/self/cgroup:      7:freezer:/
        mountinfo:              /a/b    /mnt/freezer

The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.

With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace.  So the same steps as above:

        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer
        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /../..  /sys/fs/cgroup/freezer
        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /mnt/freezer

cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
cgroup.procs           freezer.self_freezing    notify_on_release
3164
2653                   # First shell that placed in this cgroup
3164                   # Shell started by 'unshare'
14197                  # cat(1)

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 12:15:03 -04:00
Tejun Heo 5cf1cacb49 cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
Since e93ad19d05 ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write().  This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.

memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function.  This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex.  This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished.  cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem.  We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
2016-04-25 15:45:14 -04:00
Linus Torvalds 5518f66b5a Merge branch 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup namespace support from Tejun Heo:
 "These are changes to implement namespace support for cgroup which has
  been pending for quite some time now.  It is very straight-forward and
  only affects what part of cgroup hierarchies are visible.

  After unsharing, mounting a cgroup fs will be scoped to the cgroups
  the task belonged to at the time of unsharing and the cgroup paths
  exposed to userland would be adjusted accordingly"

* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix and restructure error handling in copy_cgroup_ns()
  cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
  Add FS_USERNS_FLAG to cgroup fs
  cgroup: Add documentation for cgroup namespaces
  cgroup: mount cgroupns-root when inside non-init cgroupns
  kernfs: define kernfs_node_dentry
  cgroup: cgroup namespace setns support
  cgroup: introduce cgroup namespaces
  sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
  kernfs: Add API to generate relative kernfs path
2016-03-21 10:05:13 -07:00
Arnd Bergmann cfe02a8a97 cgroup: avoid false positive gcc-6 warning
When all subsystems are disabled, gcc notices that cgroup_subsys_enabled_key
is a zero-length array and that any access to it must be out of bounds:

In file included from ../include/linux/cgroup.h:19:0,
                 from ../kernel/cgroup.c:31:
../kernel/cgroup.c: In function 'cgroup_add_cftypes':
../kernel/cgroup.c:261:53: error: array subscript is above array bounds [-Werror=array-bounds]
  return static_key_enabled(cgroup_subsys_enabled_key[ssid]);
                            ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
../include/linux/jump_label.h:271:40: note: in definition of macro 'static_key_enabled'
  static_key_count((struct static_key *)x) > 0;    \
                                        ^

We should never call the function in this particular case, so this is
not a bug. In order to silence the warning, this adds an explicit check
for the CGROUP_SUBSYS_COUNT==0 case.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-16 13:32:23 -07:00
Tejun Heo 2b021cbf3c cgroup: ignore css_sets associated with dead cgroups during migration
Before 2e91fa7f6d ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks.  As init_css_set is always valid,
this worked fine.

However, after 2e91fa7f6d, zombie tasks stay with the css_set it was
associated with at the time of death.  Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2.  After T
becomes a zombie, it would still remain associated with A and B.  If A
only contains zombie tasks, it can be removed.  On removal, A gets
marked offline but stays pinned until all zombies are drained.  At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.

 WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
 CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
 ...
 Call Trace:
  [<ffffffff8127e63c>] dump_stack+0x4e/0x82
  [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
  [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
  [<ffffffff810c33e1>] cgroup_get+0x121/0x160
  [<ffffffff810c349b>] link_css_set+0x7b/0x90
  [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
  [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
  [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
  [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
  [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
  [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
  [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
  [<ffffffff81151673>] __vfs_write+0x23/0xe0
  [<ffffffff81152494>] vfs_write+0xa4/0x1a0
  [<ffffffff811532d4>] SyS_write+0x44/0xa0
  [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f

It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration.  This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d ("cgroup: keep zombies associated with their original cgroups")
Cc: stable@vger.kernel.org # v4.4+
2016-03-16 13:31:46 -07:00
Tejun Heo f6d635ad34 cgroup: implement cgroup_subsys->implicit_on_dfl
Some controllers, perf_event for now and possibly freezer in the
future, don't really make sense to control explicitly through
"cgroup.subtree_control".  For example, the primary role of perf_event
is identifying the cgroups of tasks; however, because the controller
also keeps a small amount of state per cgroup, it can't be replaced
with simple cgroup membership tests.

This patch implements cgroup_subsys->implicit_on_dfl flag.  When set,
the controller is implicitly enabled on all cgroups on the v2
hierarchy so that utility type controllers such as perf_event can be
enabled and function transparently.

An implicit controller doesn't show up in "cgroup.controllers" or
"cgroup.subtree_control", is exempt from no internal process rule and
can be stolen from the default hierarchy even if there are non-root
csses.

v2: Reimplemented on top of the recent updates to css handling and
    subsystem rebinding.  Rebinding implicit subsystems is now a
    simple matter of exempting it from the busy subsystem check.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00
Tejun Heo e4857982f4 cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
Migration can be multi-target on the default hierarchy when a
controller is enabled - processes belonging to each child cgroup have
to be moved to the child cgroup itself to refresh css association.

This isn't a problem for cgroup_migrate_add_src() as each source
css_set still maps to single source and target cgroups; however,
cgroup_migrate_prepare_dst() is called once after all source css_sets
are added and thus might not have a single destination cgroup.  This
is currently worked around by specifying NULL for @dst_cgrp and using
the source's default cgroup as destination as the only multi-target
migration in use is self-targetting.  While this works, it's subtle
and clunky.

As all taget cgroups are already specified while preparing the source
css_sets, this clunkiness can easily be removed by recording the
target cgroup in each source css_set.  This patch adds
css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and
used by cgroup_migrate_prepare_dst().  This also makes migration code
ready for arbitrary multi-target migration.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00
Tejun Heo 37ff9f8f47 cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
On the default hierarchy, a migration can be multi-source and/or
multi-destination.  cgroup_taskest_migrate() used to incorrectly
assume single destination cgroup but the bug has been fixed by
1f7dd3e5a6 ("cgroup: fix handling of multi-destination migration
from subtree_control enabling").

Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used
to determine which subsystems are affected or which cgroup_root the
migration is taking place in.  As such, @dst_cgrp is misleading.  This
patch replaces @dst_cgrp with @root.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00
Tejun Heo 6c694c8825 cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
cgroup_migrate_prepare_dst() verifies whether the destination cgroup
is allowable; however, the test doesn't really belong there.  It's too
deep and common in the stack and as a result the test itself is gated
by another test.

Separate the test out into cgroup_may_migrate_to() and update
cgroup_attach_task() and cgroup_transfer_tasks() to perform the test
directly.  This doesn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:25 -05:00
Tejun Heo 58cdb1ceb1 cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
cgroup_update_dfl_csses() should move each task in the subtree to
self; however, it was incorrectly calling cgroup_migrate_add_src()
with the root of the subtree as @dst_cgrp.  Fortunately,
cgroup_migrate_add_src() currently uses @dst_cgrp only to determine
the hierarchy and the bug doesn't cause any actual breakages.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:25 -05:00
Tejun Heo 549626047d cgroup: update css iteration in cgroup_update_dfl_csses()
The existing sequences of operations ensure that the offlining csses
are drained before cgroup_update_dfl_csses(), so even though
cgroup_update_dfl_csses() uses css_for_each_descendant_pre() to walk
the target cgroups, it doesn't end up operating on dead cgroups.
Also, the function explicitly excludes the subtree root from
operation.

This is fragile and inconsistent with the rest of css update
operations.  This patch updates cgroup_update_dfl_csses() to use
cgroup_for_each_live_descendant_pre() instead and include the subtree
root.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:01 -05:00
Tejun Heo 04313591ae cgroup: allocate 2x cgrp_cset_links when setting up a new root
During prep, cgroup_setup_root() allocates cgrp_cset_links matching
the number of existing css_sets to later link the new root.  This is
fine for now as the only operation which can happen inbetween is
rebind_subsystems() and rebinding of empty subsystems doesn't create
new css_sets.

However, while not yet allowed, with the recent reimplementation,
rebind_subsystems() can rebind subsystems with descendant csses and
thus can create new css_sets.  This patch makes cgroup_setup_root()
allocate 2x of the existing css_sets so that later use of live
subsystem rebinding doesn't blow up.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:01 -05:00
Tejun Heo 5ced2518bd cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
cgroup_calc_subtree_ss_mask() currently takes @cgrp and
@subtree_control.  @cgrp is used for two purposes - to decide whether
it's for default hierarchy and the mask of available subsystems.  The
former doesn't matter as the results are the same regardless.  The
latter can be specified directly through a subsystem mask.

This patch makes cgroup_calc_subtree_ss_mask() perform the same
calculations for both default and legacy hierarchies and take
@this_ss_mask for available subsystems.  @cgrp is no longer used and
dropped.  This is to allow using the function in contexts where
available controllers can't be decided from the cgroup.

v2: cgroup_refres_subtree_ss_mask() is removed by a previous patch.
    Updated accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:01 -05:00
Tejun Heo 334c3679ec cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
rebind_subsystem() open codes quite a bit of css and interface file
manipulations.  It tries to be fail-safe but doesn't quite achieve it.
It can be greatly simplified by using the new css management helpers.
This patch reimplements rebind_subsytsems() using
cgroup_apply_control() and friends.

* The half-baked rollback on file creation failure is dropped.  It is
  an extremely cold path, failure isn't critical, and, aside from
  kernel bugs, the only reason it can fail is memory allocation
  failure which pretty much doesn't happen for small allocations.

* As cgroup_apply_control_disable() is now used to clean up root
  cgroup on rebind, make sure that it doesn't end up killing root
  csses.

* All callers of rebind_subsystems() are updated to use
  cgroup_lock_and_drain_offline() as the apply_control functions
  require drained subtree.

* This leaves cgroup_refresh_subtree_ss_mask() without any user.
  Removed.

* css_populate_dir() and css_clear_dir() no longer needs
  @cgrp_override parameter.  Dropped.

* While at it, add WARN_ON() to rebind_subsystem() calls which are
  expected to always succeed just in case.

While the rules visible to userland aren't changed, this
reimplementation not only simplifies rebind_subsystems() but also
allows it to disable and enable csses recursively.  This can be used
to implement more flexible rebinding.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:01 -05:00
Tejun Heo 03970d3c11 cgroup: use cgroup_apply_enable_control() in cgroup creation path
cgroup_create() manually updates control masks and creates child csses
which cgroup_mkdir() then manually populates.  Both can be simplified
by using cgroup_apply_enable_control() and friends.  The only catch is
that it calls css_populate_dir() with NULL cgroup->kn during
cgroup_create().  This is worked around by making the function noop on
NULL kn.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:00 -05:00
Tejun Heo 945ba19968 cgroup: combine cgroup_mutex locking and offline css draining
cgroup_drain_offline() is used to wait for csses being offlined to
uninstall itself from cgroup->subsys[] array so that new csses can be
installed.  The function's only user, cgroup_subtree_control_write(),
calls it after performing some checks and restarts the whole process
via restart_syscall() if draining has to release cgroup_mutex to wait.

This can be simplified by draining before other synchronized
operations so that there's nothing to restart.  This patch converts
cgroup_drain_offline() to cgroup_lock_and_drain_offline() which
performs both locking and draining and updates cgroup_kn_lock_live()
use it instead of cgroup_mutex() if requested.  This combined locking
and draining operations are easier to use and less error-prone.

While at it, add WARNs in control_apply functions which triggers if
the subtree isn't properly drained.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:00 -05:00
Tejun Heo f7b2814bb9 cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
Factor out cgroup_{apply|finalize}_control() so that control mask
update can be done in several simple steps.  This patch doesn't
introduce behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:58:00 -05:00
Tejun Heo 15a27c362d cgroup: introduce cgroup_{save|propagate|restore}_control()
While controllers are being enabled and disabled in
cgroup_subtree_control_write(), the original subsystem masks are
stashed in local variables so that they can be restored if the
operation fails in the middle.

This patch adds dedicated fields to struct cgroup to be used instead
of the local variables and implements functions to stash the current
values, propagate the changes and restore them recursively.  Combined
with the previous changes, this makes subsystem management operations
fully recursive and modularlized.  This will be used to expand cgroup
core functionalities.

While at it, remove now unused @css_enable and @css_disable from
cgroup_subtree_control_write().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:59 -05:00
Tejun Heo ce3f1d9d19 cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
The three factored out css management operations -
cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() -
only depend on the current state of the target cgroups and idempotent
and thus can be easily made to operate on the subtree instead of the
immediate children.

This patch introduces the iterators which walk live subtree and
converts the three functions to operate on the subtree including self
instead of the children.  While this leads to spurious walking and be
slightly more expensive, it will allow them to be used for wider scope
of operations.

Note that cgroup_drain_offline() now tests for whether a css is dying
before trying to drain it.  This is to avoid trying to drain live
csses as there can be mix of live and dying csses in a subtree unlike
children of the same parent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:59 -05:00
Tejun Heo bdb53bd797 cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
Factor out css enabling and showing into cgroup_apply_control_enable().

* Nest subsystem walk inside child walk.  The child walk will later be
  converted to subtree walk which is a bit more expensive.

* Instead of operating on the differential masks @css_enable, simply
  enable or show csses according to the current cgroup_control() and
  cgroup_ss_mask().  This leads to the same result and is simpler and
  more robust.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:59 -05:00
Tejun Heo 12b3bb6af8 cgroup: factor out cgroup_apply_control_disable() from cgroup_subtree_control_write()
Factor out css disabling and hiding into cgroup_apply_control_disable().

* Nest subsystem walk inside child walk.  The child walk will later be
  converted to subtree walk which is a bit more expensive.

* Instead of operating on the differential masks @css_enable and
  @css_disable, simply disable or hide csses according to the current
  cgroup_control() and cgroup_ss_mask().  This leads to the same
  result and is simpler and more robust.

* This allows error handling path to share the same code.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:59 -05:00
Tejun Heo 1b9b96a12b cgroup: factor out cgroup_drain_offline() from cgroup_subtree_control_write()
Factor out async css offline draining into cgroup_drain_offline().

* Nest subsystem walk inside child walk.  The child walk will later be
  converted to subtree walk which is a bit more expensive.

* Relocate the draining above subsystem mask preparation, which
  doesn't create any behavior differences but helps further
  refactoring.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:59 -05:00
Tejun Heo 5531dc915b cgroup: introduce cgroup_control() and cgroup_ss_mask()
When a controller is enabled and visible on a non-root cgroup is
determined by subtree_control and subtree_ss_mask of the parent
cgroup.  For a root cgroup, by the type of the hierarchy and which
controllers are attached to it.  Deciding the above on each usage is
fragile and unnecessarily complicates the users.

This patch introduces cgroup_control() and cgroup_ss_mask() which
calculate and return the [visibly] enabled subsyste mask for the
specified cgroup and conver the existing usages.

* cgroup_e_css() is restructured for simplicity.

* cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no
  longer need to distinguish root and non-root cases.

* With cgroup_control(), cgroup_controllers_show() can now handle both
  root and non-root cases.  cgroup_root_controllers_show() is removed.

v2: cgroup_control() updated to yield the correct result on v1
    hierarchies too.  cgroup_subtree_control_write() converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:58 -05:00
Tejun Heo a5bca21520 cgroup: factor out cgroup_create() out of cgroup_mkdir()
We're in the process of refactoring cgroup and css management paths to
separate them out to eventually allow cgroups which aren't visible
through cgroup fs.  This patch factors out cgroup_create() out of
cgroup_mkdir().  cgroup_create() contains all internal object creation
and initialization.  cgroup_mkdir() uses cgroup_create() to create the
internal cgroup and adds interface directory and file creation.

This patch doesn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:58 -05:00
Tejun Heo 195e9b6c4b cgroup: reorder operations in cgroup_mkdir()
Currently, operations to initialize internal objects and create
interface directory and files are intermixed in cgroup_mkdir().  We're
in the process of refactoring cgroup and css management paths to
separate them out to eventually allow cgroups which aren't visible
through cgroup fs.

This patch reorders operations inside cgroup_mkdir() so that interface
directory and file handling comes after internal object
initialization.  This will enable further refactoring.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:58 -05:00
Tejun Heo 88cb04b96a cgroup: explicitly track whether a cgroup_subsys_state is visible to userland
Currently, whether a css (cgroup_subsys_state) has its interface files
created is not tracked and assumed to change together with the owning
cgroup's lifecycle.  cgroup directory and interface creation is being
separated out from internal object creation to help refactoring and
eventually allow cgroups which are not visible through cgroupfs.

This patch adds CSS_VISIBLE to track whether a css has its interface
files created and perform management operations only when necessary
which helps decoupling interface file handling from internal object
lifecycle.  After this patch, all css interface file management
functions can be called regardless of the current state and will
achieve the expected result.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:58 -05:00
Tejun Heo 6cd0f5bbaf cgroup: separate out interface file creation from css creation
Currently, interface files are created when a css is created depending
on whether @visible is set.  This patch separates out the two into
separate steps to help code refactoring and eventually allow cgroups
which aren't visible through cgroup fs.

Move css_populate_dir() out of create_css() and drop @visible.  While
at it, rename the function to css_create() for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:58 -05:00
Tejun Heo 20b454a61f cgroup: suppress spurious de-populated events
During task migration, tasks may transfer between two css_sets which
are associated with the same cgroup.  If those tasks are the only
tasks in the cgroup, this currently triggers a spurious de-populated
event on the cgroup.

Fix it by bumping up populated count before bumping it down during
migration to ensure that it doesn't reach zero spuriously.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:57 -05:00
Tejun Heo 2378d8b8ba cgroup: re-hash init_css_set after subsystems are initialized
css_sets are hashed by their subsys[] contents and in cgroup_init()
init_css_set is hashed early, before subsystem inits, when all entries
in its subsys[] are NULL, so that cgroup_dfl_root initialization can
find and link to it.  As subsystems are initialized,
init_css_set.subsys[] is filled up but the hashing is never updated
making init_css_set hashed in the wrong place.  While incorrect, this
doesn't cause a critical failure as css_set management code would
create an identical css_set dynamically.

Fix it by rehashing init_css_set after subsystems are initialized.
While at it, drop unnecessary @key local variable.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2016-03-03 09:57:57 -05:00
Vladimir Davydov fa06235b8e cgroup: reset css on destruction
An associated css can be around for quite a while after a cgroup
directory has been removed. In general, it makes sense to reset it to
defaults so as not to worry about any remnants. For instance, memory
cgroup needs to reset memory.low, otherwise pages charged to a dead
cgroup might never get reclaimed. There's ->css_reset callback, which
would fit perfectly for the purpose. Currently, it's only called when a
subsystem is disabled in the unified hierarchy and there are other
subsystems dependant on it. Let's call it on css destruction as well.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-01 12:06:02 -05:00
Tejun Heo fa5ff8a1c4 cgroup: fix and restructure error handling in copy_cgroup_ns()
copy_cgroup_ns()'s error handling was broken and the attempt to fix it
d22025570e ("cgroup: fix alloc_cgroup_ns() error handling in
copy_cgroup_ns()") was broken too in that it ended up trying an
ERR_PTR() value.

There's only one place where copy_cgroup_ns() needs to perform cleanup
after failure.  Simplify and fix the error handling by removing the
goto's.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2016-02-29 16:22:52 -05:00
Xiubo Li 63253ad814 cgroup: fix a mistake in warning message
There is a mistake about the print format name:id <--> %d:%s, which
the name is 'char *' type and id is 'int' type.  Change "name:id" to
"id:name" instead to be consistent with "cgroup_subsys %d:%s".

Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-02-27 06:33:37 -05:00