// +build linux

package libcontainer

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"io/ioutil"
	"net"
	"os"
	"os/exec"
	"path/filepath"
	"reflect"
	"strings"
	"sync"
	"time"

	securejoin "github.com/cyphar/filepath-securejoin"
	"github.com/opencontainers/runc/libcontainer/cgroups"
	"github.com/opencontainers/runc/libcontainer/configs"
	"github.com/opencontainers/runc/libcontainer/intelrdt"
	"github.com/opencontainers/runc/libcontainer/system"
	"github.com/opencontainers/runc/libcontainer/utils"
"github.com/opencontainers/runtime-spec/specs-go"
|
2017-07-19 22:28:59 +08:00
|
|
|
|
2020-05-15 20:23:56 +08:00
|
|
|
"github.com/checkpoint-restore/go-criu/v4"
|
|
|
|
criurpc "github.com/checkpoint-restore/go-criu/v4/rpc"
|
2017-07-19 22:28:59 +08:00
|
|
|
"github.com/golang/protobuf/proto"
|
|
|
|
"github.com/sirupsen/logrus"
|
2015-10-17 23:35:36 +08:00
|
|
|
"github.com/vishvananda/netlink/nl"
|
2017-07-19 22:28:59 +08:00
|
|
|
"golang.org/x/sys/unix"
|
2014-10-31 06:08:28 +08:00
|
|
|
)
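
// stdioFdCount is the number of stdio descriptors (stdin, stdout and
// stderr) always passed to the container's init process; any extra files
// appended to cmd.ExtraFiles are numbered starting from stdioFdCount.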
const stdioFdCount = 3
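
// linuxContainer is the Linux implementation of the Container interface.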
type linuxContainer struct {
	id                   string
	root                 string
	config               *configs.Config
	cgroupManager        cgroups.Manager
	intelRdtManager      intelrdt.Manager
	initPath             string
	initArgs             []string
	initProcess          parentProcess
	initProcessStartTime uint64
	criuPath             string
	newuidmapPath        string
	newgidmapPath        string
	m                    sync.Mutex
	criuVersion          int
	state                containerState
	created              time.Time
}

// State represents a running container's state
type State struct {
	BaseState

	// Platform specific fields below here

	// Rootless is set to true if the container was started in rootless mode,
	// i.e. when BaseState.Config.RootlessEUID && BaseState.Config.RootlessCgroups.
	Rootless bool `json:"rootless"`

	// Paths to all the container's cgroups, as returned by (*cgroups.Manager).GetPaths.
	//
	// For cgroup v1, a key is the cgroup subsystem name, and the value is the path
	// to the cgroup for this subsystem.
	//
	// For the cgroup v2 unified hierarchy, the key is "" and the value is the unified path.
	CgroupPaths map[string]string `json:"cgroup_paths"`

	// NamespacePaths are filepaths to the container's namespaces. The key is the
	// namespace type and the value is the path.
	NamespacePaths map[configs.NamespaceType]string `json:"namespace_paths"`

	// Container's standard descriptors (std{in,out,err}), needed for checkpoint and restore
	ExternalDescriptors []string `json:"external_descriptors,omitempty"`

	// Intel RDT "resource control" filesystem path
	IntelRdtPath string `json:"intel_rdt_path"`
}

// Container is a libcontainer container object.
//
// Each container is thread-safe within the same process. Since a container can
// be destroyed by a separate process, any function may return that the container
// was not found.
type Container interface {
	BaseContainer

	// Methods below here are platform specific

	// Checkpoint checkpoints the running container's state to disk using the criu(8) utility.
	//
	// errors:
	// Systemerror - System error.
	Checkpoint(criuOpts *CriuOpts) error

	// Restore restores the checkpointed container to a running state using the criu(8) utility.
	//
	// errors:
	// Systemerror - System error.
	Restore(process *Process, criuOpts *CriuOpts) error

	// If the Container state is RUNNING or CREATED, sets the Container state to PAUSING and pauses
	// the execution of any user processes. Asynchronously, once the container has finished being
	// paused, the state is changed to PAUSED.
	// If the Container state is PAUSED, do nothing.
	//
	// errors:
	// ContainerNotExists - Container no longer exists,
	// ContainerNotRunning - Container not running or created,
	// Systemerror - System error.
	Pause() error

	// If the Container state is PAUSED, resumes the execution of any user processes in the
	// Container before setting the Container state to RUNNING.
	// If the Container state is RUNNING, do nothing.
	//
	// errors:
	// ContainerNotExists - Container no longer exists,
	// ContainerNotPaused - Container is not paused,
	// Systemerror - System error.
	Resume() error

	// NotifyOOM returns a read-only channel signaling when the container receives an OOM notification.
	//
	// errors:
	// Systemerror - System error.
	NotifyOOM() (<-chan struct{}, error)

	// NotifyMemoryPressure returns a read-only channel signaling when the container reaches a given pressure level.
	//
	// errors:
	// Systemerror - System error.
	NotifyMemoryPressure(level PressureLevel) (<-chan struct{}, error)
}

// ID returns the container's unique ID
func (c *linuxContainer) ID() string {
	return c.id
}

// Config returns the container's configuration
func (c *linuxContainer) Config() configs.Config {
	return *c.config
}
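
// Status returns the current status of the container.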
func (c *linuxContainer) Status() (Status, error) {
	c.m.Lock()
	defer c.m.Unlock()
	return c.currentStatus()
}
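
// State returns the platform-specific state of the container.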
func (c *linuxContainer) State() (*State, error) {
	c.m.Lock()
	defer c.m.Unlock()
	return c.currentState()
}
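
// OCIState returns the container's state in the OCI runtime-spec format.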
func (c *linuxContainer) OCIState() (*specs.State, error) {
	c.m.Lock()
	defer c.m.Unlock()
	return c.currentOCIState()
}
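
// Processes returns the PIDs of all processes inside the container, as
// reported by the container's cgroup.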
func (c *linuxContainer) Processes() ([]int, error) {
	var pids []int
	status, err := c.currentStatus()
	if err != nil {
		return pids, err
	}
	// for systemd cgroup, the unit's cgroup path is removed automatically once all of the container's processes have exited
	if status == Stopped && !c.cgroupManager.Exists() {
		return pids, nil
	}

	pids, err = c.cgroupManager.GetAllPids()
	if err != nil {
		return nil, newSystemErrorWithCause(err, "getting all container pids from cgroups")
	}
	return pids, nil
}
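
// Stats collects cgroup, Intel RDT, and network interface statistics for
// the container.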
func (c *linuxContainer) Stats() (*Stats, error) {
	var (
		err   error
		stats = &Stats{}
	)
	if stats.CgroupStats, err = c.cgroupManager.GetStats(); err != nil {
		return stats, newSystemErrorWithCause(err, "getting container stats from cgroups")
	}
	if c.intelRdtManager != nil {
		if stats.IntelRdtStats, err = c.intelRdtManager.GetStats(); err != nil {
			return stats, newSystemErrorWithCause(err, "getting container's Intel RDT stats")
		}
	}
	for _, iface := range c.config.Networks {
		switch iface.Type {
		case "veth":
			istats, err := getNetworkInterfaceStats(iface.HostInterfaceName)
			if err != nil {
				return stats, newSystemErrorWithCausef(err, "getting network stats for interface %q", iface.HostInterfaceName)
			}
			stats.Interfaces = append(stats.Interfaces, istats)
		}
	}
	return stats, nil
}
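
// Set updates the container's cgroup (and, if configured, Intel RDT)
// resources to match the given config, rolling back to the previous
// configuration on failure.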
func (c *linuxContainer) Set(config configs.Config) error {
	c.m.Lock()
	defer c.m.Unlock()
	status, err := c.currentStatus()
	if err != nil {
		return err
	}
	if status == Stopped {
		return newGenericError(errors.New("container not running"), ContainerNotRunning)
	}
	if err := c.cgroupManager.Set(&config); err != nil {
		// Set configs back
		if err2 := c.cgroupManager.Set(c.config); err2 != nil {
			logrus.Warnf("Setting back cgroup configs failed due to error: %v, your state.json and actual configs might be inconsistent.", err2)
		}
		return err
	}
	if c.intelRdtManager != nil {
		if err := c.intelRdtManager.Set(&config); err != nil {
			// Set configs back
			if err2 := c.intelRdtManager.Set(c.config); err2 != nil {
				logrus.Warnf("Setting back intelrdt configs failed due to error: %v, your state.json and actual configs might be inconsistent.", err2)
			}
			return err
		}
	}
	// After the config has been set successfully, update the stored config and state
	c.config = &config
	_, err = c.updateState(nil)
	return err
}
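
// Start starts a process inside the container. For an init process, this
// also creates the exec fifo used to synchronize the container's start.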
func (c *linuxContainer) Start(process *Process) error {
	c.m.Lock()
	defer c.m.Unlock()
	if process.Init {
		if err := c.createExecFifo(); err != nil {
			return err
		}
	}
	if err := c.start(process); err != nil {
		if process.Init {
			c.deleteExecFifo()
		}
		return err
	}
	return nil
}
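
// Run starts the process inside the container and, for an init process,
// immediately unblocks it (equivalent to Start followed by Exec).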
func (c *linuxContainer) Run(process *Process) error {
	if err := c.Start(process); err != nil {
		return err
	}
	if process.Init {
		return c.exec()
	}
	return nil
}
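
// Exec unblocks the container's init process: init writes to the exec fifo
// and waits for it to be drained before executing the user process.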
func (c *linuxContainer) Exec() error {
	c.m.Lock()
	defer c.m.Unlock()
	return c.exec()
}
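
// exec opens and drains the exec fifo, allowing the init process to
// proceed. It polls the init process every 100ms so that it can detect the
// process dying before the fifo was ever opened for writing.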
func (c *linuxContainer) exec() error {
	path := filepath.Join(c.root, execFifoFilename)
	pid := c.initProcess.pid()
	blockingFifoOpenCh := awaitFifoOpen(path)
	for {
		select {
		case result := <-blockingFifoOpenCh:
			return handleFifoResult(result)

		case <-time.After(time.Millisecond * 100):
			stat, err := system.Stat(pid)
			if err != nil || stat.State == system.Zombie {
				// The process might have started, run, and completed between
				// our 100ms timeout and the system.Stat() check. See if the
				// fifo exists and has data (a non-blocking open succeeds if
				// the writing process has completed).
				if err := handleFifoResult(fifoOpen(path, false)); err != nil {
					return errors.New("container process is already dead")
				}
				return nil
			}
		}
	}
}
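
// readFromExecFifo drains the exec fifo; an empty read means the fifo was
// already drained and the container has already been started.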
func readFromExecFifo(execFifo io.Reader) error {
	data, err := ioutil.ReadAll(execFifo)
	if err != nil {
		return err
	}
	if len(data) == 0 {
		return errors.New("cannot start an already running container")
	}
	return nil
}
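
// awaitFifoOpen opens the exec fifo for reading in a goroutine (the open
// blocks until the init process opens the fifo for writing) and delivers
// the result on the returned channel.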
func awaitFifoOpen(path string) <-chan openResult {
	fifoOpened := make(chan openResult)
	go func() {
		result := fifoOpen(path, true)
		fifoOpened <- result
	}()
	return fifoOpened
}
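
// fifoOpen opens the exec fifo read-only. With block set to false the open
// is non-blocking and succeeds even if no writer is present.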
func fifoOpen(path string, block bool) openResult {
	flags := os.O_RDONLY
	if !block {
		flags |= unix.O_NONBLOCK
	}
	f, err := os.OpenFile(path, flags, 0)
	if err != nil {
		return openResult{err: newSystemErrorWithCause(err, "open exec fifo for reading")}
	}
	return openResult{file: f}
}
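
// handleFifoResult drains the opened exec fifo and removes it from the
// container's state directory.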
func handleFifoResult(result openResult) error {
	if result.err != nil {
		return result.err
	}
	f := result.file
	defer f.Close()
	if err := readFromExecFifo(f); err != nil {
		return err
	}
	return os.Remove(f.Name())
}
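
// openResult carries the outcome of an asynchronous fifo open.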
type openResult struct {
	file *os.File
	err  error
}
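
// start creates and starts the parent process. For an init process it also
// transitions the container to the created state and runs any configured
// poststart hooks.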
func (c *linuxContainer) start(process *Process) error {
	parent, err := c.newParentProcess(process)
	if err != nil {
		return newSystemErrorWithCause(err, "creating new parent process")
	}
	parent.forwardChildLogs()
	if err := parent.start(); err != nil {
		// terminate the process to ensure that it is properly reaped
		if err := ignoreTerminateErrors(parent.terminate()); err != nil {
			logrus.Warn(err)
		}
		return newSystemErrorWithCause(err, "starting container process")
	}
	// generate a timestamp indicating when the container was started
	c.created = time.Now().UTC()
	if process.Init {
		c.state = &createdState{
			c: c,
		}
		state, err := c.updateState(parent)
		if err != nil {
			return err
		}
		c.initProcessStartTime = state.InitProcessStartTime

		if c.config.Hooks != nil {
			s, err := c.currentOCIState()
			if err != nil {
				return err
			}
			for i, hook := range c.config.Hooks.Poststart {
				if err := hook.Run(s); err != nil {
					if err := ignoreTerminateErrors(parent.terminate()); err != nil {
						logrus.Warn(err)
					}
					return newSystemErrorWithCausef(err, "running poststart hook %d", i)
				}
			}
		}
	}
	return nil
}
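
// Signal sends the given signal to the container's init process, or to all
// processes in the container's cgroup when all is true.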
func (c *linuxContainer) Signal(s os.Signal, all bool) error {
	c.m.Lock()
	defer c.m.Unlock()
	status, err := c.currentStatus()
	if err != nil {
		return err
	}
	if all {
		// for systemd cgroup, the unit's cgroup path is removed automatically once all of the container's processes have exited
		if status == Stopped && !c.cgroupManager.Exists() {
			return nil
		}
		return signalAllProcesses(c.cgroupManager, s)
	}
	// to avoid a PID reuse attack, only signal the init process directly
	// while the container is known to be alive
	if status == Running || status == Created || status == Paused {
		if err := c.initProcess.signal(s); err != nil {
			return newSystemErrorWithCause(err, "signaling init process")
		}
		return nil
	}
	return newGenericError(errors.New("container not running"), ContainerNotRunning)
}
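
// createExecFifo creates the exec fifo in the container's state directory,
// owned by the container's host root UID/GID.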
func (c *linuxContainer) createExecFifo() error {
	rootuid, err := c.Config().HostRootUID()
	if err != nil {
		return err
	}
	rootgid, err := c.Config().HostRootGID()
	if err != nil {
		return err
	}

	fifoName := filepath.Join(c.root, execFifoFilename)
	if _, err := os.Stat(fifoName); err == nil {
		return fmt.Errorf("exec fifo %s already exists", fifoName)
	}
	oldMask := unix.Umask(0000)
	if err := unix.Mkfifo(fifoName, 0622); err != nil {
		unix.Umask(oldMask)
		return err
	}
	unix.Umask(oldMask)
	return os.Chown(fifoName, rootuid, rootgid)
}
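
// deleteExecFifo removes the exec fifo from the container's state directory.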
func (c *linuxContainer) deleteExecFifo() {
	fifoName := filepath.Join(c.root, execFifoFilename)
	os.Remove(fifoName)
}

// includeExecFifo opens the container's execfifo as a pathfd, so that the
// container cannot access the statedir (and the FIFO itself remains
// un-opened). It then adds the FifoFd to the given exec.Cmd as an inherited
// fd, with _LIBCONTAINER_FIFOFD set to its fd number.
func (c *linuxContainer) includeExecFifo(cmd *exec.Cmd) error {
	fifoName := filepath.Join(c.root, execFifoFilename)
	fifoFd, err := unix.Open(fifoName, unix.O_PATH|unix.O_CLOEXEC, 0)
	if err != nil {
		return err
	}

	cmd.ExtraFiles = append(cmd.ExtraFiles, os.NewFile(uintptr(fifoFd), fifoName))
	cmd.Env = append(cmd.Env,
		fmt.Sprintf("_LIBCONTAINER_FIFOFD=%d", stdioFdCount+len(cmd.ExtraFiles)-1))
	return nil
}
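
// newParentProcess creates the init and log pipes, builds the command for
// the container process, and returns either a setns process (for `runc
// exec`) or a full init process.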
func (c *linuxContainer) newParentProcess(p *Process) (parentProcess, error) {
	parentInitPipe, childInitPipe, err := utils.NewSockPair("init")
	if err != nil {
		return nil, newSystemErrorWithCause(err, "creating new init pipe")
	}
	messageSockPair := filePair{parentInitPipe, childInitPipe}

	parentLogPipe, childLogPipe, err := os.Pipe()
	if err != nil {
		return nil, fmt.Errorf("unable to create the log pipe: %s", err)
	}
	logFilePair := filePair{parentLogPipe, childLogPipe}

	cmd := c.commandTemplate(p, childInitPipe, childLogPipe)
	if !p.Init {
		return c.newSetnsProcess(p, cmd, messageSockPair, logFilePair)
	}

	// We only set up fifoFd if we're not doing a `runc exec`. The historic
	// reason for this is that previously we would pass a dirfd that allowed
	// for container rootfs escape (and not doing it in `runc exec` avoided
	// that problem), but we no longer do that. However, there's no need to do
	// this for `runc exec` so we just keep it this way to be safe.
	if err := c.includeExecFifo(cmd); err != nil {
		return nil, newSystemErrorWithCause(err, "including execfifo in cmd.Exec setup")
	}
	return c.newInitProcess(p, cmd, messageSockPair, logFilePair)
}
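
// commandTemplate builds the exec.Cmd used to re-execute the init binary,
// wiring up stdio and passing inherited fds and their numbers to the child
// via the _LIBCONTAINER_* environment variables.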
func (c *linuxContainer) commandTemplate(p *Process, childInitPipe *os.File, childLogPipe *os.File) *exec.Cmd {
	cmd := exec.Command(c.initPath, c.initArgs[1:]...)
	cmd.Args[0] = c.initArgs[0]
	cmd.Stdin = p.Stdin
	cmd.Stdout = p.Stdout
	cmd.Stderr = p.Stderr
	cmd.Dir = c.config.Rootfs
	if cmd.SysProcAttr == nil {
		cmd.SysProcAttr = &unix.SysProcAttr{}
	}
	cmd.Env = append(cmd.Env, fmt.Sprintf("GOMAXPROCS=%s", os.Getenv("GOMAXPROCS")))
	cmd.ExtraFiles = append(cmd.ExtraFiles, p.ExtraFiles...)
	if p.ConsoleSocket != nil {
		cmd.ExtraFiles = append(cmd.ExtraFiles, p.ConsoleSocket)
		cmd.Env = append(cmd.Env,
			fmt.Sprintf("_LIBCONTAINER_CONSOLE=%d", stdioFdCount+len(cmd.ExtraFiles)-1),
		)
	}
	cmd.ExtraFiles = append(cmd.ExtraFiles, childInitPipe)
	cmd.Env = append(cmd.Env,
		fmt.Sprintf("_LIBCONTAINER_INITPIPE=%d", stdioFdCount+len(cmd.ExtraFiles)-1),
		fmt.Sprintf("_LIBCONTAINER_STATEDIR=%s", c.root),
	)

	cmd.ExtraFiles = append(cmd.ExtraFiles, childLogPipe)
	cmd.Env = append(cmd.Env,
		fmt.Sprintf("_LIBCONTAINER_LOGPIPE=%d", stdioFdCount+len(cmd.ExtraFiles)-1),
		fmt.Sprintf("_LIBCONTAINER_LOGLEVEL=%s", p.LogLevel),
	)

	// NOTE: when running a container with no PID namespace and the parent
	// process spawning the container is PID 1, the kernel delivers the
	// pdeathsig to the container's init process, for some reason, even
	// though the parent is still running.
	if c.config.ParentDeathSignal > 0 {
		cmd.SysProcAttr.Pdeathsig = unix.Signal(c.config.ParentDeathSignal)
	}
	return cmd
}
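
// newInitProcess collects the configured namespace paths and the bootstrap
// data for the new container, and bundles them with the cgroup and Intel
// RDT managers into an initProcess.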
func (c *linuxContainer) newInitProcess(p *Process, cmd *exec.Cmd, messageSockPair, logFilePair filePair) (*initProcess, error) {
	cmd.Env = append(cmd.Env, "_LIBCONTAINER_INITTYPE="+string(initStandard))
	nsMaps := make(map[configs.NamespaceType]string)
	for _, ns := range c.config.Namespaces {
		if ns.Path != "" {
			nsMaps[ns.Type] = ns.Path
		}
	}
	_, sharePidns := nsMaps[configs.NEWPID]
	data, err := c.bootstrapData(c.config.Namespaces.CloneFlags(), nsMaps)
	if err != nil {
		return nil, err
	}
	init := &initProcess{
		cmd:             cmd,
		messageSockPair: messageSockPair,
		logFilePair:     logFilePair,
		manager:         c.cgroupManager,
		intelRdtManager: c.intelRdtManager,
		config:          c.newInitConfig(p),
		container:       c,
		process:         p,
		bootstrapData:   data,
		sharePidns:      sharePidns,
	}
	// c.initProcess is set here so that OCIState can be called from within
	// initProcess.start()
	c.initProcess = init
	return init, nil
}
want to address that here.
I've also left the timing of the Prestart hooks alone, although the
spec calls for them to happen before start (not as part of creation)
[1,2]. Once the timing gets fixed we can drop the
initProcessStartTime hacks which initProcess.start currently needs.
I'm not sure why we trigger the prestart hooks in response to both
procReady and procHooks. But we've had two prestart rounds in
initProcess.start since 2f276498 (Move pre-start hooks after container
mounts, 2016-02-17, #568). I've left that alone too.
I really think we should have len() guards to avoid computing the
state when .Hooks is non-nil but the particular phase we're looking at
is empty. Aleksa, however, is adamantly against them [3] citing a
risk of sloppy copy/pastes causing the hook slice being len-guarded to
diverge from the hook slice being iterated over within the guard. I
think that ort of thing is very lo-risk, because:
* We shouldn't be copy/pasting this, right? DRY for the win :).
* There's only ever a few lines between the guard and the guarded
loop. That makes broken copy/pastes easy to catch in review.
* We should have test coverage for these. Guarding with the wrong
slice is certainly not the only thing you can break with a sloppy
copy/paste.
But I'm not a maintainer ;).
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart
[2]: https://github.com/opencontainers/runc/issues/1710
[3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570
Signed-off-by: W. Trevor King <wking@tremily.us>
2018-02-26 06:47:41 +08:00
|
|
|
}
|
|
|
|
c.initProcess = init
|
|
|
|
return init, nil
|
2015-02-07 13:12:27 +08:00
|
|
|
}

func (c *linuxContainer) newSetnsProcess(p *Process, cmd *exec.Cmd, messageSockPair, logFilePair filePair) (*setnsProcess, error) {
	cmd.Env = append(cmd.Env, "_LIBCONTAINER_INITTYPE="+string(initSetns))
	state, err := c.currentState()
	if err != nil {
		return nil, newSystemErrorWithCause(err, "getting container's current state")
	}
	// for setns process, we don't have to set cloneflags as the process namespaces
	// will only be set via setns syscall
	data, err := c.bootstrapData(0, state.NamespacePaths)
	if err != nil {
		return nil, err
	}
	return &setnsProcess{
		cmd:             cmd,
		cgroupPaths:     state.CgroupPaths,
		rootlessCgroups: c.config.RootlessCgroups,
		intelRdtPath:    state.IntelRdtPath,
		messageSockPair: messageSockPair,
		logFilePair:     logFilePair,
		config:          c.newInitConfig(p),
		process:         p,
		bootstrapData:   data,
		initProcessPid:  state.InitProcessPid,
	}, nil
}

func (c *linuxContainer) newInitConfig(process *Process) *initConfig {
	cfg := &initConfig{
		Config:           c.config,
		Args:             process.Args,
		Env:              process.Env,
		User:             process.User,
		AdditionalGroups: process.AdditionalGroups,
		Cwd:              process.Cwd,
		Capabilities:     process.Capabilities,
		PassedFilesCount: len(process.ExtraFiles),
		ContainerId:      c.ID(),
		NoNewPrivileges:  c.config.NoNewPrivileges,
		RootlessEUID:     c.config.RootlessEUID,
		RootlessCgroups:  c.config.RootlessCgroups,
		AppArmorProfile:  c.config.AppArmorProfile,
		ProcessLabel:     c.config.ProcessLabel,
		Rlimits:          c.config.Rlimits,
	}
	if process.NoNewPrivileges != nil {
		cfg.NoNewPrivileges = *process.NoNewPrivileges
	}
	if process.AppArmorProfile != "" {
		cfg.AppArmorProfile = process.AppArmorProfile
	}
	if process.Label != "" {
		cfg.ProcessLabel = process.Label
	}
	if len(process.Rlimits) > 0 {
		cfg.Rlimits = process.Rlimits
	}
	cfg.CreateConsole = process.ConsoleSocket != nil
	cfg.ConsoleWidth = process.ConsoleWidth
	cfg.ConsoleHeight = process.ConsoleHeight
	return cfg
}

func (c *linuxContainer) Destroy() error {
	c.m.Lock()
	defer c.m.Unlock()
	return c.state.destroy()
}

func (c *linuxContainer) Pause() error {
	c.m.Lock()
	defer c.m.Unlock()
	status, err := c.currentStatus()
	if err != nil {
		return err
	}
	switch status {
	case Running, Created:
		if err := c.cgroupManager.Freeze(configs.Frozen); err != nil {
			return err
		}
		return c.state.transition(&pausedState{
			c: c,
		})
	}
	return newGenericError(fmt.Errorf("container not running or created: %s", status), ContainerNotRunning)
}

func (c *linuxContainer) Resume() error {
	c.m.Lock()
	defer c.m.Unlock()
	status, err := c.currentStatus()
	if err != nil {
		return err
	}
	if status != Paused {
		return newGenericError(fmt.Errorf("container not paused"), ContainerNotPaused)
	}
	if err := c.cgroupManager.Freeze(configs.Thawed); err != nil {
		return err
	}
	return c.state.transition(&runningState{
		c: c,
	})
}

func (c *linuxContainer) NotifyOOM() (<-chan struct{}, error) {
	// XXX(cyphar): This requires cgroups.
	if c.config.RootlessCgroups {
		logrus.Warn("getting OOM notifications may fail if you don't have the full access to cgroups")
	}
	path := c.cgroupManager.Path("memory")
	if cgroups.IsCgroup2UnifiedMode() {
		return notifyOnOOMV2(path)
	}
	return notifyOnOOM(path)
}
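
// Illustrative usage sketch (not part of the original file; assumes a
// container handle named "container" obtained elsewhere):
//
//	ch, err := container.NotifyOOM()
//	if err != nil {
//		// handle error
//	}
//	<-ch // blocks until the kernel reports an OOM event for the container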

func (c *linuxContainer) NotifyMemoryPressure(level PressureLevel) (<-chan struct{}, error) {
	// XXX(cyphar): This requires cgroups.
	if c.config.RootlessCgroups {
		logrus.Warn("getting memory pressure notifications may fail if you don't have the full access to cgroups")
	}
	return notifyMemoryPressure(c.cgroupManager.Path("memory"), level)
}

var criuFeatures *criurpc.CriuFeatures

func (c *linuxContainer) checkCriuFeatures(criuOpts *CriuOpts, rpcOpts *criurpc.CriuOpts, criuFeat *criurpc.CriuFeatures) error {

	t := criurpc.CriuReqType_FEATURE_CHECK

	// make sure the features we are looking for are really not from
	// some previous check
	criuFeatures = nil

	req := &criurpc.CriuReq{
		Type: &t,
		// Theoretically this should not be necessary but CRIU
		// segfaults if Opts is empty.
		// Fixed in CRIU 2.12
		Opts:     rpcOpts,
		Features: criuFeat,
	}

	err := c.criuSwrk(nil, req, criuOpts, false, nil)
	if err != nil {
		logrus.Debugf("%s", err)
		return errors.New("CRIU feature check failed")
	}

	logrus.Debugf("Feature check says: %s", criuFeatures)
	missingFeatures := false

	// The outer if checks if the fields actually exist
	if (criuFeat.MemTrack != nil) &&
		(criuFeatures.MemTrack != nil) {
		// The inner if checks if they are set to true
		if *criuFeat.MemTrack && !*criuFeatures.MemTrack {
			missingFeatures = true
			logrus.Debugf("CRIU does not support MemTrack")
		}
	}

	// This needs to be repeated for every new feature check.
	// Is there a way to put this in a function? Reflection?
	if (criuFeat.LazyPages != nil) &&
		(criuFeatures.LazyPages != nil) {
		if *criuFeat.LazyPages && !*criuFeatures.LazyPages {
			missingFeatures = true
			logrus.Debugf("CRIU does not support LazyPages")
		}
	}

	if missingFeatures {
		return errors.New("CRIU is missing features")
	}

	return nil
}

func compareCriuVersion(criuVersion int, minVersion int) error {
	// simple function to perform the actual version compare
	if criuVersion < minVersion {
		return fmt.Errorf("CRIU version %d must be %d or higher", criuVersion, minVersion)
	}
	return nil
}

// checkCriuVersion checks that the CRIU version is greater than or equal to minVersion.
func (c *linuxContainer) checkCriuVersion(minVersion int) error {

	// If the version of criu has already been determined there is no need
	// to ask criu for the version again. Use the value from c.criuVersion.
	if c.criuVersion != 0 {
		return compareCriuVersion(c.criuVersion, minVersion)
	}

	criu := criu.MakeCriu()
	var err error
	c.criuVersion, err = criu.GetCriuVersion()
	if err != nil {
		return fmt.Errorf("CRIU version check failed: %s", err)
	}

	return compareCriuVersion(c.criuVersion, minVersion)
}
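
// Note on the version encoding used with checkCriuVersion, derived from the
// calls later in this file (checkCriuVersion(30000) for CRIU 3.0.0,
// checkCriuVersion(31100) for CRIU 3.11.0): minVersion packs
// major*10000 + minor*100 + patch, so e.g. CRIU 3.11.0 => 31100.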

const descriptorsFilename = "descriptors.json"

func (c *linuxContainer) addCriuDumpMount(req *criurpc.CriuReq, m *configs.Mount) {
	mountDest := m.Destination
	if strings.HasPrefix(mountDest, c.config.Rootfs) {
		mountDest = mountDest[len(c.config.Rootfs):]
	}

	extMnt := &criurpc.ExtMountMap{
		Key: proto.String(mountDest),
		Val: proto.String(mountDest),
	}
	req.Opts.ExtMnt = append(req.Opts.ExtMnt, extMnt)
}
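
// Illustrative example (paths are hypothetical): for a container with
// c.config.Rootfs = "/run/c1/rootfs" and a mount whose Destination is
// "/run/c1/rootfs/etc/hosts", the rootfs prefix is stripped above, so
// CRIU records the external mount as Key="/etc/hosts", Val="/etc/hosts".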

func (c *linuxContainer) addMaskPaths(req *criurpc.CriuReq) error {
	for _, path := range c.config.MaskPaths {
		fi, err := os.Stat(fmt.Sprintf("/proc/%d/root/%s", c.initProcess.pid(), path))
		if err != nil {
			if os.IsNotExist(err) {
				continue
			}
			return err
		}
		if fi.IsDir() {
			continue
		}

		extMnt := &criurpc.ExtMountMap{
			Key: proto.String(path),
			Val: proto.String("/dev/null"),
		}
		req.Opts.ExtMnt = append(req.Opts.ExtMnt, extMnt)
	}
	return nil
}
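
// Illustrative example (a typical masked path, not taken from this file):
// for a MaskPaths entry such as "/proc/kcore", the loop above tells CRIU
// to treat the file as an external mount backed by "/dev/null", matching
// how runc masks such files at container creation time.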

func (c *linuxContainer) handleCriuConfigurationFile(rpcOpts *criurpc.CriuOpts) {
	// CRIU will evaluate a configuration file starting with release 3.11.
	// Settings in the configuration file will overwrite RPC settings.
	// Look for annotations. The annotation 'org.criu.config'
	// specifies if CRIU should use a different, container-specific
	// configuration file.
	_, annotations := utils.Annotations(c.config.Labels)
	configFile, exists := annotations["org.criu.config"]
	if exists {
		// If the annotation 'org.criu.config' exists and is set
		// to a non-empty string, tell CRIU to use that as a
		// configuration file. If the file does not exist, CRIU
		// will just ignore it.
		if configFile != "" {
			rpcOpts.ConfigFile = proto.String(configFile)
		}
		// If 'org.criu.config' exists and is set to an empty
		// string, no runc-specific CRIU configuration file will
		// be set at all.
	} else {
		// If the annotation has not been found, specify
		// a default CRIU configuration file.
		rpcOpts.ConfigFile = proto.String("/etc/criu/runc.conf")
	}
}
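
// Illustrative example (the file name is hypothetical): a bundle can select
// a container-specific CRIU configuration file through the annotation read
// above by adding to its config.json:
//
//	"annotations": {
//		"org.criu.config": "/etc/criu/mycontainer.conf"
//	}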

func (c *linuxContainer) Checkpoint(criuOpts *CriuOpts) error {
	c.m.Lock()
	defer c.m.Unlock()

	// Checkpoint is unlikely to work if os.Geteuid() != 0 || system.RunningInUserNS().
	// (CLI prints a warning)
	// TODO(avagin): Figure out how to make this work nicely. CRIU 2.0 has
	//               support for doing unprivileged dumps, but the setup of
	//               rootless containers might make this complicated.

	// We are relying on the CRIU version RPC which was introduced with CRIU 3.0.0
	if err := c.checkCriuVersion(30000); err != nil {
		return err
	}

	if criuOpts.ImagesDirectory == "" {
		return errors.New("invalid directory to save checkpoint")
	}

	// Since a container can be C/R'ed multiple times,
	// the checkpoint directory may already exist.
	if err := os.Mkdir(criuOpts.ImagesDirectory, 0700); err != nil && !os.IsExist(err) {
		return err
	}

	if criuOpts.WorkDirectory == "" {
		criuOpts.WorkDirectory = filepath.Join(c.root, "criu.work")
	}

	if err := os.Mkdir(criuOpts.WorkDirectory, 0700); err != nil && !os.IsExist(err) {
		return err
	}

	workDir, err := os.Open(criuOpts.WorkDirectory)
	if err != nil {
		return err
	}
	defer workDir.Close()

	imageDir, err := os.Open(criuOpts.ImagesDirectory)
	if err != nil {
		return err
	}
	defer imageDir.Close()

	rpcOpts := criurpc.CriuOpts{
		ImagesDirFd:     proto.Int32(int32(imageDir.Fd())),
		WorkDirFd:       proto.Int32(int32(workDir.Fd())),
		LogLevel:        proto.Int32(4),
		LogFile:         proto.String("dump.log"),
		Root:            proto.String(c.config.Rootfs),
		ManageCgroups:   proto.Bool(true),
		NotifyScripts:   proto.Bool(true),
		Pid:             proto.Int32(int32(c.initProcess.pid())),
		ShellJob:        proto.Bool(criuOpts.ShellJob),
		LeaveRunning:    proto.Bool(criuOpts.LeaveRunning),
		TcpEstablished:  proto.Bool(criuOpts.TcpEstablished),
		ExtUnixSk:       proto.Bool(criuOpts.ExternalUnixConnections),
		FileLocks:       proto.Bool(criuOpts.FileLocks),
		EmptyNs:         proto.Uint32(criuOpts.EmptyNs),
		OrphanPtsMaster: proto.Bool(true),
		AutoDedup:       proto.Bool(criuOpts.AutoDedup),
		LazyPages:       proto.Bool(criuOpts.LazyPages),
	}

	c.handleCriuConfigurationFile(&rpcOpts)

	// If the container is running in a network namespace and has
	// a path to the network namespace configured, we will dump
	// that network namespace as an external namespace and we
	// will expect that the namespace exists during restore.
	// This basically means that CRIU will ignore the namespace
	// and expect it to be set up correctly.
	nsPath := c.config.Namespaces.PathOf(configs.NEWNET)
	if nsPath != "" {
		// For this to work we need at least criu 3.11.0 => 31100.
		// As there was already a successful version check we will
		// not error out if it fails. runc will just behave as it used
		// to do and ignore external network namespaces.
		err := c.checkCriuVersion(31100)
		if err == nil {
			// CRIU expects the information about an external namespace
			// like this: --external net[<inode>]:<key>
			// This <key> is always 'extRootNetNS'.
			var netns unix.Stat_t
			err = unix.Stat(nsPath, &netns)
			if err != nil {
				return err
			}
			criuExternal := fmt.Sprintf("net[%d]:extRootNetNS", netns.Ino)
			rpcOpts.External = append(rpcOpts.External, criuExternal)
		}
	}
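
	// Illustrative example (the inode number is hypothetical): for a
	// configured network namespace path such as "/run/netns/mynet" whose
	// inode is 4026532181, the string appended above is
	// "net[4026532181]:extRootNetNS", the RPC equivalent of passing
	// "--external net[4026532181]:extRootNetNS" on the CRIU command line.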

	// CRIU can use the cgroup freezer; when rpcOpts.FreezeCgroup
	// is not set, CRIU uses ptrace() to pause the processes.
	// Note cgroup v2 freezer is only supported since CRIU release 3.14.
	if !cgroups.IsCgroup2UnifiedMode() || c.checkCriuVersion(31400) == nil {
		if fcg := c.cgroupManager.Path("freezer"); fcg != "" {
			rpcOpts.FreezeCgroup = proto.String(fcg)
		}
	}

	// append optional criu opts, e.g., page-server and port
	if criuOpts.PageServer.Address != "" && criuOpts.PageServer.Port != 0 {
		rpcOpts.Ps = &criurpc.CriuPageServerInfo{
			Address: proto.String(criuOpts.PageServer.Address),
			Port:    proto.Int32(criuOpts.PageServer.Port),
		}
	}

	// pre-dump may need the parent image param to complete iterative migration
	if criuOpts.ParentImage != "" {
		rpcOpts.ParentImg = proto.String(criuOpts.ParentImage)
		rpcOpts.TrackMem = proto.Bool(true)
	}

	// append optional manage cgroups mode
	if criuOpts.ManageCgroupsMode != 0 {
		mode := criurpc.CriuCgMode(criuOpts.ManageCgroupsMode)
		rpcOpts.ManageCgroupsMode = &mode
	}

	var t criurpc.CriuReqType
	if criuOpts.PreDump {
		feat := criurpc.CriuFeatures{
			MemTrack: proto.Bool(true),
		}

		if err := c.checkCriuFeatures(criuOpts, &rpcOpts, &feat); err != nil {
			return err
		}

		t = criurpc.CriuReqType_PRE_DUMP
	} else {
		t = criurpc.CriuReqType_DUMP
	}

	if criuOpts.LazyPages {
		// lazy migration requested; check if criu supports it
		feat := criurpc.CriuFeatures{
			LazyPages: proto.Bool(true),
		}
		if err := c.checkCriuFeatures(criuOpts, &rpcOpts, &feat); err != nil {
			return err
		}

		if fd := criuOpts.StatusFd; fd != -1 {
			// check that the FD is valid
			flags, err := unix.FcntlInt(uintptr(fd), unix.F_GETFL, 0)
			if err != nil {
				return fmt.Errorf("invalid --status-fd argument %d: %w", fd, err)
			}
			// and writable
			if flags&unix.O_WRONLY == 0 {
				return fmt.Errorf("invalid --status-fd argument %d: not writable", fd)
			}

			if c.checkCriuVersion(31500) != nil {
				// For criu 3.15+, use notifications (see case "status-ready"
				// in criuNotifications). Otherwise, rely on criu status fd.
				rpcOpts.StatusFd = proto.Int32(int32(fd))
			}
		}
	}

	req := &criurpc.CriuReq{
		Type: &t,
		Opts: &rpcOpts,
	}

	// no need to dump all this in pre-dump
	if !criuOpts.PreDump {
		hasCgroupns := c.config.Namespaces.Contains(configs.NEWCGROUP)
		for _, m := range c.config.Mounts {
			switch m.Device {
			case "bind":
				c.addCriuDumpMount(req, m)
			case "cgroup":
				if cgroups.IsCgroup2UnifiedMode() || hasCgroupns {
					// real mount(s)
					continue
				}
				// a set of "external" bind mounts
				binds, err := getCgroupMounts(m)
				if err != nil {
					return err
				}
				for _, b := range binds {
					c.addCriuDumpMount(req, b)
				}
			}
		}

		if err := c.addMaskPaths(req); err != nil {
			return err
		}

		for _, node := range c.config.Devices {
			m := &configs.Mount{Destination: node.Path, Source: node.Path}
			c.addCriuDumpMount(req, m)
		}

		// Write the FD info to a file in the image directory
		fdsJSON, err := json.Marshal(c.initProcess.externalDescriptors())
		if err != nil {
			return err
		}

		err = ioutil.WriteFile(filepath.Join(criuOpts.ImagesDirectory, descriptorsFilename), fdsJSON, 0600)
		if err != nil {
			return err
		}
	}

	err = c.criuSwrk(nil, req, criuOpts, false, nil)
	if err != nil {
		return err
	}
	return nil
}
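
// Illustrative usage sketch (not part of the original file; the directory
// path is hypothetical):
//
//	opts := &CriuOpts{
//		ImagesDirectory: "/var/lib/my-checkpoints/c1",
//		LeaveRunning:    true,
//	}
//	if err := container.Checkpoint(opts); err != nil {
//		// handle error
//	}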

func (c *linuxContainer) addCriuRestoreMount(req *criurpc.CriuReq, m *configs.Mount) {
	mountDest := m.Destination
	if strings.HasPrefix(mountDest, c.config.Rootfs) {
		mountDest = mountDest[len(c.config.Rootfs):]
	}

	extMnt := &criurpc.ExtMountMap{
		Key: proto.String(mountDest),
		Val: proto.String(m.Source),
	}
	req.Opts.ExtMnt = append(req.Opts.ExtMnt, extMnt)
}

func (c *linuxContainer) restoreNetwork(req *criurpc.CriuReq, criuOpts *CriuOpts) {
	for _, iface := range c.config.Networks {
		switch iface.Type {
		case "veth":
			veth := new(criurpc.CriuVethPair)
			veth.IfOut = proto.String(iface.HostInterfaceName)
			veth.IfIn = proto.String(iface.Name)
			req.Opts.Veths = append(req.Opts.Veths, veth)
		case "loopback":
			// Do nothing
		}
	}
	for _, i := range criuOpts.VethPairs {
		veth := new(criurpc.CriuVethPair)
		veth.IfOut = proto.String(i.HostInterfaceName)
		veth.IfIn = proto.String(i.ContainerInterfaceName)
		req.Opts.Veths = append(req.Opts.Veths, veth)
	}
}

// makeCriuRestoreMountpoints makes the actual mountpoints for the
// restore using CRIU. This function is inspired by the code in
// rootfs_linux.go.
func (c *linuxContainer) makeCriuRestoreMountpoints(m *configs.Mount) error {
	switch m.Device {
	case "cgroup":
		// No mount point(s) need to be created:
		//
		// * for v1, mount points are saved by CRIU because
		//   /sys/fs/cgroup is a tmpfs mount
		//
		// * for v2, /sys/fs/cgroup is a real mount, but
		//   the mountpoint appears as soon as /sys is mounted
		return nil
	case "bind":
		// The prepareBindMount() function checks if source
		// exists. So it cannot be used for other filesystem types.
		if err := prepareBindMount(m, c.config.Rootfs); err != nil {
			return err
		}
	default:
		// for all other filesystems just create the mountpoints
		dest, err := securejoin.SecureJoin(c.config.Rootfs, m.Destination)
		if err != nil {
			return err
		}
		if err := checkProcMount(c.config.Rootfs, dest, ""); err != nil {
			return err
		}
		m.Destination = dest
		if err := os.MkdirAll(dest, 0755); err != nil {
			return err
		}
	}
	return nil
}

// isPathInPrefixList is a small function for CRIU restore to make sure
// mountpoints, which are on a tmpfs, are not created in the rootfs.
func isPathInPrefixList(path string, prefix []string) bool {
	for _, p := range prefix {
		if strings.HasPrefix(path, p+"/") {
			return true
		}
	}
	return false
}
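
// Worked example for the helper above: with prefix = []string{"/dev/shm"},
// isPathInPrefixList("/dev/shm/file", prefix) is true, while
// isPathInPrefixList("/dev/shmem", prefix) and
// isPathInPrefixList("/dev/shm", prefix) are both false, because only
// paths strictly below a listed prefix (note the appended "/") match.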

// prepareCriuRestoreMounts tries to set up the rootfs of the
// container to be restored in the same way runc does it for
// initial container creation. Even for a read-only rootfs container
// runc modifies the rootfs to add mountpoints which do not exist.
// This function also creates missing mountpoints as long as they
// are not on top of a tmpfs, as CRIU will restore tmpfs content anyway.
func (c *linuxContainer) prepareCriuRestoreMounts(mounts []*configs.Mount) error {
	// First get a list of all tmpfs mounts
	tmpfs := []string{}
	for _, m := range mounts {
		switch m.Device {
		case "tmpfs":
			tmpfs = append(tmpfs, m.Destination)
		}
	}
	// Now go through all mounts and create the mountpoints
	// if the mountpoints are not on a tmpfs, as CRIU will
	// restore the complete tmpfs content from its checkpoint.
	for _, m := range mounts {
		if !isPathInPrefixList(m.Destination, tmpfs) {
			if err := c.makeCriuRestoreMountpoints(m); err != nil {
				return err
			}
		}
	}
	return nil
}
func (c *linuxContainer) Restore(process *Process, criuOpts *CriuOpts) error {
|
2015-03-13 12:45:43 +08:00
|
|
|
c.m.Lock()
|
|
|
|
defer c.m.Unlock()
|
2016-04-23 21:39:42 +08:00
|
|
|
|
2018-07-03 04:22:38 +08:00
|
|
|
var extraFiles []*os.File
|
|
|
|
|
Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.
`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in a user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to execute runc as the root in userns, without mounting a writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-07-05 14:28:21 +08:00
|
|
|
// Restore is unlikely to work if os.Geteuid() != 0 || system.RunningInUserNS().
|
|
|
|
// (CLI prints a warning)
|
2016-04-23 21:39:42 +08:00
|
|
|
// TODO(avagin): Figure out how to make this work nicely. CRIU doesn't have
|
|
|
|
// support for unprivileged restore at the moment.
|
|
|
|
|
2020-05-15 20:23:56 +08:00
|
|
|
// We are relying on the CRIU version RPC which was introduced with CRIU 3.0.0
|
|
|
|
if err := c.checkCriuVersion(30000); err != nil {
|
2015-04-01 23:15:00 +08:00
|
|
|
return err
|
|
|
|
}
|
2015-04-19 09:28:40 +08:00
|
|
|
if criuOpts.WorkDirectory == "" {
|
|
|
|
criuOpts.WorkDirectory = filepath.Join(c.root, "criu.work")
|
|
|
|
}
|
2015-04-02 14:54:02 +08:00
|
|
|
// Since a container can be C/R'ed multiple times,
|
|
|
|
// the work directory may already exist.
|
2019-10-17 14:49:38 +08:00
|
|
|
if err := os.Mkdir(criuOpts.WorkDirectory, 0700); err != nil && !os.IsExist(err) {
|
2015-04-02 14:54:02 +08:00
|
|
|
return err
|
|
|
|
}
|
2015-04-19 09:28:40 +08:00
|
|
|
workDir, err := os.Open(criuOpts.WorkDirectory)
|
2015-04-02 14:54:02 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
defer workDir.Close()
|
2015-04-19 09:28:40 +08:00
|
|
|
if criuOpts.ImagesDirectory == "" {
|
2020-05-17 08:20:44 +08:00
|
|
|
return errors.New("invalid directory to restore checkpoint")
|
2015-04-19 09:28:40 +08:00
|
|
|
}
|
|
|
|
imageDir, err := os.Open(criuOpts.ImagesDirectory)
|
2015-03-26 19:20:59 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
defer imageDir.Close()
|
2015-04-22 13:12:41 +08:00
|
|
|
// CRIU has a few requirements for a root directory:
|
|
|
|
// * it must be a mount point
|
|
|
|
// * its parent must not be overmounted
|
|
|
|
// c.config.Rootfs is bind-mounted to a temporary directory
|
|
|
|
// to satisfy these requirements.
|
2015-04-16 19:15:02 +08:00
|
|
|
root := filepath.Join(c.root, "criu-root")
|
|
|
|
if err := os.Mkdir(root, 0755); err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
defer os.Remove(root)
|
|
|
|
root, err = filepath.EvalSymlinks(root)
|
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
2017-05-10 05:38:27 +08:00
|
|
|
err = unix.Mount(c.config.Rootfs, root, "", unix.MS_BIND|unix.MS_REC, "")
|
2015-04-16 19:15:02 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
2017-05-10 05:38:27 +08:00
|
|
|
defer unix.Unmount(root, unix.MNT_DETACH)
|
2015-03-26 19:20:59 +08:00
|
|
|
t := criurpc.CriuReqType_RESTORE
|
2015-07-21 02:25:22 +08:00
|
|
|
req := &criurpc.CriuReq{
|
2015-03-26 19:20:59 +08:00
|
|
|
Type: &t,
|
|
|
|
Opts: &criurpc.CriuOpts{
|
2017-03-02 16:02:15 +08:00
|
|
|
ImagesDirFd: proto.Int32(int32(imageDir.Fd())),
|
|
|
|
WorkDirFd: proto.Int32(int32(workDir.Fd())),
|
|
|
|
EvasiveDevices: proto.Bool(true),
|
|
|
|
LogLevel: proto.Int32(4),
|
|
|
|
LogFile: proto.String("restore.log"),
|
|
|
|
RstSibling: proto.Bool(true),
|
|
|
|
Root: proto.String(root),
|
|
|
|
ManageCgroups: proto.Bool(true),
|
|
|
|
NotifyScripts: proto.Bool(true),
|
|
|
|
ShellJob: proto.Bool(criuOpts.ShellJob),
|
|
|
|
ExtUnixSk: proto.Bool(criuOpts.ExternalUnixConnections),
|
|
|
|
TcpEstablished: proto.Bool(criuOpts.TcpEstablished),
|
|
|
|
FileLocks: proto.Bool(criuOpts.FileLocks),
|
|
|
|
EmptyNs: proto.Uint32(criuOpts.EmptyNs),
|
|
|
|
OrphanPtsMaster: proto.Bool(true),
|
2017-08-18 06:31:49 +08:00
|
|
|
AutoDedup: proto.Bool(criuOpts.AutoDedup),
|
2017-07-24 23:43:14 +08:00
|
|
|
LazyPages: proto.Bool(criuOpts.LazyPages),
|
2015-03-26 19:20:59 +08:00
|
|
|
},
|
2015-03-07 03:21:02 +08:00
|
|
|
}
|
2015-09-08 17:02:08 +08:00
|
|
|
|
2018-11-17 03:42:09 +08:00
|
|
|
c.handleCriuConfigurationFile(req.Opts)
|
|
|
|
|
2018-07-03 04:22:38 +08:00
|
|
|
// Same as during checkpointing. If the container has a specific network namespace
|
|
|
|
// assigned to it, this now expects that the checkpoint will be restored in an
|
|
|
|
// already created network namespace.
|
|
|
|
nsPath := c.config.Namespaces.PathOf(configs.NEWNET)
|
|
|
|
if nsPath != "" {
|
|
|
|
// For this to work we need at least criu 3.11.0 => 31100.
|
|
|
|
// As there was already a successful version check we will
|
|
|
|
// not error out if it fails. runc will just behave as it used
|
|
|
|
// to do and ignore external network namespaces.
|
|
|
|
err := c.checkCriuVersion(31100)
|
|
|
|
if err == nil {
|
|
|
|
// CRIU wants the information about an existing network namespace
|
|
|
|
// like this: --inherit-fd fd[<fd>]:<key>
|
|
|
|
// The <key> needs to be the same as during checkpointing.
|
|
|
|
// We always use 'extRootNetNS' as the key for this.
|
|
|
|
netns, err := os.Open(nsPath)
|
|
|
|
if err != nil {
|
2018-10-13 18:39:08 +08:00
|
|
|
logrus.Errorf("If a specific network namespace is defined it must exist: %s", err)
|
2018-07-03 04:22:38 +08:00
|
|
|
return fmt.Errorf("requested network namespace %v does not exist", nsPath)
|
|
|
|
}
|
2020-04-16 09:33:20 +08:00
|
|
|
defer netns.Close()
|
2018-07-03 04:22:38 +08:00
|
|
|
inheritFd := new(criurpc.InheritFd)
|
|
|
|
inheritFd.Key = proto.String("extRootNetNS")
|
|
|
|
// The offset of four is necessary because 0, 1, 2 and 3 are already
|
|
|
|
// used by stdin, stdout, stderr and the 'criu swrk' socket.
|
|
|
|
inheritFd.Fd = proto.Int32(int32(4 + len(extraFiles)))
|
|
|
|
req.Opts.InheritFd = append(req.Opts.InheritFd, inheritFd)
|
|
|
|
// All open FDs need to be transferred to CRIU via extraFiles
|
|
|
|
extraFiles = append(extraFiles, netns)
|
|
|
|
}
|
|
|
|
}
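// To illustrate the fd numbering above (assuming the netns file is the only
// entry in extraFiles): cmd.ExtraFiles[0] is the 'criu swrk' socket on fd 3,
// so the netns file lands on fd 4 and CRIU is effectively told
// --inherit-fd fd[4]:extRootNetNS.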
|
|
|
|
|
2019-01-31 00:29:04 +08:00
|
|
|
// This will modify the rootfs of the container in the same way runc
|
|
|
|
// modifies the container during initial creation.
|
|
|
|
if err := c.prepareCriuRestoreMounts(c.config.Mounts); err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
|
2020-04-21 09:06:59 +08:00
|
|
|
hasCgroupns := c.config.Namespaces.Contains(configs.NEWCGROUP)
|
2015-03-07 03:21:02 +08:00
|
|
|
for _, m := range c.config.Mounts {
|
2015-07-21 02:25:22 +08:00
|
|
|
switch m.Device {
|
|
|
|
case "bind":
|
|
|
|
c.addCriuRestoreMount(req, m)
|
|
|
|
case "cgroup":
|
2020-04-21 09:06:59 +08:00
|
|
|
if cgroups.IsCgroup2UnifiedMode() || hasCgroupns {
|
2020-03-28 01:44:59 +08:00
|
|
|
continue
|
|
|
|
}
|
2020-04-21 09:06:59 +08:00
|
|
|
// cgroup v1 is a set of bind mounts, unless cgroupns is used
|
2015-07-21 02:25:22 +08:00
|
|
|
binds, err := getCgroupMounts(m)
|
|
|
|
if err != nil {
|
|
|
|
return err
|
2015-04-29 00:49:44 +08:00
|
|
|
}
|
2015-07-21 02:25:22 +08:00
|
|
|
for _, b := range binds {
|
|
|
|
c.addCriuRestoreMount(req, b)
|
|
|
|
}
|
2015-03-07 03:21:02 +08:00
|
|
|
}
|
|
|
|
}
|
2016-03-29 05:41:50 +08:00
|
|
|
|
2016-10-13 09:15:18 +08:00
|
|
|
if len(c.config.MaskPaths) > 0 {
|
|
|
|
m := &configs.Mount{Destination: "/dev/null", Source: "/dev/null"}
|
|
|
|
c.addCriuRestoreMount(req, m)
|
|
|
|
}
|
|
|
|
|
|
|
|
for _, node := range c.config.Devices {
|
|
|
|
m := &configs.Mount{Destination: node.Path, Source: node.Path}
|
|
|
|
c.addCriuRestoreMount(req, m)
|
|
|
|
}
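// Like the bind mounts above, the masked paths (bind-mounted from /dev/null)
// and the device nodes have sources outside the dumped mount namespace, so
// they are registered with CRIU as external mounts via addCriuRestoreMount.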
|
|
|
|
|
2017-05-10 05:38:27 +08:00
|
|
|
if criuOpts.EmptyNs&unix.CLONE_NEWNET == 0 {
|
2016-03-29 05:41:50 +08:00
|
|
|
c.restoreNetwork(req, criuOpts)
|
2015-08-25 00:26:39 +08:00
|
|
|
}
|
2015-04-29 00:49:44 +08:00
|
|
|
|
2015-08-06 23:14:59 +08:00
|
|
|
// append optional manage cgroups mode
|
|
|
|
if criuOpts.ManageCgroupsMode != 0 {
|
2016-02-19 06:08:23 +08:00
|
|
|
mode := criurpc.CriuCgMode(criuOpts.ManageCgroupsMode)
|
|
|
|
req.Opts.ManageCgroupsMode = &mode
|
2015-08-06 23:14:59 +08:00
|
|
|
}
|
|
|
|
|
2015-05-05 02:25:43 +08:00
|
|
|
var (
|
|
|
|
fds []string
|
|
|
|
fdJSON []byte
|
|
|
|
)
|
2015-08-05 05:44:45 +08:00
|
|
|
if fdJSON, err = ioutil.ReadFile(filepath.Join(criuOpts.ImagesDirectory, descriptorsFilename)); err != nil {
|
2015-04-29 00:49:44 +08:00
|
|
|
return err
|
|
|
|
}
|
|
|
|
|
2015-10-03 02:16:50 +08:00
|
|
|
if err := json.Unmarshal(fdJSON, &fds); err != nil {
|
2015-04-29 00:49:44 +08:00
|
|
|
return err
|
|
|
|
}
|
2015-04-29 23:26:18 +08:00
|
|
|
for i := range fds {
|
2015-04-29 00:49:44 +08:00
|
|
|
if s := fds[i]; strings.Contains(s, "pipe:") {
|
2015-03-26 19:20:59 +08:00
|
|
|
inheritFd := new(criurpc.InheritFd)
|
|
|
|
inheritFd.Key = proto.String(s)
|
2015-04-29 23:26:18 +08:00
|
|
|
inheritFd.Fd = proto.Int32(int32(i))
|
2015-03-26 19:20:59 +08:00
|
|
|
req.Opts.InheritFd = append(req.Opts.InheritFd, inheritFd)
|
2015-03-19 11:22:21 +08:00
|
|
|
}
|
|
|
|
}
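// A sketch of what descriptorsFilename (descriptors.json, written at
// checkpoint time) may contain -- the init process's stdio fd targets,
// indexed by fd number:
//
//	["/dev/null", "pipe:[143581]", "pipe:[143582]"]
//
// Only the "pipe:..." entries are re-wired here: restored fd i is inherited
// from fd i of the new runc process, keyed by the pipe identifier.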
|
2018-07-03 04:22:38 +08:00
|
|
|
return c.criuSwrk(process, req, criuOpts, true, extraFiles)
|
2015-04-10 23:18:16 +08:00
|
|
|
}
|
|
|
|
|
2015-09-08 17:02:08 +08:00
|
|
|
func (c *linuxContainer) criuApplyCgroups(pid int, req *criurpc.CriuReq) error {
|
2016-04-23 21:39:42 +08:00
|
|
|
// XXX: Do we need to deal with this case? AFAIK criu still requires root.
|
2015-09-08 17:02:08 +08:00
|
|
|
if err := c.cgroupManager.Apply(pid); err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
|
2017-04-07 07:34:41 +08:00
|
|
|
if err := c.cgroupManager.Set(c.config); err != nil {
|
|
|
|
return newSystemError(err)
|
|
|
|
}
|
|
|
|
|
2015-09-08 17:02:08 +08:00
|
|
|
path := fmt.Sprintf("/proc/%d/cgroup", pid)
|
|
|
|
cgroupsPaths, err := cgroups.ParseCgroupFile(path)
|
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
|
|
|
|
for c, p := range cgroupsPaths {
|
|
|
|
cgroupRoot := &criurpc.CgroupRoot{
|
|
|
|
Ctrl: proto.String(c),
|
|
|
|
Path: proto.String(p),
|
|
|
|
}
|
|
|
|
req.Opts.CgRoot = append(req.Opts.CgRoot, cgroupRoot)
|
|
|
|
}
|
|
|
|
|
|
|
|
return nil
|
|
|
|
}
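// For illustration, cgroups.ParseCgroupFile maps /proc/<pid>/cgroup lines such as
//
//	4:memory:/user.slice/mycontainer
//	2:cpu,cpuacct:/user.slice/mycontainer
//
// to {"memory": "...", "cpu": "...", "cpuacct": "..."}; each controller/path
// pair becomes a CgroupRoot so the restored tasks land in their old cgroups.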
|
|
|
|
|
2018-07-03 04:22:38 +08:00
|
|
|
func (c *linuxContainer) criuSwrk(process *Process, req *criurpc.CriuReq, opts *CriuOpts, applyCgroups bool, extraFiles []*os.File) error {
|
2017-05-10 05:38:27 +08:00
|
|
|
fds, err := unix.Socketpair(unix.AF_LOCAL, unix.SOCK_SEQPACKET|unix.SOCK_CLOEXEC, 0)
|
2015-04-10 23:18:16 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
|
2017-07-27 02:05:01 +08:00
|
|
|
var logPath string
|
|
|
|
if opts != nil {
|
|
|
|
logPath = filepath.Join(opts.WorkDirectory, req.GetOpts().GetLogFile())
|
|
|
|
} else {
|
|
|
|
// For the VERSION RPC 'opts' is set to 'nil' and therefore
|
|
|
|
// opts.WorkDirectory does not exist. Set logPath to "".
|
|
|
|
logPath = ""
|
|
|
|
}
|
2015-04-10 23:18:16 +08:00
|
|
|
criuClient := os.NewFile(uintptr(fds[0]), "criu-transport-client")
|
2017-03-02 16:02:15 +08:00
|
|
|
criuClientFileCon, err := net.FileConn(criuClient)
|
|
|
|
criuClient.Close()
|
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
|
|
|
|
criuClientCon := criuClientFileCon.(*net.UnixConn)
|
|
|
|
defer criuClientCon.Close()
|
|
|
|
|
2015-04-10 23:18:16 +08:00
|
|
|
criuServer := os.NewFile(uintptr(fds[1]), "criu-transport-server")
|
|
|
|
defer criuServer.Close()
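// SOCK_SEQPACKET preserves message boundaries, so each read on criuClientCon
// yields exactly one protobuf-encoded CRIU response; wrapping the fd in a
// *net.UnixConn is also what later lets ReadMsgUnix receive ancillary data
// (SCM_RIGHTS), e.g. the orphaned pts master.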
|
|
|
|
|
2015-03-26 19:20:59 +08:00
|
|
|
args := []string{"swrk", "3"}
|
2017-07-27 02:05:01 +08:00
|
|
|
if c.criuVersion != 0 {
|
|
|
|
// If the CRIU Version is still '0' then this is probably
|
|
|
|
// the initial CRIU run to detect the version. Skip it.
|
|
|
|
logrus.Debugf("Using CRIU %d at: %s", c.criuVersion, c.criuPath)
|
|
|
|
}
|
2015-08-31 19:34:14 +08:00
|
|
|
logrus.Debugf("Using CRIU with the following args: %s", args)
|
2015-03-13 12:45:43 +08:00
|
|
|
cmd := exec.Command(c.criuPath, args...)
|
2015-04-10 23:19:14 +08:00
|
|
|
if process != nil {
|
|
|
|
cmd.Stdin = process.Stdin
|
|
|
|
cmd.Stdout = process.Stdout
|
|
|
|
cmd.Stderr = process.Stderr
|
|
|
|
}
|
2015-03-26 19:20:59 +08:00
|
|
|
cmd.ExtraFiles = append(cmd.ExtraFiles, criuServer)
|
2018-07-03 04:22:38 +08:00
|
|
|
if extraFiles != nil {
|
|
|
|
cmd.ExtraFiles = append(cmd.ExtraFiles, extraFiles...)
|
|
|
|
}
|
2015-03-26 19:20:59 +08:00
|
|
|
|
2015-03-13 12:45:43 +08:00
|
|
|
if err := cmd.Start(); err != nil {
|
2015-03-19 11:22:21 +08:00
|
|
|
return err
|
2015-03-13 12:45:43 +08:00
|
|
|
}
|
2020-04-04 06:27:27 +08:00
|
|
|
// We close criuServer so that even if CRIU crashes or unexpectedly exits, runc will not hang.
|
|
|
|
criuServer.Close()
|
2020-02-10 14:32:56 +08:00
|
|
|
// cmd.Process will be replaced by a restored init.
|
|
|
|
criuProcess := cmd.Process
|
2015-03-25 21:21:44 +08:00
|
|
|
|
2020-05-30 06:50:37 +08:00
|
|
|
var criuProcessState *os.ProcessState
|
2015-03-26 19:20:59 +08:00
|
|
|
defer func() {
|
2020-05-30 06:50:37 +08:00
|
|
|
if criuProcessState == nil {
|
|
|
|
criuClientCon.Close()
|
|
|
|
_, err := criuProcess.Wait()
|
|
|
|
if err != nil {
|
|
|
|
logrus.Warnf("wait on criuProcess returned %v", err)
|
|
|
|
}
|
2015-03-26 19:20:59 +08:00
|
|
|
}
|
|
|
|
}()
|
|
|
|
|
2015-09-08 17:02:08 +08:00
|
|
|
if applyCgroups {
|
2020-02-10 14:32:56 +08:00
|
|
|
err := c.criuApplyCgroups(criuProcess.Pid, req)
|
2015-09-08 17:02:08 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-05-14 17:42:21 +08:00
|
|
|
var extFds []string
|
2015-04-29 04:54:03 +08:00
|
|
|
if process != nil {
|
2020-02-10 14:32:56 +08:00
|
|
|
extFds, err = getPipeFds(criuProcess.Pid)
|
2015-04-29 04:54:03 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-08-31 19:34:14 +08:00
|
|
|
logrus.Debugf("Using CRIU in %s mode", req.GetType().String())
|
2017-03-15 04:21:58 +08:00
|
|
|
// In the case of criurpc.CriuReqType_FEATURE_CHECK req.GetOpts()
|
|
|
|
// should be empty. For older CRIU versions it will still be
|
2017-08-02 23:55:56 +08:00
|
|
|
// available but empty. criurpc.CriuReqType_VERSION actually
|
|
|
|
// has no req.GetOpts().
|
|
|
|
if !(req.GetType() == criurpc.CriuReqType_FEATURE_CHECK ||
|
|
|
|
req.GetType() == criurpc.CriuReqType_VERSION) {
|
|
|
|
|
2017-03-15 04:21:58 +08:00
|
|
|
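// Dump every CRIU option by calling its generated Get<Field> accessor via
// reflection, skipping protobuf-internal XXX_* bookkeeping fields.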
val := reflect.ValueOf(req.GetOpts())
|
|
|
|
v := reflect.Indirect(val)
|
|
|
|
for i := 0; i < v.NumField(); i++ {
|
|
|
|
st := v.Type()
|
|
|
|
name := st.Field(i).Name
|
|
|
|
if strings.HasPrefix(name, "XXX_") {
|
|
|
|
continue
|
|
|
|
}
|
|
|
|
value := val.MethodByName("Get" + name).Call([]reflect.Value{})
|
|
|
|
logrus.Debugf("CRIU option %s with value %v", name, value[0])
|
2015-08-31 19:34:14 +08:00
|
|
|
}
|
|
|
|
}
|
2015-04-10 23:18:16 +08:00
|
|
|
data, err := proto.Marshal(req)
|
2015-03-26 19:20:59 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
2015-03-25 21:21:44 +08:00
|
|
|
}
|
2017-03-02 16:02:15 +08:00
|
|
|
_, err = criuClientCon.Write(data)
|
2015-03-07 03:21:02 +08:00
|
|
|
if err != nil {
|
2015-03-19 11:22:21 +08:00
|
|
|
return err
|
2015-03-07 03:21:02 +08:00
|
|
|
}
|
2015-03-25 21:21:44 +08:00
|
|
|
|
2015-03-26 19:20:59 +08:00
|
|
|
buf := make([]byte, 10*4096)
|
2017-03-02 16:02:15 +08:00
|
|
|
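// oob receives out-of-band (SCM_RIGHTS) data, e.g. the orphan-pts-master fd.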
oob := make([]byte, 4096)
|
2015-03-26 19:20:59 +08:00
|
|
|
for {
|
2017-03-02 16:02:15 +08:00
|
|
|
n, oobn, _, _, err := criuClientCon.ReadMsgUnix(buf, oob)
|
runc checkpoint: fix --status-fd to accept fd
1. The command `runc checkpoint --lazy-server --status-fd $FD` actually
accepts a file name as an $FD. Make it accept a file descriptor,
like its name implies and the documentation states.
In addition, since runc itself does not use the result of CRIU status
fd, remove the code which relays it, and pass the FD directly to CRIU.
Note 1: runc should close this file descriptor itself after passing it
to criu, otherwise whoever waits on it might wait forever.
Note 2: due to the way criu swrk consumes the fd (it reopens
/proc/$SENDER_PID/fd/$FD), runc can't close it as soon as criu swrk has
started. There is no good way to know when criu swrk has reopened the
fd, so we assume that as soon as we have received something back, the
fd is already reopened.
2. Since the meaning of --status-fd has changed, the test case using
it needs to be fixed as well.
Modify the lazy migration test to remove "sleep 2", actually waiting
for the lazy page server to be ready.
While at it,
- remove the double fork (using shell's background process is
sufficient here);
- check the exit code for "runc checkpoint" and "criu lazy-pages";
- remove the check for no errors in dump.log after restore, as we
are already checking its exit code.
[v2: properly close status fd after spawning criu]
[v3: move close status fd to after the first read]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-21 17:43:24 +08:00
|
|
|
if req.Opts != nil && req.Opts.StatusFd != nil {
|
|
|
|
// Close status_fd as soon as we got something back from criu,
|
|
|
|
// assuming it has consumed (reopened) it by this time.
|
|
|
|
// Otherwise it might be left open forever and whoever
|
|
|
|
// is waiting on it will wait forever.
|
|
|
|
fd := int(*req.Opts.StatusFd)
|
|
|
|
_ = unix.Close(fd)
|
|
|
|
req.Opts.StatusFd = nil
|
|
|
|
}
|
2015-03-26 19:20:59 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
if n == 0 {
|
2020-05-17 08:20:44 +08:00
|
|
|
return errors.New("unexpected EOF")
|
2015-03-26 19:20:59 +08:00
|
|
|
}
|
|
|
|
if n == len(buf) {
|
2020-05-17 08:20:44 +08:00
|
|
|
return errors.New("buffer is too small")
|
2015-03-26 19:20:59 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
resp := new(criurpc.CriuResp)
|
|
|
|
err = proto.Unmarshal(buf[:n], resp)
|
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
if !resp.GetSuccess() {
|
2015-08-22 05:20:59 +08:00
|
|
|
typeString := req.GetType().String()
|
|
|
|
return fmt.Errorf("criu failed: type %s errno %d\nlog file: %s", typeString, resp.GetCrErrno(), logPath)
|
2015-03-26 19:20:59 +08:00
|
|
|
}
|
|
|
|
|
2015-04-10 23:18:16 +08:00
|
|
|
t := resp.GetType()
|
2015-03-26 19:20:59 +08:00
|
|
|
switch {
|
2017-03-15 04:21:58 +08:00
|
|
|
case t == criurpc.CriuReqType_FEATURE_CHECK:
|
|
|
|
logrus.Debugf("Feature check says: %s", resp)
|
|
|
|
criuFeatures = resp.GetFeatures()
|
2015-03-26 19:20:59 +08:00
|
|
|
case t == criurpc.CriuReqType_NOTIFY:
|
2020-02-10 14:32:56 +08:00
|
|
|
if err := c.criuNotifications(resp, process, cmd, opts, extFds, oob[:oobn]); err != nil {
|
2015-04-10 23:03:23 +08:00
|
|
|
return err
|
2015-03-26 19:20:59 +08:00
|
|
|
}
|
|
|
|
t = criurpc.CriuReqType_NOTIFY
|
2015-04-10 23:18:16 +08:00
|
|
|
req = &criurpc.CriuReq{
|
2015-03-26 19:20:59 +08:00
|
|
|
Type: &t,
|
|
|
|
NotifySuccess: proto.Bool(true),
|
|
|
|
}
|
2015-04-10 23:18:16 +08:00
|
|
|
data, err = proto.Marshal(req)
|
2015-03-26 19:20:59 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
2017-03-02 16:02:15 +08:00
|
|
|
_, err = criuClientCon.Write(data)
|
2015-03-26 19:20:59 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
continue
|
|
|
|
case t == criurpc.CriuReqType_RESTORE:
|
2015-04-10 23:19:14 +08:00
|
|
|
case t == criurpc.CriuReqType_DUMP:
|
2016-08-24 17:48:56 +08:00
|
|
|
case t == criurpc.CriuReqType_PRE_DUMP:
|
2015-03-26 19:20:59 +08:00
|
|
|
default:
|
|
|
|
return fmt.Errorf("unable to parse the response %s", resp.String())
|
|
|
|
}
|
|
|
|
|
|
|
|
break
|
|
|
|
}
|
|
|
|
|
2017-03-02 16:02:15 +08:00
|
|
|
criuClientCon.CloseWrite()
|
2015-03-26 19:20:59 +08:00
|
|
|
// cmd.Wait() waits for cmd's goroutines, which are used for proxying file descriptors.
|
|
|
|
// Here we only want to wait for the CRIU process.
|
2020-05-30 06:50:37 +08:00
|
|
|
criuProcessState, err = criuProcess.Wait()
|
2015-03-26 19:20:59 +08:00
|
|
|
if err != nil {
|
2015-03-19 11:22:21 +08:00
|
|
|
return err
|
2015-03-13 12:45:43 +08:00
|
|
|
}
|
2017-03-02 16:02:15 +08:00
|
|
|
|
|
|
|
// In pre-dump mode CRIU is in a loop and waits for
|
|
|
|
// the final DUMP command.
|
|
|
|
// The current runc pre-dump approach, however, is
|
|
|
|
// to start criu in PRE_DUMP once for a single pre-dump
|
|
|
|
// and not the whole series of pre-dump, pre-dump, ..., dump.
|
|
|
|
// If we got the message CriuReqType_PRE_DUMP it means
|
|
|
|
// CRIU was successful and we need to forcefully stop CRIU.
|
2020-05-30 06:50:37 +08:00
|
|
|
if !criuProcessState.Success() && *req.Type != criurpc.CriuReqType_PRE_DUMP {
|
|
|
|
return fmt.Errorf("criu failed: %s\nlog file: %s", criuProcessState.String(), logPath)
|
2015-03-26 19:20:59 +08:00
|
|
|
}
|
2015-03-19 11:22:21 +08:00
|
|
|
return nil
|
2015-03-07 03:21:02 +08:00
|
|
|
}
|
|
|
|
|
2015-04-20 16:24:50 +08:00
|
|
|
// block any external network activity
|
|
|
|
func lockNetwork(config *configs.Config) error {
|
|
|
|
for _, config := range config.Networks {
|
|
|
|
strategy, err := getStrategy(config.Type)
|
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
|
|
|
|
if err := strategy.detach(config); err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return nil
|
|
|
|
}
|
|
|
|
|
|
|
|
func unlockNetwork(config *configs.Config) error {
|
|
|
|
for _, config := range config.Networks {
|
|
|
|
strategy, err := getStrategy(config.Type)
|
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
if err = strategy.attach(config); err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return nil
|
|
|
|
}
|
|
|
|
|
2020-02-10 14:32:56 +08:00
|
|
|
func (c *linuxContainer) criuNotifications(resp *criurpc.CriuResp, process *Process, cmd *exec.Cmd, opts *CriuOpts, fds []string, oob []byte) error {
|
2015-04-10 23:03:23 +08:00
|
|
|
notify := resp.GetNotify()
|
|
|
|
if notify == nil {
|
|
|
|
return fmt.Errorf("invalid response: %s", resp.String())
|
|
|
|
}
|
2020-05-12 07:37:45 +08:00
|
|
|
script := notify.GetScript()
|
|
|
|
logrus.Debugf("notify: %s\n", script)
|
|
|
|
switch script {
|
|
|
|
case "post-dump":
|
2015-10-03 02:16:50 +08:00
|
|
|
f, err := os.Create(filepath.Join(c.root, "checkpoint"))
|
|
|
|
if err != nil {
|
|
|
|
return err
|
2015-04-10 23:19:14 +08:00
|
|
|
}
|
2015-10-03 02:16:50 +08:00
|
|
|
f.Close()
|
2020-05-12 07:37:45 +08:00
|
|
|
case "network-unlock":
|
2015-04-20 16:24:50 +08:00
|
|
|
if err := unlockNetwork(c.config); err != nil {
|
|
|
|
return err
|
|
|
|
}
|
2020-05-12 07:37:45 +08:00
|
|
|
case "network-lock":
|
2015-04-20 16:24:50 +08:00
|
|
|
if err := lockNetwork(c.config); err != nil {
|
|
|
|
return err
|
|
|
|
}
|
2020-05-12 07:37:45 +08:00
|
|
|
case "setup-namespaces":
|
2016-02-19 07:09:15 +08:00
|
|
|
if c.config.Hooks != nil {
|
libcontainer: Set 'status' in hook stdin
Finish off the work started in a344b2d6 (sync up `HookState` with OCI
spec `State`, 2016-12-19, #1201).
And drop HookState, since there's no need for a local alias for
specs.State.
Also set c.initProcess in newInitProcess to support OCIState calls
from within initProcess.start(). I think the cyclic references
between linuxContainer and initProcess are unfortunate, but didn't
want to address that here.
I've also left the timing of the Prestart hooks alone, although the
spec calls for them to happen before start (not as part of creation)
[1,2]. Once the timing gets fixed we can drop the
initProcessStartTime hacks which initProcess.start currently needs.
I'm not sure why we trigger the prestart hooks in response to both
procReady and procHooks. But we've had two prestart rounds in
initProcess.start since 2f276498 (Move pre-start hooks after container
mounts, 2016-02-17, #568). I've left that alone too.
I really think we should have len() guards to avoid computing the
state when .Hooks is non-nil but the particular phase we're looking at
is empty. Aleksa, however, is adamantly against them [3] citing a
risk of sloppy copy/pastes causing the hook slice being len-guarded to
diverge from the hook slice being iterated over within the guard. I
think that sort of thing is very low-risk, because:
* We shouldn't be copy/pasting this, right? DRY for the win :).
* There's only ever a few lines between the guard and the guarded
loop. That makes broken copy/pastes easy to catch in review.
* We should have test coverage for these. Guarding with the wrong
slice is certainly not the only thing you can break with a sloppy
copy/paste.
But I'm not a maintainer ;).
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart
[2]: https://github.com/opencontainers/runc/issues/1710
[3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570
Signed-off-by: W. Trevor King <wking@tremily.us>
2018-02-26 06:47:41 +08:00
|
|
|
s, err := c.currentOCIState()
|
|
|
|
if err != nil {
|
|
|
|
return err
|
2016-02-19 07:09:15 +08:00
|
|
|
}
|
2018-12-03 13:31:20 +08:00
|
|
|
s.Pid = int(notify.GetPid())
|
2016-04-19 02:37:26 +08:00
|
|
|
for i, hook := range c.config.Hooks.Prestart {
|
2016-02-19 07:09:15 +08:00
|
|
|
if err := hook.Run(s); err != nil {
|
2016-04-19 02:37:26 +08:00
|
|
|
return newSystemErrorWithCausef(err, "running prestart hook %d", i)
|
2016-02-19 07:09:15 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2020-05-12 07:37:45 +08:00
|
|
|
case "post-restore":
|
2015-04-10 23:03:23 +08:00
|
|
|
pid := notify.GetPid()
|
2020-02-10 14:32:56 +08:00
|
|
|
|
|
|
|
p, err := os.FindProcess(int(pid))
|
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
cmd.Process = p
|
|
|
|
|
|
|
|
r, err := newRestoredProcess(cmd, fds)
|
2015-04-10 23:03:23 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
2015-10-03 02:16:50 +08:00
|
|
|
process.ops = r
|
|
|
|
if err := c.state.transition(&restoredState{
|
|
|
|
imageDir: opts.ImagesDirectory,
|
|
|
|
c: c,
|
|
|
|
}); err != nil {
|
|
|
|
return err
|
|
|
|
}
|
2016-08-22 06:15:18 +08:00
|
|
|
// create a timestamp indicating when the restored checkpoint was started
|
|
|
|
c.created = time.Now().UTC()
|
2016-07-05 08:24:13 +08:00
|
|
|
if _, err := c.updateState(r); err != nil {
|
2015-04-10 23:03:23 +08:00
|
|
|
return err
|
|
|
|
}
|
2015-10-03 02:16:50 +08:00
|
|
|
if err := os.Remove(filepath.Join(c.root, "checkpoint")); err != nil {
|
|
|
|
if !os.IsNotExist(err) {
|
|
|
|
logrus.Error(err)
|
|
|
|
}
|
|
|
|
}
|
2020-05-12 07:37:45 +08:00
|
|
|
case "orphan-pts-master":
|
2017-07-13 21:02:17 +08:00
|
|
|
scm, err := unix.ParseSocketControlMessage(oob)
|
2017-03-02 16:02:15 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
2017-07-13 21:02:17 +08:00
|
|
|
fds, err := unix.ParseUnixRights(&scm[0])
|
2017-07-28 19:56:33 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
2017-03-02 16:02:15 +08:00
|
|
|
|
|
|
|
master := os.NewFile(uintptr(fds[0]), "orphan-pts-master")
|
|
|
|
defer master.Close()
|
|
|
|
|
|
|
|
// While we can access console.master, using the API is a good idea.
|
2017-05-20 01:18:43 +08:00
|
|
|
if err := utils.SendFd(process.ConsoleSocket, master.Name(), master.Fd()); err != nil {
|
2017-03-02 16:02:15 +08:00
|
|
|
return err
|
|
|
|
}
|
2020-05-13 01:57:04 +08:00
|
|
|
case "status-ready":
|
|
|
|
if opts.StatusFd != -1 {
|
|
|
|
// write \0 to status fd to notify that lazy page server is ready
|
|
|
|
_, err := unix.Write(opts.StatusFd, []byte{0})
|
|
|
|
if err != nil {
|
|
|
|
logrus.Warnf("can't write \\0 to status fd: %v", err)
|
|
|
|
}
|
|
|
|
_ = unix.Close(opts.StatusFd)
|
|
|
|
opts.StatusFd = -1
|
|
|
|
}
|
2015-04-10 23:03:23 +08:00
|
|
|
}
|
|
|
|
return nil
|
|
|
|
}
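// A minimal caller-side sketch (hypothetical helper, not part of runc) showing
// how the status fd passed to `runc checkpoint --lazy-server --status-fd $FD`
// can be consumed: the caller blocks until the single \0 byte written in the
// "status-ready" case above arrives.
func waitForLazyPageServer(statusR *os.File) error {
buf := make([]byte, 1)
if _, err := statusR.Read(buf); err != nil {
return fmt.Errorf("waiting for lazy page server: %v", err)
}
return nil
}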
|
|
|
|
|
2016-07-05 08:24:13 +08:00
|
|
|
func (c *linuxContainer) updateState(process parentProcess) (*State, error) {
|
2017-08-15 14:30:58 +08:00
|
|
|
if process != nil {
|
|
|
|
c.initProcess = process
|
|
|
|
}
|
2015-02-14 06:41:37 +08:00
|
|
|
state, err := c.currentState()
|
2015-02-12 08:45:23 +08:00
|
|
|
if err != nil {
|
2016-07-05 08:24:13 +08:00
|
|
|
return nil, err
|
|
|
|
}
|
|
|
|
err = c.saveState(state)
|
|
|
|
if err != nil {
|
|
|
|
return nil, err
|
2015-02-12 08:45:23 +08:00
|
|
|
}
|
2016-07-05 08:24:13 +08:00
|
|
|
return state, nil
|
2015-10-03 02:16:50 +08:00
|
|
|
}
|
|
|
|
|
2020-06-11 04:54:21 +08:00
|
|
|
func (c *linuxContainer) saveState(s *State) (retErr error) {
|
|
|
|
tmpFile, err := ioutil.TempFile(c.root, "state-")
|
2015-02-12 08:45:23 +08:00
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
2020-06-11 04:54:21 +08:00
|
|
|
|
|
|
|
defer func() {
|
|
|
|
if retErr != nil {
|
|
|
|
tmpFile.Close()
|
|
|
|
os.Remove(tmpFile.Name())
|
|
|
|
}
|
|
|
|
}()
|
|
|
|
|
|
|
|
err = utils.WriteJSON(tmpFile, s)
|
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
err = tmpFile.Close()
|
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
|
|
|
|
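// The rename below replaces the state file atomically (both paths are in
// c.root, i.e. on the same filesystem), so concurrent readers never observe
// a partially written state.json.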
stateFilePath := filepath.Join(c.root, stateFilename)
|
|
|
|
return os.Rename(tmpFile.Name(), stateFilePath)
|
2015-10-03 02:16:50 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
func (c *linuxContainer) deleteState() error {
|
|
|
|
return os.Remove(filepath.Join(c.root, stateFilename))
|
2015-02-12 08:45:23 +08:00
|
|
|
}
|
2015-02-14 06:41:37 +08:00
|
|
|
|
|
|
|
func (c *linuxContainer) currentStatus() (Status, error) {
|
2015-10-03 02:16:50 +08:00
|
|
|
if err := c.refreshState(); err != nil {
|
|
|
|
return -1, err
|
|
|
|
}
|
|
|
|
return c.state.status(), nil
|
|
|
|
}
|
|
|
|
|
|
|
|
// refreshState needs to be called to verify that the current state on the
|
|
|
|
// container is accurate. Because consumers of libcontainer can use it
|
|
|
|
// out of process we need to verify the container's status based on runtime
|
|
|
|
// information and not rely on our in process info.
|
|
|
|
func (c *linuxContainer) refreshState() error {
|
|
|
|
paused, err := c.isPaused()
|
|
|
|
if err != nil {
|
|
|
|
return err
|
|
|
|
}
|
|
|
|
if paused {
|
|
|
|
return c.state.transition(&pausedState{c: c})
|
2015-04-10 20:47:37 +08:00
|
|
|
}
|
2019-02-08 19:37:42 +08:00
|
|
|
t := c.runType()
|
2016-05-14 07:54:16 +08:00
|
|
|
switch t {
|
2016-05-14 08:01:12 +08:00
|
|
|
case Created:
|
|
|
|
return c.state.transition(&createdState{c: c})
|
2016-05-14 07:54:16 +08:00
|
|
|
case Running:
|
2015-10-03 02:16:50 +08:00
|
|
|
return c.state.transition(&runningState{c: c})
|
|
|
|
}
|
|
|
|
return c.state.transition(&stoppedState{c: c})
|
|
|
|
}
|
|
|
|
|
2019-02-08 19:37:42 +08:00
|
|
|
func (c *linuxContainer) runType() Status {
|
2015-02-14 06:41:37 +08:00
|
|
|
if c.initProcess == nil {
|
2019-02-08 19:37:42 +08:00
|
|
|
return Stopped
|
2015-02-14 06:41:37 +08:00
|
|
|
}
|
2016-05-14 07:54:16 +08:00
|
|
|
pid := c.initProcess.pid()
|
2017-06-15 07:41:16 +08:00
|
|
|
stat, err := system.Stat(pid)
|
|
|
|
if err != nil {
|
2019-02-08 19:37:42 +08:00
|
|
|
return Stopped
|
2016-05-14 07:54:16 +08:00
|
|
|
}
|
2017-06-15 07:41:16 +08:00
|
|
|
if stat.StartTime != c.initProcessStartTime || stat.State == system.Zombie || stat.State == system.Dead {
|
2019-02-08 19:37:42 +08:00
|
|
|
return Stopped
|
2016-07-05 08:24:13 +08:00
|
|
|
}
|
2017-02-22 10:16:19 +08:00
|
|
|
// We'll create the exec fifo and block on it after the container is created,
|
|
|
|
// and delete it after the container is started.
|
|
|
|
if _, err := os.Stat(filepath.Join(c.root, execFifoFilename)); err == nil {
|
2019-02-08 19:37:42 +08:00
|
|
|
return Created
|
2015-02-14 06:41:37 +08:00
|
|
|
}
|
2019-02-08 19:37:42 +08:00
|
|
|
return Running
|
2015-10-03 02:16:50 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
func (c *linuxContainer) isPaused() (bool, error) {
|
2020-05-11 13:19:30 +08:00
|
|
|
state, err := c.cgroupManager.GetFreezerState()
|
2015-10-03 02:16:50 +08:00
|
|
|
if err != nil {
|
2020-05-11 13:19:30 +08:00
|
|
|
return false, err
|
2015-02-14 06:41:37 +08:00
|
|
|
}
|
2020-05-11 13:19:30 +08:00
|
|
|
return state == configs.Frozen, nil
|
2015-02-14 06:41:37 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
func (c *linuxContainer) currentState() (*State, error) {
|
2016-01-22 08:43:33 +08:00
|
|
|
var (
|
2017-06-15 06:38:45 +08:00
|
|
|
startTime uint64
|
2016-01-22 08:43:33 +08:00
|
|
|
externalDescriptors []string
|
|
|
|
pid = -1
|
|
|
|
)
|
|
|
|
if c.initProcess != nil {
|
|
|
|
pid = c.initProcess.pid()
|
|
|
|
startTime, _ = c.initProcess.startTime()
|
|
|
|
externalDescriptors = c.initProcess.externalDescriptors()
|
2015-02-14 06:41:37 +08:00
|
|
|
}
|
libcontainer: add support for Intel RDT/CAT in runc
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.
This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).
For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.
About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.
Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| |-- cbm_mask
| |-- min_cbm_bits
| |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
|-- cpus
|-- schemata
|-- tasks
For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
is in the root group.
The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00, etc.
For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:
"linux": {
"intelRdt": {
"l3CacheSchema": "L3:0=ffff0;1=3ff"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-08-30 19:34:26 +08:00
|
|
|
intelRdtPath, err := intelrdt.GetIntelRdtPath(c.ID())
|
|
|
|
if err != nil {
|
|
|
|
intelRdtPath = ""
|
|
|
|
}
|
2015-02-14 06:41:37 +08:00
|
|
|
state := &State{
|
2015-10-24 00:22:48 +08:00
|
|
|
BaseState: BaseState{
|
|
|
|
ID: c.ID(),
|
|
|
|
Config: *c.config,
|
2016-01-22 08:43:33 +08:00
|
|
|
InitProcessPid: pid,
|
2015-10-24 00:22:48 +08:00
|
|
|
InitProcessStartTime: startTime,
|
2016-01-29 05:32:24 +08:00
|
|
|
Created: c.created,
|
2015-10-24 00:22:48 +08:00
|
|
|
},
|
2018-07-05 14:28:21 +08:00
|
|
|
Rootless: c.config.RootlessEUID && c.config.RootlessCgroups,
|
2015-10-24 00:22:48 +08:00
|
|
|
CgroupPaths: c.cgroupManager.GetPaths(),
|
2017-08-30 19:34:26 +08:00
|
|
|
IntelRdtPath: intelRdtPath,
|
2015-10-24 00:22:48 +08:00
|
|
|
NamespacePaths: make(map[configs.NamespaceType]string),
|
2016-01-22 08:43:33 +08:00
|
|
|
ExternalDescriptors: externalDescriptors,
|
2015-02-14 06:41:37 +08:00
|
|
|
}
|
2016-01-22 08:43:33 +08:00
|
|
|
if pid > 0 {
|
|
|
|
for _, ns := range c.config.Namespaces {
|
|
|
|
state.NamespacePaths[ns.Type] = ns.GetPath(pid)
|
|
|
|
}
|
|
|
|
for _, nsType := range configs.NamespaceTypes() {
|
2016-03-02 09:59:26 +08:00
|
|
|
if !configs.IsNamespaceSupported(nsType) {
|
|
|
|
continue
|
|
|
|
}
|
2016-01-22 08:43:33 +08:00
|
|
|
if _, ok := state.NamespacePaths[nsType]; !ok {
|
|
|
|
ns := configs.Namespace{Type: nsType}
|
|
|
|
state.NamespacePaths[ns.Type] = ns.GetPath(pid)
|
|
|
|
}
|
2015-04-08 05:16:29 +08:00
|
|
|
}
|
|
|
|
}
|
2015-02-14 06:41:37 +08:00
|
|
|
return state, nil
|
|
|
|
}
|
2015-10-17 23:35:36 +08:00
|
|
|
|
2018-02-26 06:47:41 +08:00
|
|
|
func (c *linuxContainer) currentOCIState() (*specs.State, error) {
|
|
|
|
bundle, annotations := utils.Annotations(c.config.Labels)
|
|
|
|
state := &specs.State{
|
|
|
|
Version: specs.Version,
|
|
|
|
ID: c.ID(),
|
|
|
|
Bundle: bundle,
|
|
|
|
Annotations: annotations,
|
|
|
|
}
|
|
|
|
status, err := c.currentStatus()
|
|
|
|
if err != nil {
|
|
|
|
return nil, err
|
|
|
|
}
|
|
|
|
state.Status = status.String()
|
|
|
|
if status != Stopped {
|
|
|
|
if c.initProcess != nil {
|
|
|
|
state.Pid = c.initProcess.pid()
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return state, nil
|
|
|
|
}
|
|
|
|
|
2015-09-14 08:37:56 +08:00
|
|
|
// orderNamespacePaths sorts namespace paths into a list of paths that we
|
|
|
|
// can setns in order.
|
|
|
|
func (c *linuxContainer) orderNamespacePaths(namespaces map[configs.NamespaceType]string) ([]string, error) {
|
|
|
|
paths := []string{}
|
2017-04-28 12:42:56 +08:00
|
|
|
for _, ns := range configs.NamespaceTypes() {
|
2017-04-25 18:26:40 +08:00
|
|
|
|
|
|
|
// Remove namespaces that we don't need to join.
|
|
|
|
if !c.config.Namespaces.Contains(ns) {
|
|
|
|
continue
|
2016-07-18 22:40:24 +08:00
|
|
|
}
|
2017-04-25 18:26:40 +08:00
|
|
|
|
|
|
|
if p, ok := namespaces[ns]; ok && p != "" {
|
2015-09-14 08:37:56 +08:00
|
|
|
// check if the requested namespace is supported
|
2017-04-25 18:26:40 +08:00
|
|
|
if !configs.IsNamespaceSupported(ns) {
|
|
|
|
return nil, newSystemError(fmt.Errorf("namespace %s is not supported", ns))
|
2015-09-14 08:37:56 +08:00
|
|
|
}
|
|
|
|
// only set to join this namespace if it exists
|
|
|
|
if _, err := os.Lstat(p); err != nil {
|
2016-04-19 02:37:26 +08:00
|
|
|
return nil, newSystemErrorWithCausef(err, "running lstat on namespace path %q", p)
|
2015-09-14 08:37:56 +08:00
|
|
|
}
|
|
|
|
// do not allow a namespace path with a comma, as we use commas to separate
|
|
|
|
// the namespace paths
|
|
|
|
if strings.ContainsRune(p, ',') {
|
|
|
|
return nil, newSystemError(fmt.Errorf("invalid path %s", p))
|
|
|
|
}
|
2017-04-25 18:26:40 +08:00
|
|
|
paths = append(paths, fmt.Sprintf("%s:%s", configs.NsName(ns), p))
|
2015-09-14 08:37:56 +08:00
|
|
|
}
|
2017-04-25 18:26:40 +08:00
|
|
|
|
2015-09-14 08:37:56 +08:00
|
|
|
}
|
2017-04-25 18:26:40 +08:00
|
|
|
|
2015-09-14 08:37:56 +08:00
|
|
|
return paths, nil
|
|
|
|
}
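// The result is later joined with commas into the NsPathsAttr payload (hence
// the comma check above), e.g.:
//
//	[]string{"net:/proc/1234/ns/net", "ipc:/proc/1234/ns/ipc"}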
|
2015-09-14 08:40:43 +08:00
|
|
|
|
|
|
|
func encodeIDMapping(idMap []configs.IDMap) ([]byte, error) {
|
|
|
|
data := bytes.NewBuffer(nil)
|
|
|
|
for _, im := range idMap {
|
|
|
|
line := fmt.Sprintf("%d %d %d\n", im.ContainerID, im.HostID, im.Size)
|
|
|
|
if _, err := data.WriteString(line); err != nil {
|
|
|
|
return nil, err
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return data.Bytes(), nil
|
|
|
|
}
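// The output follows the kernel's uid_map/gid_map format, one
// "<container-id> <host-id> <size>" triple per line, e.g.:
//
//	0 1000 1
//	1 100000 65536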
|
|
|
|
|
|
|
|
// bootstrapData encodes the necessary data in netlink binary format
|
|
|
|
// as an io.Reader.
|
|
|
|
// Consumers can write the data to a bootstrap program
|
|
|
|
// such as one that uses the nsenter package to bootstrap the container's
|
|
|
|
// init process correctly, i.e. with correct namespaces, uid/gid
|
|
|
|
// mapping etc.
|
2016-06-03 23:29:34 +08:00
|
|
|
func (c *linuxContainer) bootstrapData(cloneFlags uintptr, nsMaps map[configs.NamespaceType]string) (io.Reader, error) {
|
2015-09-14 08:40:43 +08:00
|
|
|
// create the netlink message
|
|
|
|
r := nl.NewNetlinkRequest(int(InitMsg), 0)
|
|
|
|
|
|
|
|
// write cloneFlags
|
|
|
|
r.AddData(&Int32msg{
|
|
|
|
Type: CloneFlagsAttr,
|
|
|
|
Value: uint32(cloneFlags),
|
|
|
|
})
|
|
|
|
|
|
|
|
// write custom namespace paths
|
|
|
|
if len(nsMaps) > 0 {
|
|
|
|
nsPaths, err := c.orderNamespacePaths(nsMaps)
|
|
|
|
if err != nil {
|
|
|
|
return nil, err
|
|
|
|
}
|
|
|
|
r.AddData(&Bytemsg{
|
|
|
|
Type: NsPathsAttr,
|
|
|
|
Value: []byte(strings.Join(nsPaths, ",")),
|
|
|
|
})
|
|
|
|
}
|
|
|
|
|
|
|
|
// write uid/gid mappings only when we are not joining an existing user ns
|
|
|
|
_, joinExistingUser := nsMaps[configs.NEWUSER]
|
|
|
|
if !joinExistingUser {
|
|
|
|
// write uid mappings
|
|
|
|
if len(c.config.UidMappings) > 0 {
|
2018-07-05 14:28:21 +08:00
|
|
|
if c.config.RootlessEUID && c.newuidmapPath != "" {
|
2017-09-05 22:25:17 +08:00
|
|
|
r.AddData(&Bytemsg{
|
|
|
|
Type: UidmapPathAttr,
|
|
|
|
Value: []byte(c.newuidmapPath),
|
|
|
|
})
|
|
|
|
}
|
2015-09-14 08:40:43 +08:00
|
|
|
b, err := encodeIDMapping(c.config.UidMappings)
|
|
|
|
if err != nil {
|
|
|
|
return nil, err
|
|
|
|
}
|
|
|
|
r.AddData(&Bytemsg{
|
|
|
|
Type: UidmapAttr,
|
|
|
|
Value: b,
|
|
|
|
})
|
|
|
|
}
|
|
|
|
|
|
|
|
// write gid mappings
|
|
|
|
if len(c.config.GidMappings) > 0 {
|
2016-03-12 13:18:42 +08:00
|
|
|
b, err := encodeIDMapping(c.config.GidMappings)
|
2015-09-14 08:40:43 +08:00
|
|
|
if err != nil {
|
|
|
|
return nil, err
|
|
|
|
}
|
|
|
|
r.AddData(&Bytemsg{
|
|
|
|
Type: GidmapAttr,
|
|
|
|
Value: b,
|
|
|
|
})
|
2018-07-05 14:28:21 +08:00
|
|
|
if c.config.RootlessEUID && c.newgidmapPath != "" {
|
2017-07-21 01:33:01 +08:00
|
|
|
r.AddData(&Bytemsg{
|
|
|
|
Type: GidmapPathAttr,
|
|
|
|
Value: []byte(c.newgidmapPath),
|
|
|
|
})
|
2017-09-05 22:25:17 +08:00
|
|
|
}
|
2018-03-26 13:43:37 +08:00
|
|
|
if requiresRootOrMappingTool(c.config) {
|
2018-05-22 14:56:01 +08:00
|
|
|
r.AddData(&Boolmsg{
|
|
|
|
Type: SetgroupAttr,
|
|
|
|
Value: true,
|
|
|
|
})
|
2015-09-14 08:40:43 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-03-16 08:54:47 +08:00
|
|
|
if c.config.OomScoreAdj != nil {
|
|
|
|
// write oom_score_adj
|
|
|
|
r.AddData(&Bytemsg{
|
|
|
|
Type: OomScoreAdjAttr,
|
|
|
|
Value: []byte(fmt.Sprintf("%d", *c.config.OomScoreAdj)),
|
|
|
|
})
|
|
|
|
}
|
2017-01-17 09:25:21 +08:00
|
|
|
|
2016-04-23 21:39:42 +08:00
|
|
|
// write rootless
|
|
|
|
r.AddData(&Boolmsg{
|
2018-07-05 14:28:21 +08:00
|
|
|
Type: RootlessEUIDAttr,
|
|
|
|
Value: c.config.RootlessEUID,
|
2016-04-23 21:39:42 +08:00
|
|
|
})
|
|
|
|
|
2015-09-14 08:40:43 +08:00
|
|
|
return bytes.NewReader(r.Serialize()), nil
|
|
|
|
}
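// A minimal usage sketch (assuming parentPipe is the *os.File shared with the
// nsenter bootstrap code): the serialized netlink payload is simply copied
// into the pipe for the C side to parse.
//
//	data, err := c.bootstrapData(uintptr(cloneFlags), nsMaps)
//	if err != nil {
//		return err
//	}
//	if _, err := io.Copy(parentPipe, data); err != nil {
//		return err
//	}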
|
2017-01-25 07:24:05 +08:00
|
|
|
|
|
|
|
// ignoreTerminateErrors returns nil if the given err matches an error known
|
|
|
|
// to indicate that the termination occurred successfully or err was nil; otherwise
|
|
|
|
// err is returned unaltered.
|
|
|
|
func ignoreTerminateErrors(err error) error {
|
|
|
|
if err == nil {
|
|
|
|
return nil
|
|
|
|
}
|
2017-10-11 03:36:19 +08:00
|
|
|
s := err.Error()
|
|
|
|
switch {
|
|
|
|
case strings.Contains(s, "process already finished"), strings.Contains(s, "Wait was already called"):
|
2017-01-25 07:24:05 +08:00
|
|
|
return nil
|
|
|
|
}
|
2017-10-11 03:36:19 +08:00
|
|
|
return err
|
2017-01-25 07:24:05 +08:00
|
|
|
}
|
2018-03-26 13:43:37 +08:00
|
|
|
|
|
|
|
func requiresRootOrMappingTool(c *configs.Config) bool {
|
|
|
|
gidMap := []configs.IDMap{
|
|
|
|
{ContainerID: 0, HostID: os.Getegid(), Size: 1},
|
|
|
|
}
|
|
|
|
return !reflect.DeepEqual(c.GidMappings, gidMap)
|
|
|
|
}
|