// SPDX-License-Identifier: GPL-2.0-only
/* Copyright (c) 2017 Facebook
 */
#include <linux/slab.h>
#include <linux/bpf.h>

#include "map_in_map.h"

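/*
 * bpf_map_meta_alloc() runs when an outer map (ARRAY_OF_MAPS or
 * HASH_OF_MAPS) is created: inner_map_ufd refers to a template inner map
 * whose attributes are copied into a dummy "meta" map that the verifier
 * and later compatibility checks rely on.
 *
 * Minimal userspace sketch of that creation path (illustration only, not
 * part of this file; raw bpf(2) calls, error handling omitted):
 *
 *	union bpf_attr attr = {};
 *	int inner_fd, outer_fd;
 *
 *	attr.map_type    = BPF_MAP_TYPE_ARRAY;
 *	attr.key_size    = 4;
 *	attr.value_size  = 4;
 *	attr.max_entries = 1;
 *	inner_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
 *
 *	memset(&attr, 0, sizeof(attr));
 *	attr.map_type     = BPF_MAP_TYPE_ARRAY_OF_MAPS;
 *	attr.key_size     = 4;
 *	attr.value_size   = 4;
 *	attr.max_entries  = 1;
 *	attr.inner_map_fd = inner_fd;
 *	outer_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
 */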
struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
{
	struct bpf_map *inner_map, *inner_map_meta;
	u32 inner_map_meta_size;
	struct fd f;

	f = fdget(inner_map_ufd);
	inner_map = __bpf_map_get(f);
	if (IS_ERR(inner_map))
		return inner_map;

	/* prog_array->aux->{type,jited} is a runtime binding.
	 * Doing static check alone in the verifier is not enough.
	 */
	if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
	    inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE ||
	    inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ||
	    inner_map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
		fdput(f);
		return ERR_PTR(-ENOTSUPP);
	}

	/* Does not support >1 level map-in-map */
	if (inner_map->inner_map_meta) {
		fdput(f);
		return ERR_PTR(-EINVAL);
	}

	if (map_value_has_spin_lock(inner_map)) {
		fdput(f);
		return ERR_PTR(-ENOTSUPP);
	}

	inner_map_meta_size = sizeof(*inner_map_meta);
	/* In some cases verifier needs to access beyond just base map. */
	if (inner_map->ops == &array_map_ops)
		inner_map_meta_size = sizeof(struct bpf_array);

	inner_map_meta = kzalloc(inner_map_meta_size, GFP_USER);
	if (!inner_map_meta) {
		fdput(f);
		return ERR_PTR(-ENOMEM);
	}

	inner_map_meta->map_type = inner_map->map_type;
	inner_map_meta->key_size = inner_map->key_size;
	inner_map_meta->value_size = inner_map->value_size;
	inner_map_meta->map_flags = inner_map->map_flags;
	inner_map_meta->max_entries = inner_map->max_entries;
	inner_map_meta->spin_lock_off = inner_map->spin_lock_off;

	/* Misc members not needed in bpf_map_meta_equal() check. */
	inner_map_meta->ops = inner_map->ops;
	if (inner_map->ops == &array_map_ops) {
		inner_map_meta->bypass_spec_v1 = inner_map->bypass_spec_v1;
		container_of(inner_map_meta, struct bpf_array, map)->index_mask =
		     container_of(inner_map, struct bpf_array, map)->index_mask;
	}

	fdput(f);
	return inner_map_meta;
}

void bpf_map_meta_free(struct bpf_map *map_meta)
{
	kfree(map_meta);
}

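/*
 * Compatibility check between the inner_map_meta recorded at outer-map
 * creation time and a candidate inner map inserted later (see
 * bpf_map_fd_get_ptr() below): every compared attribute must match
 * exactly.
 */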
bool bpf_map_meta_equal(const struct bpf_map *meta0,
			const struct bpf_map *meta1)
{
	/* No need to compare ops because it is covered by map_type */
	return meta0->map_type == meta1->map_type &&
		meta0->key_size == meta1->key_size &&
		meta0->value_size == meta1->value_size &&
		meta0->map_flags == meta1->map_flags &&
		meta0->max_entries == meta1->max_entries;
}

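/*
 * bpf_map_fd_get_ptr() runs on the map_update_elem() path of an outer map:
 * the element value supplied by userspace is an inner map fd, which is
 * resolved to a bpf_map pointer and referenced, provided it is compatible
 * with the outer map's inner_map_meta.
 *
 * Illustration only (hypothetical outer_fd/inner_fd, error handling
 * omitted): userspace plugs an inner map into slot 0 of an outer
 * array-of-maps with:
 *
 *	__u32 key = 0;
 *	union bpf_attr attr = {};
 *
 *	attr.map_fd = outer_fd;
 *	attr.key    = (__u64)(unsigned long)&key;
 *	attr.value  = (__u64)(unsigned long)&inner_fd;
 *	syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
 */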
void *bpf_map_fd_get_ptr(struct bpf_map *map,
			 struct file *map_file /* not used */,
			 int ufd)
{
	struct bpf_map *inner_map;
	struct fd f;

	f = fdget(ufd);
	inner_map = __bpf_map_get(f);
	if (IS_ERR(inner_map))
		return inner_map;

	if (bpf_map_meta_equal(map->inner_map_meta, inner_map))
		bpf_map_inc(inner_map);
	else
		inner_map = ERR_PTR(-EINVAL);

	fdput(f);
	return inner_map;
}

void bpf_map_fd_put_ptr(void *ptr)
{
	/* ptr->ops->map_free() has to go through one
	 * rcu grace period by itself.
	 */
	bpf_map_put(ptr);
}

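/*
 * Syscall-side lookups on an outer map cannot return a kernel pointer to
 * userspace, so bpf_map_fd_sys_lookup_elem() hands back the inner map's
 * id instead. Illustration only (libbpf helpers, hypothetical fds):
 *
 *	__u32 key = 0, inner_id;
 *
 *	bpf_map_lookup_elem(outer_fd, &key, &inner_id);
 *	inner_fd = bpf_map_get_fd_by_id(inner_id);
 */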
u32 bpf_map_fd_sys_lookup_elem(void *ptr)
{
	return ((struct bpf_map *)ptr)->id;
}