OpenCloudOS-Kernel/net
Martin KaFai Lau 5dc4c4b7d4 bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.

To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decisions.  In this case, the decision
is which SO_REUSEPORT sk should serve the incoming request.

By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility over where a SO_REUSEPORT sk should be located in a bpf map.
A later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map.  That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).

For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in a later patch) to learn
the application-level connection information and then decide which sk
to pick from a bpf map.  The userspace can tightly couple the sk's location
in a bpf map with the application logic that generates the UDP payload's
connection information.  This connection info contract/API stays within the
userspace.

Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process.  The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds.  [Note that
deleting an fd from a bpf map does not necessarily mean the fd is closed]
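
The one-call switch described above can be sketched from userspace as
follows.  This is a minimal sketch that issues the raw bpf(2) syscall
instead of going through a library; the helper names
sys_bpf_map_update() and switch_inner_map() are hypothetical, and the
outer map is assumed to be a map-in-map whose value is the 4-byte fd of
the inner reuseport array.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper around bpf(BPF_MAP_UPDATE_ELEM). */
static long sys_bpf_map_update(int map_fd, const void *key,
			       const void *value, uint64_t flags)
{
	union bpf_attr attr;

	assert(key && value);
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = (uint32_t)map_fd;
	attr.key = (uint64_t)(uintptr_t)key;
	attr.value = (uint64_t)(uintptr_t)value;
	attr.flags = flags;
	return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

/*
 * Point outer_map[index] at the new server process's reuseport array.
 * Returns 0 on success or a negative errno value on failure.
 */
int switch_inner_map(int outer_map_fd, uint32_t index,
		     int new_reuseport_array_fd)
{
	if (sys_bpf_map_update(outer_map_fd, &index,
			       &new_reuseport_array_fd, BPF_ANY) < 0)
		return -errno;
	return 0;
}
```

After switch_inner_map() succeeds, the old process still owns its fds
and can keep draining them before closing.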

During map_update_elem(), only a SO_REUSEPORT sk that has already been
added to a reuse->socks[] can be used.  That means a SO_REUSEPORT sk that
has gone through "bind()" for UDP or "bind()+listen()" for TCP.  These
conditions are enforced in "reuseport_array_update_check()".

A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map).  SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage on the BPF side.

When a SO_REUSEPORT sk is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also.  It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->socks[]" has ever been added to a bpf map.
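
The update and detach rules above can be illustrated with a toy
userspace model.  This is not the kernel implementation: the toy_*
names, the fixed slot count, and the specific error codes are
assumptions made only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_SLOTS 4

struct toy_sock {
	int in_reuseport_group;		/* set after bind() or bind()+listen() */
	struct toy_sock **map_slot;	/* slot holding this sk, NULL if none */
};

struct toy_array {
	struct toy_sock *slots[TOY_SLOTS];
};

/* Model of map_update_elem(): error codes are assumed, not authoritative. */
int toy_update(struct toy_array *a, unsigned int idx, struct toy_sock *sk)
{
	assert(a && sk);
	if (idx >= TOY_SLOTS)
		return -E2BIG;
	if (!sk->in_reuseport_group)
		return -EINVAL;		/* not yet in a reuse->socks[] */
	if (sk->map_slot)
		return -EBUSY;		/* a sk can only be added once */
	if (a->slots[idx])
		a->slots[idx]->map_slot = NULL;	/* replaced sk leaves the map */
	a->slots[idx] = sk;
	sk->map_slot = &a->slots[idx];
	return 0;
}

/* Model of close(): leaving the group also removes the sk from the map. */
void toy_detach(struct toy_sock *sk)
{
	assert(sk);
	if (sk->map_slot) {
		*sk->map_slot = NULL;
		sk->map_slot = NULL;
	}
	sk->in_reuseport_group = 0;
}
```

In this model, adding the same toy_sock to a second slot fails, and
toy_detach() leaves the slot empty without any explicit map_delete().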

map_update()/map_delete() have to stay in sync with
"reuse->socks[]".  Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also.  Care has
been taken to ensure the lock is only acquired when the
sk being added passes some strict tests, and that
freeing the map does not require the reuseport_lock.

The reuseport_array will also support lookup from the syscall
side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).

The lookup cookie is 64 bits, but that goes against the logical userspace
expectation of a 32-bit sizeof(fd) (which other fd-based bpf maps also use).
It may catch users by surprise if we enforce value_size=8 while
userspace still passes a 32-bit fd during update.  Supporting different
value_size between lookup and update also seems unintuitive.

We also need to consider what happens if other existing fd-based maps want
to return a 64-bit value from the syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, on the
assumption that users will usually use value_size=4.  The syscall's lookup
will return ENOSPC on value_size=4.  It will only
return the 64-bit value from sock_gen_cookie() when the user consciously
chooses value_size=8 (as a signal that lookup is desired), which then
requires a 64-bit value in both lookup and update.
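
The value_size rules above can be summarized with a small userspace
model.  It is a sketch of the described behavior, not the kernel code:
the toy_* names are hypothetical, the cookie is a stand-in for
sock_gen_cookie(), and every error code other than ENOSPC is an
assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for one reuseport_array slot; not the kernel structure. */
struct toy_slot {
	uint32_t value_size;	/* chosen at map creation: 4 or 8 */
	int has_sk;		/* non-zero once a sk has been stored */
	uint64_t cookie;	/* stand-in for sock_gen_cookie(sk) */
};

/* Map creation accepts only value_size 4 or 8 (error code assumed). */
int toy_check_value_size(uint32_t value_size)
{
	return (value_size == 4 || value_size == 8) ? 0 : -EINVAL;
}

/* Syscall-side lookup: only value_size=8 can carry the 64-bit cookie. */
int toy_lookup(const struct toy_slot *slot, void *value)
{
	assert(slot && value);
	if (slot->value_size != sizeof(uint64_t))
		return -ENOSPC;
	if (!slot->has_sk)
		return -ENOENT;	/* assumed error for an empty slot */
	memcpy(value, &slot->cookie, sizeof(slot->cookie));
	return 0;
}

/* Update: read the fd using the width the user chose at creation. */
int toy_read_fd(const struct toy_slot *slot, const void *value, int *fd)
{
	assert(slot && value && fd);
	if (slot->value_size == sizeof(uint64_t)) {
		uint64_t fd64;

		memcpy(&fd64, value, sizeof(fd64));
		if (fd64 > INT32_MAX)
			return -EINVAL;	/* assumed: an fd must fit in 32 bits */
		*fd = (int)fd64;
	} else {
		int32_t fd32;

		memcpy(&fd32, value, sizeof(fd32));
		*fd = fd32;
	}
	return 0;
}
```

With value_size=4 the model accepts a 32-bit fd on update but refuses
lookup with ENOSPC; with value_size=8 both directions use a 64-bit value.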

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:46 +02:00
..
6lowpan 6lowpan: iphc: reset mac_header after decompress to fix panic 2018-07-06 12:32:12 +02:00
9p net:mod: remove unneeded variable 'ret' in init_p9 2018-08-08 09:40:44 -07:00
802
8021q net: remove blank lines at end of file 2018-07-24 14:10:43 -07:00
appletalk Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
atm net: simplify sock_poll_wait 2018-07-30 09:10:25 -07:00
ax25 ax25: remove blank line at EOF 2018-07-24 14:10:42 -07:00
batman-adv Merge ra.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux 2018-07-20 21:17:12 -07:00
bluetooth Bluetooth: hidp: buffer overflow in hidp_process_report 2018-08-01 09:12:35 +02:00
bpf bpf/test_run: support cgroup local storage 2018-08-03 00:47:32 +02:00
bpfilter bpfilter: remove trailing newline 2018-07-24 14:10:42 -07:00
bridge net/bridge/br_multicast: remove redundant variable "err" 2018-08-06 10:33:44 -07:00
caif net: simplify sock_poll_wait 2018-07-30 09:10:25 -07:00
can Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
ceph The main piece is a set of libceph changes that revamps how OSD 2018-06-15 07:24:58 +09:00
core bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY 2018-08-11 01:58:46 +02:00
dcb net: dcb: Add priority-to-DSCP map getters 2018-07-27 13:17:50 -07:00
dccp Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
decnet decnet: whitespace fixes 2018-07-24 14:10:42 -07:00
dns_resolver net: remove blank lines at end of file 2018-07-24 14:10:43 -07:00
dsa Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
ethernet net: Convert GRO SKB handling to list_head. 2018-06-26 11:33:04 +09:00
hsr
ieee802154 net: ieee802154: 6lowpan: remove redundant pointers 'fq' and 'net' 2018-08-06 11:21:15 +02:00
ife net: sched: ife: check on metadata length 2018-04-22 21:12:00 -04:00
ipv4 ipv4: frags: precedence bug in ip_expire() 2018-08-06 13:15:12 -07:00
ipv6 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
iucv net:af_iucv: get rid of the unneeded variable 'err' in afiucv_pm_freeze 2018-08-08 09:39:36 -07:00
kcm net: remove blank lines at end of file 2018-07-24 14:10:43 -07:00
key Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2018-07-27 09:33:37 -07:00
l2tp Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-05 13:04:31 -07:00
l3mdev
lapb
llc Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
mac80211 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-07-24 19:21:58 -07:00
mac802154 net: mac802154: tx: expand tailroom if necessary 2018-08-06 11:21:37 +02:00
mpls mpls: remove trailing whitepace 2018-07-24 14:10:42 -07:00
ncsi net/ncsi: Use netdev_dbg for debug messages 2018-06-20 07:26:58 +09:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2018-08-05 16:25:22 -07:00
netlabel audit: use inline function to get audit context 2018-05-14 17:24:18 -04:00
netlink Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-05 13:04:31 -07:00
netrom Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
nfc net: simplify sock_poll_wait 2018-07-30 09:10:25 -07:00
nsh nsh: set mac len based on inner packet 2018-07-12 16:55:29 -07:00
openvswitch Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-02 10:55:32 -07:00
packet Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
phonet Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
psample
qrtr net: qrtr: Reset the node and port ID of broadcast messages 2018-07-05 20:20:03 +09:00
rds RDS: IB: fix 'passing zero to ERR_PTR()' warning 2018-08-07 13:19:45 -07:00
rfkill rfkill: Create rfkill-none LED trigger 2018-05-23 11:26:45 +02:00
rose Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
rxrpc Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
sched net: sched: cls_flower: set correct offload data in fl_reoffload 2018-08-07 12:35:17 -07:00
sctp sctp: whitespace fixes 2018-07-24 14:10:42 -07:00
smc Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
strparser strparser: remove redundant variable 'rd_desc' 2018-08-01 10:00:06 -07:00
sunrpc net: Remove some unneeded semicolon 2018-08-04 13:05:39 -07:00
switchdev
tipc Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-09 11:52:36 -07:00
tls net/tls: Mark the end in scatterlist table 2018-08-05 17:13:58 -07:00
unix af_unix: ensure POLLOUT on remote close() for connected dgram socket 2018-08-03 16:44:19 -07:00
vmw_vsock vsock: split dwork to avoid reinitializations 2018-08-07 12:39:13 -07:00
wimax wimax: remove blank lines at EOF 2018-07-24 14:10:42 -07:00
wireless Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-07-24 19:21:58 -07:00
x25 x25: remove blank lines at EOF 2018-07-24 14:10:42 -07:00
xdp Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-05 13:04:31 -07:00
xfrm Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-02 10:55:32 -07:00
Kconfig net: remove blank lines at end of file 2018-07-24 14:10:43 -07:00
Makefile bpfilter: check compiler capability in Kconfig 2018-06-28 13:36:39 +09:00
compat.c net: avoid unnecessary sock_flag() check when enable timestamp 2018-08-06 10:42:48 -07:00
socket.c Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-08-02 10:55:32 -07:00
sysctl_net.c net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00