OpenCloudOS-Kernel/net
Daniel Borkmann 1b66d25361 bpf: Add get{peer, sock}name attach types for sock_addr
As stated in 983695fa67 ("bpf: fix unconnected udp hooks"), the objective
for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
transparent to applications. In Cilium we make use of these hooks [0] in
order to enable E-W load balancing for existing Kubernetes service types
for all Cilium managed nodes in the cluster. Those backends can be local
or remote. The main advantage of this approach is that it operates as close
as possible to the socket, and therefore allows to avoid packet-based NAT
given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.

This also allows to expose NodePort services on loopback addresses in the
host namespace, for example. As another advantage, this also efficiently
blocks bind requests for applications in the host namespace for exposed
ports. However, one missing item is that we also need to perform reverse
xlation for inet{,6}_getname() hooks such that we can return the service
IP/port tuple back to the application instead of the remote peer address.

The vast majority of applications does not bother about getpeername(), but
in a few occasions we've seen breakage when validating the peer's address
since it returns unexpectedly the backend tuple instead of the service one.
Therefore, this trivial patch allows to customise and adds a getpeername()
as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
to address this situation.

Simple example:

  # ./cilium/cilium service list
  ID   Frontend     Service Type   Backend
  1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80

Before; curl's verbose output example, no getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

After; with getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
  > GET / HTTP/1.1
  >  Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
peer to the context similar as in inet{,6}_getname() fashion, but API-wise
this is suboptimal as it always enforces programs having to test for ctx->peer
which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
Similarly, the checked return code is on tnum_range(1, 1), but if a use case
comes up in future, it can easily be changed to return an error code instead.
Helper and ctx member access is the same as with connect/sendmsg/etc hooks.

  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19 11:32:04 -07:00
..
6lowpan
9p 9pnet: allow making incomplete read requests 2020-03-27 09:29:56 +00:00
802 net: 802: psnap.c: Use built-in RCU list checking 2020-02-24 13:02:53 -08:00
8021q netpoll: accept NULL np argument in netpoll_send_skb() 2020-05-07 18:11:07 -07:00
appletalk appletalk: enforce CAP_NET_RAW for raw sockets 2019-09-24 16:37:18 +02:00
atm Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-05-06 22:10:13 -07:00
ax25 docs: networking: convert ax25.txt to ReST 2020-04-28 14:38:38 -07:00
batman-adv Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-05-06 22:10:13 -07:00
bluetooth Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next 2020-05-13 12:20:12 -07:00
bpf bpf: Fix too large copy from user in bpf_test_init 2020-05-19 17:56:34 +02:00
bpfilter SPDX patches for 5.7-rc1. 2020-04-03 13:12:26 -07:00
bridge net: bridge: allow enslaving some DSA master network devices 2020-05-10 19:52:33 -07:00
caif net: caif: Fix use correct return type for ndo_start_xmit() 2020-04-30 12:13:07 -07:00
can can: j1939: j1939_sk_bind(): take priv after lock is held 2019-12-08 11:52:02 +01:00
ceph docs: networking: convert dns_resolver.txt to ReST 2020-04-28 14:39:46 -07:00
core bpf: Add get{peer, sock}name attach types for sock_addr 2020-05-19 11:32:04 -07:00
dcb
dccp dccp: remove unused inline function dccp_set_seqno 2020-04-25 20:42:57 -07:00
decnet Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2020-05-01 17:02:27 -07:00
dns_resolver docs: networking: convert dns_resolver.txt to ReST 2020-04-28 14:39:46 -07:00
dsa net: dsa: tag_sja1105: appease sparse checks for ethertype accessors 2020-05-12 18:02:42 -07:00
ethernet net: remove eth_change_mtu 2020-01-27 11:09:31 +01:00
ethtool net: phy: Send notifier when starting the cable test 2020-05-10 12:28:41 -07:00
hsr hsr: remove WARN_ONCE() in hsr_fill_frame_info() 2020-05-07 17:40:02 -07:00
ieee802154 ieee802154: 6lowpan: remove unnecessary comparison 2020-05-08 22:25:10 -07:00
ife net: Fix Kconfig indentation 2019-09-26 08:56:17 +02:00
ipv4 bpf: Add get{peer, sock}name attach types for sock_addr 2020-05-19 11:32:04 -07:00
ipv6 bpf: Add get{peer, sock}name attach types for sock_addr 2020-05-19 11:32:04 -07:00
iucv treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
kcm net: kcm: kcmproc.c: Fix RCU list suspicious usage warning 2020-03-16 17:14:02 -07:00
key
l2tp net: partially revert dynamic lockdep key changes 2020-05-04 12:05:56 -07:00
l3mdev
lapb docs: networking: convert lapb-module.txt to ReST 2020-04-30 12:56:35 -07:00
llc af_llc: fix if-statement empty body warning 2020-02-26 20:38:13 -08:00
mac80211 docs: networking: convert mac80211-injection.txt to ReST 2020-04-30 12:56:36 -07:00
mac802154
mpls sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
mptcp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-05-15 13:48:59 -07:00
ncsi net/ncsi: Support for multi host mellanox card 2020-01-09 18:36:22 -08:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-05-15 13:48:59 -07:00
netlabel netlabel: cope with NULL catmap 2020-05-12 18:12:40 -07:00
netlink bpf: Enable bpf_iter targets registering ctx argument types 2020-05-13 12:30:50 -07:00
netrom net: partially revert dynamic lockdep key changes 2020-05-04 12:05:56 -07:00
nfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-03-12 22:34:48 -07:00
nsh
openvswitch net: openvswitch: use div_u64() for 64-by-32 divisions 2020-04-25 20:48:21 -07:00
packet net/packet: tpacket_rcv: avoid a producer race condition 2020-03-15 00:25:25 -07:00
phonet sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
psample net: psample: fix skb_over_panic 2019-11-26 14:40:13 -08:00
qrtr net: qrtr: Do not depend on ARCH_QCOM 2020-05-07 13:21:12 -07:00
rds Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-04-28 21:23:38 +02:00
rfkill rfkill: Fix incorrect check to avoid NULL pointer dereference 2019-12-16 10:15:49 +01:00
rose net: partially revert dynamic lockdep key changes 2020-05-04 12:05:56 -07:00
rxrpc docs: networking: convert rxrpc.txt to ReST 2020-04-30 12:56:38 -07:00
sched net: sched: cls_flower: implement terse dump support 2020-05-15 10:23:11 -07:00
sctp Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-04-28 21:23:38 +02:00
smc net/smc: remove set but not used variables 'del_llc, del_llc_resp' 2020-05-07 18:05:07 -07:00
strparser
sunrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-05-15 13:48:59 -07:00
switchdev net: switchdev: do not propagate bridge updates across bridges 2020-02-26 20:58:33 -08:00
tipc tipc: fix failed service subscription deletion 2020-05-13 12:33:19 -07:00
tls net/tls: Fix sk_psock refcnt leak when in tls_data_ready() 2020-04-27 11:22:38 -07:00
unix net: datagram: drop 'destructor' argument from several helpers 2020-02-28 12:12:53 -08:00
vmw_vsock vsock/virtio: fix multiple packet delivery to monitoring devices 2020-04-27 10:18:01 -07:00
wimax wimax: no need to check return value of debugfs_create functions 2019-08-10 15:25:47 -07:00
wireless netlink: remove NLA_EXACT_LEN_WARN 2020-04-30 17:51:42 -07:00
x25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-05-06 22:10:13 -07:00
xdp xsk: Remove unnecessary member in xdp_umem 2020-05-04 22:56:26 +02:00
xfrm Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2020-03-30 10:59:20 -07:00
Kconfig net: ethtool: netlink: Add support for triggering a cable test 2020-05-10 12:28:41 -07:00
Makefile mptcp: Add MPTCP socket stubs 2020-01-24 13:44:07 +01:00
compat.c net: cleanly handle kernel vs user buffers for ->msg_control 2020-05-11 16:59:16 -07:00
socket.c net: cleanly handle kernel vs user buffers for ->msg_control 2020-05-11 16:59:16 -07:00
sysctl_net.c