OpenCloudOS-Kernel/kernel
Daniel Borkmann 1b66d25361 bpf: Add get{peer, sock}name attach types for sock_addr
As stated in 983695fa67 ("bpf: fix unconnected udp hooks"), the objective
for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
transparent to applications. In Cilium we make use of these hooks [0] in
order to enable E-W load balancing for existing Kubernetes service types
for all Cilium managed nodes in the cluster. Those backends can be local
or remote. The main advantage of this approach is that it operates as close
as possible to the socket, and therefore allows to avoid packet-based NAT
given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.

This also allows to expose NodePort services on loopback addresses in the
host namespace, for example. As another advantage, this also efficiently
blocks bind requests for applications in the host namespace for exposed
ports. However, one missing item is that we also need to perform reverse
xlation for inet{,6}_getname() hooks such that we can return the service
IP/port tuple back to the application instead of the remote peer address.

The vast majority of applications does not bother about getpeername(), but
in a few occasions we've seen breakage when validating the peer's address
since it returns unexpectedly the backend tuple instead of the service one.
Therefore, this trivial patch allows to customise and adds a getpeername()
as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
to address this situation.

Simple example:

  # ./cilium/cilium service list
  ID   Frontend     Service Type   Backend
  1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80

Before; curl's verbose output example, no getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

After; with getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
  > GET / HTTP/1.1
  >  Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
peer to the context similar as in inet{,6}_getname() fashion, but API-wise
this is suboptimal as it always enforces programs having to test for ctx->peer
which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
Similarly, the checked return code is on tnum_range(1, 1), but if a use case
comes up in future, it can easily be changed to return an error code instead.
Helper and ctx member access is the same as with connect/sendmsg/etc hooks.

  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19 11:32:04 -07:00
..
bpf bpf: Add get{peer, sock}name attach types for sock_addr 2020-05-19 11:32:04 -07:00
cgroup bpf: Refactor bpf_link update handling 2020-04-28 17:27:07 -07:00
configs compiler: remove CONFIG_OPTIMIZE_INLINING entirely 2020-04-07 10:43:42 -07:00
debug SPDX patches for 5.7-rc1. 2020-04-03 13:12:26 -07:00
dma dma-debug: fix displaying of dma allocation type 2020-04-08 21:46:57 +02:00
events Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-04-28 21:23:38 +02:00
gcov kernel/gcov/fs.c: gcov_seq_next() should increase position index 2020-04-10 15:36:22 -07:00
irq genirq: Remove setup_irq() and remove_irq() 2020-04-14 10:08:50 +02:00
livepatch New tracing features: 2019-11-27 11:42:01 -08:00
locking locking/lockdep: Improve 'invalid wait context' splat 2020-04-08 12:05:07 +02:00
power PM: hibernate: Freeze kernel threads in software_resume() 2020-04-27 10:30:30 +02:00
printk sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
rcu Merge branch 'urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent 2020-04-14 08:36:41 +02:00
sched Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-04-28 21:23:38 +02:00
time Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-04-28 21:23:38 +02:00
trace Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-05-15 13:48:59 -07:00
.gitignore .gitignore: add SPDX License Identifier 2020-03-25 11:50:48 +01:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks sched/rt, locking: Use CONFIG_PREEMPTION 2019-12-08 14:37:36 +01:00
Kconfig.preempt sched/Kconfig: Fix spelling mistake in user-visible help text 2019-11-12 11:35:32 +01:00
Makefile kcov: ignore fault-inject and stacktrace 2020-01-31 10:30:41 -08:00
acct.c acct: stop using get_seconds() 2019-12-18 18:07:31 +01:00
async.c
audit.c audit: check the length of userspace generated audit records 2020-04-20 17:10:58 -04:00
audit.h audit: trigger accompanying records when no rules present 2020-03-12 10:42:51 -04:00
audit_fsnotify.c fsnotify: use helpers to access data by data_type 2020-03-23 18:19:06 +01:00
audit_tree.c
audit_watch.c \n 2020-04-06 08:58:42 -07:00
auditfilter.c audit: fix error handling in audit_data_to_entry() 2020-02-22 20:36:47 -05:00
auditsc.c audit: trigger accompanying records when no rules present 2020-03-12 10:42:51 -04:00
backtracetest.c
bounds.c
capability.c
compat.c y2038: remove unused time32 interfaces 2020-02-21 11:22:15 -08:00
configs.c proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
context_tracking.c context-tracking: Introduce CONFIG_HAVE_TIF_NOHZ 2020-02-14 16:05:04 +01:00
cpu.c CPU (hotplug) updates: 2020-03-30 18:06:39 -07:00
cpu_pm.c
crash_core.c
crash_dump.c
cred.c kernel: doc: remove outdated comment cred.c 2020-03-25 10:04:01 -05:00
delayacct.c
dma.c
elfcore.c kernel/elfcore.c: include proper prototypes 2019-09-25 17:51:39 -07:00
exec_domain.c
exit.c proc: Put thread_pid in release_task not proc_flush_pid 2020-04-24 15:49:00 -05:00
extable.c kernel/extable.c: use address-of operator on section symbols 2020-04-07 10:43:42 -07:00
fail_function.c
fork.c fork: prevent accidental access to clone3 features 2020-05-08 17:31:50 +02:00
freezer.c Revert "libata, freezer: avoid block device removal while system is frozen" 2019-10-06 09:11:37 -06:00
futex.c Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2020-03-30 16:17:15 -07:00
gen_kheaders.sh kheaders: explain why include/config/autoconf.h is excluded from md5sum 2019-11-11 20:10:01 +09:00
groups.c
hung_task.c
iomem.c
irq_work.c lockdep: Annotate irq_work 2020-03-21 16:00:24 +01:00
jump_label.c jump_label: Don't warn on __exit jump entries 2019-08-29 15:10:10 +01:00
kallsyms.c kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol() 2020-04-07 10:43:44 -07:00
kcmp.c kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve 2020-03-25 10:04:01 -05:00
kcov.c kernel/kcov.c: fix typos in kcov_remote_start documentation 2020-05-07 19:27:20 -07:00
kexec.c kexec: add machine_kexec_post_load() 2020-01-08 16:32:55 +00:00
kexec_core.c kexec: add machine_kexec_post_load() 2020-01-08 16:32:55 +00:00
kexec_elf.c kexec_elf: support 32 bit ELF files 2019-09-06 23:58:44 +02:00
kexec_file.c kexec: add machine_kexec_post_load() 2020-01-08 16:32:55 +00:00
kexec_internal.h kexec: add machine_kexec_post_load() 2020-01-08 16:32:55 +00:00
kheaders.c
kmod.c kmod: make request_module() return an error when autoloading is disabled 2020-04-10 15:36:22 -07:00
kprobes.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
ksysfs.c
kthread.c kthread: Do not preempt current task if it is going to call schedule() 2020-03-20 13:06:20 +01:00
latencytop.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
module-internal.h
module.c kernel/module: Hide vermagic header file from general use 2020-04-21 13:27:37 -07:00
module_signature.c
module_signing.c
notifier.c x86/mm: split vmalloc_sync_all() 2020-03-21 18:56:06 -07:00
nsproxy.c ns: Introduce Time Namespace 2020-01-14 12:20:48 +01:00
padata.c crypto: pcrypt - simplify error handling in pcrypt_create_aead() 2020-03-06 12:28:24 +11:00
panic.c locking/refcount: Remove unused 'refcount_error_report()' function 2019-11-25 09:15:42 +01:00
params.c lockdown: Lock down module params that specify hardware parameters (eg. ioport) 2019-08-19 21:54:16 -07:00
pid.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-04-10 12:59:56 -07:00
pid_namespace.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
profile.c proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
ptrace.c ptrace: reintroduce usage of subjective credentials in ptrace_has_cap() 2020-01-18 13:51:39 +01:00
range.c
reboot.c
relay.c
resource.c mm/memory_hotplug.c: use PFN_UP / PFN_DOWN in walk_system_ram_range() 2019-09-24 15:54:09 -07:00
rseq.c rseq: Reject unknown flags on rseq unregister 2019-12-25 10:41:20 +01:00
seccomp.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
signal.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-04-23 13:30:18 -07:00
smp.c cpu/hotplug: Move bringup of secondary CPUs out of smp_init() 2020-03-25 12:59:37 +01:00
smpboot.c
smpboot.h
softirq.c lockdep: Rename trace_{hard,soft}{irq_context,irqs_enabled}() 2020-03-21 16:03:54 +01:00
stackleak.c
stacktrace.c stacktrace: Get rid of unneeded '!!' pattern 2019-11-11 10:30:59 +01:00
stop_machine.c stop_machine: Make stop_cpus() static 2020-01-17 10:19:21 +01:00
sys.c sys/sysinfo: Respect boottime inside time namespace 2020-03-03 19:34:32 +01:00
sys_ni.c y2038: allow disabling time32 system calls 2019-11-15 14:38:30 +01:00
sysctl-test.c kunit: allow kunit tests to be loaded as a module 2020-01-09 16:42:29 -07:00
sysctl.c sysctl: Fix unused function warning 2020-05-05 11:59:32 -07:00
sysctl_binary.c sysctl: Remove the sysctl system call 2019-11-26 13:03:56 -06:00
task_work.c task_work_run: don't take ->pi_lock unconditionally 2020-03-02 14:06:33 -07:00
taskstats.c taskstats: fix data-race 2019-12-04 15:18:39 +01:00
test_kprobes.c
torture.c CPU (hotplug) updates: 2020-03-30 18:06:39 -07:00
tracepoint.c
tsacct.c tsacct: add 64-bit btime field 2019-12-18 18:07:31 +01:00
ucount.c ucount: Make sure ucounts in /proc/sys/user don't regress again 2020-04-07 21:51:27 +02:00
uid16.c
uid16.h
umh.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-05-15 13:48:59 -07:00
up.c smp/up: Make smp_call_function_single() match SMP semantics 2020-02-07 15:34:12 +01:00
user-return-notifier.c
user.c
user_namespace.c
utsname.c
utsname_sysctl.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
watchdog.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
watchdog_hld.c
workqueue.c workqueue: Remove the warning in wq_worker_sleeping() 2020-04-08 11:35:20 +02:00
workqueue_internal.h