Go to file
Daniel Borkmann 7828f20e37 Merge branch 'bpf-cgroup-bind-connect'
Andrey Ignatov says:

====================
v2->v3:
- rebase due to conflicts
- fix ipv6=m build

v1->v2:
- support expected_attach_type at prog load time so that prog (incl.
  context accesses and calls to helpers) can be validated with regard to
  specific attach point it is supposed to be attached to.
  Later, at attach time, attach type is checked so that it must be same as
  at load time if it was provided
- reworked hooks to rely on expected_attach_type, and reduced number of new
  prog types from 6 to just 1: BPF_PROG_TYPE_CGROUP_SOCK_ADDR
- reused BPF_PROG_TYPE_CGROUP_SOCK for sys_bind post-hooks
- add selftests for post-sys_bind hook

For our container management we've been using complicated and fragile setup
consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
all containerized applications. Unfortunately it doesn't work for apps that
don't use glibc and changing all applications that run in the datacenter
is not possible due to 3rd party code and libraries (despite being
open source code) and sheer amount of legacy code that has to be rewritten
(we're rewriting what we can in parallel)

These applications are written without containers in mind and have
builtin assumptions about network services. Like an application X
expects to connect localhost:special_port and find service Y in there.
To move application X and service Y into two different containers
LD_PRELOAD approach is used to help one service connect to another
without rewriting them.
Moving these two applications into different L2 (netns) or L3 (vrf)
network isolation scopes doesn't help to solve the problem, since
applications need to see each other like they were running on
the host without containers.
So if app X and app Y would run in different netns something
would need to punch a connectivity hole in those namespaces.
That would be real layering violation (with corresponding
network debugging pains), since clean l2, l3 abstraction would
suddenly support something that breaks through the layers.

Instead we used LD_PRELOAD (and now bpf programs) at bind/connect
time to help applications discover and connect to each other.
All applications are running in init_nens and there are no vrfs.
After bind/connect the normal fib/neighbor core networking
logic works as it should always do and the whole system is
clean from network point of view and can be debugged with
standard tools.

We also considered resurrecting Hannes's afnetns work,
but all hierarchical namespace abstraction don't work due
to these builtin networking assumptions inside the apps.
To run an application inside cgroup container that was not written
with containers in mind we have to make an illusion of running
in non-containerized environment.
In some cases we remember the port and container id in the post-bind hook
in a bpf map and when some other task in a different container is trying
to connect to a service we need to know where this service is running.
It can be remote and can be local. Both client and service may or may not
be written with containers in mind and this sockaddr rewrite is providing
connectivity and load balancing feature.

BPF+cgroup looks to be the best solution for this problem.
Hence we introduce 3 hooks:
- at entry into sys_bind and sys_connect
  to let bpf prog look and modify 'struct sockaddr' provided
  by user space and fail bind/connect when appropriate
- post sys_bind after port is allocated

The approach works great and has zero overhead for anyone who doesn't
use it and very low overhead when deployed.

Different use case for this feature is to do low overhead firewall
that doesn't need to inspect all packets and works at bind/connect time.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:18:07 +02:00
Documentation Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
LICENSES LICENSES: Add MPL-1.1 license 2018-01-06 10:59:44 -07:00
arch Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
block for-linus-20180302 2018-03-02 09:35:36 -08:00
certs certs/blacklist_nohashes.c: fix const confusion in certs blacklist 2018-02-21 15:35:43 -08:00
crypto X.509: fix NULL dereference when restricting key with unsupported_sig 2018-02-22 14:38:34 +00:00
drivers nfp: bpf: improve wrong FW response warnings 2018-03-28 19:36:14 -07:00
firmware kbuild: remove all dummy assignments to obj- 2017-11-18 11:46:06 +09:00
fs Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
include bpf: Post-hooks for sys_bind 2018-03-31 02:16:26 +02:00
init jump_label: Explicitly disable jump labels in __init code 2018-02-21 16:54:05 +01:00
ipc vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
kernel bpf: Post-hooks for sys_bind 2018-03-31 02:16:26 +02:00
lib lib/scatterlist: add sg_init_marker() helper 2018-03-30 22:50:15 +02:00
mm mm, thp: do not cause memcg oom for thp 2018-03-22 17:07:02 -07:00
net bpf: Post-hooks for sys_bind 2018-03-31 02:16:26 +02:00
samples bpf: sockmap, more BPF_SK_SKB_STREAM_VERDICT tests 2018-03-30 00:09:43 +02:00
scripts kbuild: Handle builtin dtb file names containing hyphens 2018-03-09 01:14:38 +09:00
security macro: introduce COUNT_ARGS() macro 2018-03-28 22:55:19 +02:00
sound treewide: remove large struct-pass-by-value from tracepoint arguments 2018-03-28 22:55:18 +02:00
tools selftests/bpf: Selftest for sys_bind post-hooks. 2018-03-31 02:16:40 +02:00
usr initramfs: fix initramfs rebuilds w/ compression after disabling 2017-11-03 07:39:19 -07:00
virt kvm/arm fixes for 4.16, take 2 2018-03-15 21:45:37 +01:00
.cocciconfig scripts: add Linux .cocciconfig for coccinelle 2016-07-22 12:13:39 +02:00
.get_maintainer.ignore
.gitattributes .gitattributes: set git diff driver for C source code files 2016-10-07 18:46:30 -07:00
.gitignore .gitignore: ignore ASN.1 auto generated files 2018-02-14 21:05:38 +01:00
.mailmap mailmap: update Mark Yao's email address 2018-01-04 16:45:09 -08:00
COPYING
CREDITS MAINTAINERS: update TPM driver infrastructure changes 2017-11-09 17:58:40 -08:00
Kbuild Kbuild updates for v4.15 2017-11-17 17:45:29 -08:00
Kconfig License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
MAINTAINERS Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-23 11:31:58 -04:00
Makefile Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-03-22 14:10:29 -07:00
README README: add a new README file, pointing to the Documentation/ 2016-10-24 08:12:35 -02:00

README

Linux kernel
============

This file was moved to Documentation/admin-guide/README.rst

Please notice that there are several guides for kernel developers and users.
These guides can be rendered in a number of formats, like HTML and PDF.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
See Documentation/00-INDEX for a list of what is contained in each file.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.