OpenCloudOS-Kernel/net
Alexei Starovoitov 04fd61ab36 bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  bpf_tail_call(ctx, &jmp_table, index);
  ...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  if (jmp_table[index])
    return (*jmp_table[index])(ctx);
  ...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.

bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table

Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.

New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.

The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.

Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>

==========
- simplify complex programs by splitting them into a sequence of small programs

- dispatch routine
  For tracing and future seccomp the program may be triggered on all system
  calls, but processing of syscall arguments will be different. It's more
  efficient to implement them as:
  int syscall_entry(struct seccomp_data *ctx)
  {
     bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
     ... default: process unknown syscall ...
  }
  int sys_write_event(struct seccomp_data *ctx) {...}
  int sys_read_event(struct seccomp_data *ctx) {...}
  syscall_jmp_table[__NR_write] = sys_write_event;
  syscall_jmp_table[__NR_read] = sys_read_event;

  For networking the program may call into different parsers depending on
  packet format, like:
  int packet_parser(struct __sk_buff *skb)
  {
     ... parse L2, L3 here ...
     __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
     bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
     ... default: process unknown protocol ...
  }
  int parse_tcp(struct __sk_buff *skb) {...}
  int parse_udp(struct __sk_buff *skb) {...}
  ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
  ipproto_jmp_table[IPPROTO_UDP] = parse_udp;

- for TC use case, bpf_tail_call() allows to implement reclassify-like logic

- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
  are atomic, so user space can build chains of BPF programs on the fly

Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
  It could have been implemented without JIT changes as a wrapper on top of
  BPF_PROG_RUN() macro, but with two downsides:
  . all programs would have to pay performance penalty for this feature and
    tail call itself would be slower, since mandatory stack unwind, return,
    stack allocate would be done for every tailcall.
  . tailcall would be limited to programs running preempt_disabled, since
    generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
    need to be either global per_cpu variable accessed by helper and by wrapper
    or global variable protected by locks.

  In this implementation x64 JIT bypasses stack unwind and jumps into the
  callee program after prologue.

- bpf_prog_array_compatible() ensures that prog_type of callee and caller
  are the same and JITed/non-JITed flag is the same, since calling JITed
  program from non-JITed is invalid, since stack frames are different.
  Similarly calling kprobe type program from socket type program is invalid.

- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
  abstraction, its user space API and all of verifier logic.
  It's in the existing arraymap.c file, since several functions are
  shared with regular array map.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 17:07:59 -04:00
..
6lowpan 6lowpan: nhc: add other known rfc6282 compressions 2015-02-14 23:08:44 +01:00
9p 9p: patches for 4.1 merge window 2015-04-18 17:45:30 -04:00
802 net: Kill dev_rebuild_header 2015-03-02 16:43:41 -05:00
8021q vlan: implement ndo_get_iflink 2015-04-02 14:05:00 -04:00
appletalk net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
atm net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
ax25 net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
batman-adv dev: introduce dev_get_iflink() 2015-04-02 14:04:59 -04:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
bridge bridge_netfilter: No ICMP packet on IPv4 fragmentation error 2015-05-19 00:15:39 -04:00
caif net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
can net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
ceph net: Add a struct net parameter to sock_create_kern 2015-05-11 10:50:17 -04:00
core bpf: allow bpf programs to tail-call other bpf programs 2015-05-21 17:07:59 -04:00
dcb net/dcb: Add IEEE QCN attribute 2015-03-06 21:50:02 -05:00
dccp inet: fix possible panic in reqsk_queue_unlink() 2015-04-24 11:39:15 -04:00
decnet net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
dns_resolver Merge commit 'v3.16' into next 2014-10-01 00:44:04 +10:00
dsa switchdev: don't use anonymous union on switchdev attr/obj structs 2015-05-13 14:20:59 -04:00
ethernet flow_dissect: use programable dissector in skb_flow_dissect and friends 2015-05-13 15:19:47 -04:00
hsr net/hsr: Fix NULL pointer dereference and refcnt bugs when deleting a HSR interface. 2015-03-01 13:40:23 -05:00
ieee802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
ipv4 tcp: add a force_schedule argument to sk_stream_alloc_skb() 2015-05-21 16:56:40 -04:00
ipv6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
ipx net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
irda net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
iucv net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
key net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
l2tp net: Modify sk_alloc to not reference count the netns of kernel sockets. 2015-05-11 10:50:18 -04:00
lapb lapb: move EXPORT_SYMBOL after functions. 2014-10-24 15:51:42 -04:00
llc net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
mac80211 This just has a few fixes: 2015-05-19 16:43:17 -04:00
mac802154 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth 2015-05-09 15:51:00 -04:00
mpls mpls: Change reserved label names to be consistent with netbsd 2015-05-09 22:29:50 -04:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2015-05-18 14:47:36 -04:00
netlabel netlink: implement nla_put_in_addr and nla_put_in6_addr 2015-03-31 13:58:35 -04:00
netlink netlink: Use random autobind rover 2015-05-17 23:43:31 -04:00
netrom net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
nfc net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
openvswitch geneve: Rename support library as geneve_core 2015-05-13 15:59:13 -04:00
packet net-packet: fix null pointer exception in rollover mode 2015-05-17 22:41:38 -04:00
phonet net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
rds Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
rfkill Last round of updates for net-next: 2015-02-04 14:57:45 -08:00
rose net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
rxrpc net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
sched cls_flower: Fix compile error 2015-05-14 13:34:35 -04:00
sctp net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
sunrpc svcrpc: fix potential GSSX_ACCEPT_SEC_CONTEXT decoding failures 2015-05-04 12:02:40 -04:00
switchdev switchdev: add support for fdb add/del/dump via switchdev_port_obj ops. 2015-05-17 22:49:09 -04:00
tipc tipc: use sock_create_kern interface to create kernel socket 2015-05-14 13:39:33 -04:00
unix net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
vmw_vsock net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
wimax wimax: convert printk to pr_foo() 2014-10-07 20:28:44 -04:00
wireless cfg80211: change GO_CONCURRENT to IR_CONCURRENT for STA 2015-05-06 15:50:02 +02:00
x25 net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
xfrm net: make skb_dst_pop routine static 2015-05-12 23:19:49 -04:00
Kconfig net: add CONFIG_NET_INGRESS to enable ingress filtering 2015-05-14 01:10:05 -04:00
Makefile mpls: Refactor how the mpls module is built 2015-03-04 00:26:06 -05:00
compat.c net: switch importing msghdr from userland to {compat_,}import_iovec() 2015-04-09 00:02:26 -04:00
socket.c net: Add a struct net parameter to sock_create_kern 2015-05-11 10:50:17 -04:00
sysctl_net.c