linux-sg2042/include
Alexei Starovoitov 6c90598174 bpf: pre-allocate hash map elements
If a kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following deadlock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock).

The following solutions were considered (and some implemented), but
eventually discarded:
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work

In the end, pre-allocation of all map elements turned out to be the simplest
solution, and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect user-space-visible behavior.

Since it's impossible to tell whether a kprobe is triggered in a location
that is safe from kmalloc's point of view, use pre-allocation by default
and introduce a new BPF_F_NO_PREALLOC flag.
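
For illustration, a minimal user-space sketch (not part of this patch) of how
the flag is passed at map creation time via the bpf(2) syscall, assuming
kernel headers that already define BPF_F_NO_PREALLOC; key/value sizes and
max_entries are arbitrary and error handling is omitted:

  #include <linux/bpf.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* create a hash map: flags == 0 keeps the new pre-allocation default,
   * flags == BPF_F_NO_PREALLOC selects the old kmalloc-on-update path
   */
  static int create_hash_map(unsigned int max_entries, unsigned int flags)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.map_type    = BPF_MAP_TYPE_HASH;
          attr.key_size    = sizeof(unsigned int);
          attr.value_size  = sizeof(unsigned long long);
          attr.max_entries = max_entries;
          attr.map_flags   = flags;

          return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
  }

Calling create_hash_map(10000, 0) gives the pre-allocated map; calling it
with BPF_F_NO_PREALLOC opts out.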

While testing per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
Pre-allocation of per-cpu hash elements solves this problem as well.

It turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is a very common pattern used
in many of the iovisor/bcc tools, so there is an additional benefit to
pre-allocation, since such use cases become much faster (a sketch of the
pattern is shown below).
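
A sketch of that pattern in the style of a bcc tool (illustrative only, not
taken from any particular tool; the traced function, map size, and names are
assumptions). bcc rewrites the .update()/.lookup()/.delete() calls into the
bpf_map_update_elem/bpf_map_lookup_elem/bpf_map_delete_elem helpers:

  #include <uapi/linux/ptrace.h>

  BPF_HASH(start, u32, u64, 10240);

  /* kprobe: record a timestamp for the current task (map update) */
  int trace_entry(struct pt_regs *ctx)
  {
          u32 pid = bpf_get_current_pid_tgid();
          u64 ts = bpf_ktime_get_ns();

          start.update(&pid, &ts);
          return 0;
  }

  /* kretprobe: fetch and drop the timestamp (map lookup + delete) */
  int trace_return(struct pt_regs *ctx)
  {
          u32 pid = bpf_get_current_pid_tgid();
          u64 *tsp = start.lookup(&pid);

          if (tsp) {
                  /* e.g. latency = bpf_ktime_get_ns() - *tsp; */
                  start.delete(&pid);
          }
          return 0;
  }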

Since all hash map elements are now pre-allocated, we can remove the
atomic increment of htab->count and save a few more cycles.

Also add bpf_map_precharge_memlock() to check rlimit_memlock early and avoid
a large malloc/free being done for users who don't have sufficient limits.
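
Roughly, the shape of that check, as a kernel-side sketch reconstructed from
the description above rather than a verbatim copy of the patch: the map's
page count is compared against the caller's RLIMIT_MEMLOCK and currently
locked pages before any allocation happens:

  #include <linux/bpf.h>
  #include <linux/sched.h>

  int bpf_map_precharge_memlock(u32 pages)
  {
          struct user_struct *user = get_current_user();
          unsigned long memlock_limit, cur;

          memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
          cur = atomic_long_read(&user->locked_vm);
          free_uid(user);
          if (cur + pages > memlock_limit)
                  return -EPERM;
          return 0;
  }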

Pre-allocation is done with vmalloc, and element alloc/free is done
via percpu_freelist (a rough sketch of that approach appears after the
comparison below). Here are performance numbers for the different
allocation algorithms that were implemented but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida	2.1M
pcpu_ida nolock	2.3M
bt		2.4M
kmalloc		1.8M
hlist+spinlock	2.3M
pcpu_freelist	2.6M

4 cpu:
pcpu_ida	1.5M
pcpu_ida nolock	1.8M
bt w/smp_align	1.7M
bt no/smp_align	1.1M
kmalloc		0.7M
hlist+spinlock	0.2M
pcpu_freelist	2.0M

8 cpu:
pcpu_ida	0.7M
bt w/smp_align	0.8M
kmalloc		0.4M
pcpu_freelist	1.5M

32 cpu:
kmalloc		0.13M
pcpu_freelist	0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than the existing percpu_ida, but not as fast as pcpu_freelist.

bt is a variant of block/blk-mq-tag.c simplified and customized
for the bpf use case. bt w/smp_align uses a cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates the 'long'
bitmasks contiguously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist.

hlist+spinlock is the simplest free list with a single spinlock.
As expected, it has very bad scaling in SMP.

kmalloc is the existing implementation, which is still available via the
BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu and,
in the 8 cpu setup, 3 times slower than pre-allocation with pcpu_freelist,
but it saves memory, so in cases where map->max_entries can be large
and the number of map updates/deletes per second is low, it may make
sense to use it.
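
For reference, a rough sketch of the percpu_freelist idea (illustrative and
condensed from the approach described above, not a verbatim copy of the
kernel code): each possible CPU owns a small spinlock-protected stack of
free elements; pop tries the local CPU first and only steals from other
CPUs when the local list is empty, so the common case touches a single
per-CPU lock:

  #include <linux/percpu.h>
  #include <linux/spinlock.h>
  #include <linux/cpumask.h>

  struct pcpu_freelist_node {
          struct pcpu_freelist_node *next;
  };

  struct pcpu_freelist_head {
          struct pcpu_freelist_node *first;
          raw_spinlock_t lock;
  };

  struct pcpu_freelist {
          struct pcpu_freelist_head __percpu *freelist;
  };

  /* pop: try this CPU's list first, then scan the other CPUs once */
  static struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s)
  {
          struct pcpu_freelist_head *head;
          struct pcpu_freelist_node *node;
          unsigned long flags;
          int cpu, orig_cpu;

          orig_cpu = cpu = raw_smp_processor_id();
          for (;;) {
                  head = per_cpu_ptr(s->freelist, cpu);
                  raw_spin_lock_irqsave(&head->lock, flags);
                  node = head->first;
                  if (node)
                          head->first = node->next;
                  raw_spin_unlock_irqrestore(&head->lock, flags);
                  if (node)
                          return node;

                  cpu = cpumask_next(cpu, cpu_possible_mask);
                  if (cpu >= nr_cpu_ids)
                          cpu = 0;
                  if (cpu == orig_cpu)
                          return NULL; /* every per-CPU list is empty */
          }
  }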

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Directory      Last commit                                                               Date
acpi           ACPI / CPPC: remove redundant mbox_send_message() declaration             2016-02-03 01:09:52 +01:00
asm-generic    powerpc fixes for 4.5 #2                                                  2016-02-20 09:22:11 -08:00
clocksource
crypto         Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6  2016-01-22 11:58:43 -08:00
drm            drm/atomic: Allow for holes in connector state, v2.                       2016-02-19 13:24:03 +10:00
dt-bindings    clk: tegra: Add the APB2APE audio clock on Tegra210                       2016-02-02 15:49:29 +01:00
keys
kvm
linux          bpf: pre-allocate hash map elements                                       2016-03-08 15:28:31 -05:00
math-emu
media          [media] vb2: fix nasty vb2_thread regression                              2016-02-04 09:13:46 -02:00
memory
misc
net            ipv6: per netns FIB garbage collection                                    2016-03-08 15:16:51 -05:00
pcmcia
ras
rdma           net: rdma: use __ethtool_get_ksettings                                    2016-02-25 22:06:46 -05:00
rxrpc          rxrpc: Be more selective about the types of received packets we accept    2016-03-04 15:56:06 +00:00
scsi           Initial roundup of 4.5 merge window patches                               2016-01-23 18:45:06 -08:00
soc            ARM: SoC driver updates for v4.5                                          2016-01-20 18:42:30 -08:00
sound          ALSA: hda - Loop interrupt handling until really cleared                  2016-02-26 08:50:31 +01:00
target         target/transport: add flag to indicate CPU Affinity is observed           2016-02-10 23:08:55 -08:00
trace          sunvnet: Add support for perf LDC event tracing                           2016-02-07 14:13:05 -05:00
uapi           bpf: pre-allocate hash map elements                                       2016-03-08 15:28:31 -05:00
video
xen            Merge branch 'for-4.5/drivers' of git://git.kernel.dk/linux-block         2016-01-21 18:19:38 -08:00
Kbuild