The krealloc function checks where the same buffer was reused or a new one
allocated by comparing kernel pointers. Tag-based KASAN changes memory
tag on the krealloc'ed chunk of memory and therefore also changes the
pointer tag of the returned pointer. Therefore we need to perform
comparison on untagged (with tags reset) pointers to check whether it's
the same memory region or not.
Link: http://lkml.kernel.org/r/14f6190d7846186a3506cd66d82446646fe65090.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kasan: add software tag-based mode for arm64", v13.
This patchset adds a new software tag-based mode to KASAN [1]. (Initially
this mode was called KHWASAN, but it got renamed, see the naming rationale
at the end of this section).
The plan is to implement HWASan [2] for the kernel with the incentive,
that it's going to have comparable to KASAN performance, but in the same
time consume much less memory, trading that off for somewhat imprecise bug
detection and being supported only for arm64.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error.
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
to dereference such pointers. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
====== Rationale
On mobile devices generic KASAN's memory usage is significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life
miserable.
[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
====== Technical details
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has it's own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses, that fall into the
same shadow cell, as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that there's a particular type of bugs that tag-based KASAN can
detect compared to generic KASAN: use-after-free after the object has been
allocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass and a kernel module that would print a bug report whenever
two pointers with different tags are being compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type) has been used. Then the kernel has been
booted in QEMU and on an Odroid C2 board and syzkaller has been run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure debug checks to
those two functions that check that the result doesn't change whether we
operate on pointers with or without untagging has been added.
A few other cases that don't look that interesting:
Comparing pointers to achieve unique sorting order of pointee objects
(e.g. sorting locks addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole reserved once during boot).
3. Quaratine (grows gradually until some preset limit; the more the limit,
the more the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can be
replaced by synchronize_rcu(). This commit therefore makes this change.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Kmalloc cache names can get quite long for large object sizes, when the
sizes are expressed in bytes. Use 'k' and 'M' prefixes to make the names
as short as possible e.g. in /proc/slabinfo. This works, as we mostly
use power-of-two sizes, with exceptions only below 1k.
Example: 'kmalloc-4194304' becomes 'kmalloc-4M'
Link: http://lkml.kernel.org/r/20180731090649.16028-7-vbabka@suse.cz
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is reflected also the
MemAvailable meminfo counter and in overcommit decisions. The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but sometimes
are used also for objects that can be reclaimed, which due to varying size
cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag. A
prominent example are dcache external names, which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit f1782c9bc5 ("dcache: account external names as indirectly
reclaimable memory").
To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
gfp flags. They are added to the kmalloc_caches array as a new type.
Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
cache.
This change only applies to SLAB and SLUB, not SLOB. This is fine, since
SLOB's target are tiny system and this patch does add some overhead of
kmem management objects.
Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is how /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently mainain separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow to add kmalloc-reclaimable caches as a third type.
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slub does not call kmalloc_slab() for sizes > KMALLOC_MAX_CACHE_SIZE,
instead it falls back to kmalloc_large().
For slab KMALLOC_MAX_CACHE_SIZE == KMALLOC_MAX_SIZE and it calls
kmalloc_slab() for all allocations relying on NULL return value for
over-sized allocations.
This inconsistency leads to unwanted warnings from kmalloc_slab() for
over-sized allocations for slab. Returning NULL for failed allocations is
the expected behavior.
Make slub and slab code consistent by checking size >
KMALLOC_MAX_CACHE_SIZE in slab before calling kmalloc_slab().
While we are here also fix the check in kmalloc_slab(). We should check
against KMALLOC_MAX_CACHE_SIZE rather than KMALLOC_MAX_SIZE. It all kinda
worked because for slab the constants are the same, and slub always checks
the size against KMALLOC_MAX_CACHE_SIZE before kmalloc_slab(). But if we
get there with size > KMALLOC_MAX_CACHE_SIZE anyhow bad things will
happen. For example, in case of a newly introduced bug in slub code.
Also move the check in kmalloc_slab() from function entry to the size >
192 case. This partially compensates for the additional check in slab
code and makes slub code a bit faster (at least theoretically).
Also drop __GFP_NOWARN in the warning check. This warning means a bug in
slab code itself, user-passed flags have nothing to do with it.
Nothing of this affects slob.
Link: http://lkml.kernel.org/r/20180927171502.226522-1-dvyukov@gmail.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: syzbot+87829a10073277282ad1@syzkaller.appspotmail.com
Reported-by: syzbot+ef4e8fc3a06e9019bb40@syzkaller.appspotmail.com
Reported-by: syzbot+6e438f4036df52cbb863@syzkaller.appspotmail.com
Reported-by: syzbot+8574471d8734457d98aa@syzkaller.appspotmail.com
Reported-by: syzbot+af1504df0807a083dbd9@syzkaller.appspotmail.com
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce new config option, which is used to replace repeating
CONFIG_MEMCG && !CONFIG_SLOB pattern. Next patches add a little more
memcg+kmem related code, so let's keep the defines more clearly.
Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In kernel 4.17 I removed some code from dm-bufio that did slab cache
merging (commit 21bb13276768: "dm bufio: remove code that merges slab
caches") - both slab and slub support merging caches with identical
attributes, so dm-bufio now just calls kmem_cache_create and relies on
implicit merging.
This uncovered a bug in the slub subsystem - if we delete a cache and
immediatelly create another cache with the same attributes, it fails
because of duplicate filename in /sys/kernel/slab/. The slub subsystem
offloads freeing the cache to a workqueue - and if we create the new
cache before the workqueue runs, it complains because of duplicate
filename in sysfs.
This patch fixes the bug by moving the call of kobject_del from
sysfs_slab_remove_workfn to shutdown_cache. kobject_del must be called
while we hold slab_mutex - so that the sysfs entry is deleted before a
cache with the same attributes could be created.
Running device-mapper-test-suite with:
dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/
triggered:
Buffer I/O error on dev dm-0, logical block 1572848, async page read
device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
device-mapper: thin: 253:1: aborting current metadata transaction
sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ #25
Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
Workqueue: dm-thin do_worker [dm_thin_pool]
Call Trace:
dump_stack+0x5a/0x73
sysfs_warn_dup+0x58/0x70
sysfs_create_dir_ns+0x77/0x80
kobject_add_internal+0xba/0x2e0
kobject_init_and_add+0x70/0xb0
sysfs_slab_add+0xb1/0x250
__kmem_cache_create+0x116/0x150
create_cache+0xd9/0x1f0
kmem_cache_create_usercopy+0x1c1/0x250
kmem_cache_create+0x18/0x20
dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
__create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
metadata_operation_failed+0x59/0x100 [dm_thin_pool]
alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
process_cell+0x2a3/0x550 [dm_thin_pool]
do_worker+0x28d/0x8f0 [dm_thin_pool]
process_one_work+0x171/0x370
worker_thread+0x49/0x3f0
kthread+0xf8/0x130
ret_from_fork+0x35/0x40
kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
kmem_cache_create(dm_bufio_buffer-16) failed with error -17
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/*.c files use symbolic and octal styles for permissions.
Using octal and not symbolic permissions is preferred by many as more
readable.
https://lkml.org/lkml/2016/8/2/1945
Prefer the direct use of octal for permissions.
Done using
$ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c
and some typing.
Before: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
44
After: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
86
Miscellanea:
o Whitespace neatening around these conversions.
Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memcg kmem cache creation and deactivation (SLUB only) is
asynchronous. If a root kmem cache is destroyed whose memcg cache is in
the process of creation or deactivation, the kernel may crash.
Example of one such crash:
general protection fault: 0000 [#1] SMP PTI
CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp
...
Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn
RIP: 0010:has_cpu_slab
...
Call Trace:
? on_each_cpu_cond
__kmem_cache_shrink
kmemcg_cache_deact_after_rcu
kmemcg_deactivate_workfn
process_one_work
worker_thread
kthread
ret_from_fork+0x35/0x40
To fix this race, on root kmem cache destruction, mark the cache as
dying and flush the workqueue used for memcg kmem cache creation and
deactivation. SLUB's memcg kmem cache deactivation also includes RCU
callback and thus make sure all previous registered RCU callbacks have
completed as well.
[shakeelb@google.com: handle the RCU callbacks for SLUB deactivation]
Link: http://lkml.kernel.org/r/20180611192951.195727-1-shakeelb@google.com
[shakeelb@google.com: add more documentation, rename fields for readability]
Link: http://lkml.kernel.org/r/20180522201336.196994-1-shakeelb@google.com
[akpm@linux-foundation.org: fix build, per Shakeel]
[shakeelb@google.com: v3. Instead of refcount, flush the workqueue]
Link: http://lkml.kernel.org/r/20180530001204.183758-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180521174116.171846-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
should_failslab() is a convenient function to hook into for directed
error injection into kmalloc(). However, it is only available if a
config flag is set.
The following BCC script, for example, fails kmalloc() calls after a
btrfs umount:
from bcc import BPF
prog = r"""
BPF_HASH(flag);
#include <linux/mm.h>
int kprobe__btrfs_close_devices(void *ctx) {
u64 key = 1;
flag.update(&key, &key);
return 0;
}
int kprobe__should_failslab(struct pt_regs *ctx) {
u64 key = 1;
u64 *res;
res = flag.lookup(&key);
if (res != 0) {
bpf_override_return(ctx, -ENOMEM);
}
return 0;
}
"""
b = BPF(text=prog)
while 1:
b.kprobe_poll()
This patch refactors the should_failslab implementation so that the
function is always available for error injection, independent of flags.
This change would be similar in nature to commit f5490d3ec921 ("block:
Add should_fail_bio() for bpf error injection").
Link: http://lkml.kernel.org/r/20180222020320.6944-1-hmclauchlan@fb.com
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit db265eca77 ("mm/sl[aou]b: Move duping of slab name to
slab_common.c"), the kernel always duplicates the slab cache name when
creating a slab cache, so the test if the slab name is accessible is
useless.
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1803231133310.22626@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I have noticed on debug kernel with SLAB, the size of some non-root
slabs were larger than their corresponding root slabs.
e.g. for radix_tree_node:
$cat /proc/slabinfo | grep radix
name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
radix_tree_node 15052 15075 4096 1 1 ...
$cat /cgroup/memory/temp/memory.kmem.slabinfo | grep radix
name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
radix_tree_node 1581 158 4120 1 2 ...
However for SLUB in debug kernel, the sizes were same. On further
inspection it is found that SLUB always use kmem_cache.object_size to
measure the kmem_cache.size while SLAB use the given kmem_cache.size.
In the debug kernel the slab's size can be larger than its object_size.
Thus in the creation of non-root slab, the SLAB uses the root's size as
base to calculate the non-root slab's size and thus non-root slab's size
can be larger than the root slab's size. For SLUB, the non-root slab's
size is measured based on the root's object_size and thus the size will
remain same for root and non-root slab.
This patch makes slab's object_size the default base to measure the
slab's size.
Link: http://lkml.kernel.org/r/20180313165428.58699-1-shakeelb@google.com
Fixes: 794b1248be ("memcg, slab: separate memcg vs root cache creation paths")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLAB doesn't support 4GB+ of objects per slab, therefore randomization
doesn't need size_t.
Link: http://lkml.kernel.org/r/20180305200730.15812-25-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kmem case sizes are 32-bit, then usecopy region should be too.
Link: http://lkml.kernel.org/r/20180305200730.15812-21-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
size_index_elem() always works with small sizes (kmalloc caches are
32-bit) and returns small indexes.
Link: http://lkml.kernel.org/r/20180305200730.15812-8-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All those small numbers are reverse indexes into kmalloc caches array
and can't be negative.
On x86_64 "unsigned int = fls()" can drop CDQE instruction:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-2 (-2)
Function old new delta
kmalloc_slab 101 99 -2
Link: http://lkml.kernel.org/r/20180305200730.15812-7-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct kmem_cache::size and ::align were always 32-bit.
Out of curiosity I created 4GB kmem_cache, it oopsed with division by 0.
kmem_cache_create(1UL<<32+1) created 1-byte cache as expected.
size_t doesn't work and never did.
Link: http://lkml.kernel.org/r/20180305200730.15812-6-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct kmem_cache::size has always been "int", all those
"size_t size" are fake.
Link: http://lkml.kernel.org/r/20180305200730.15812-5-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KMALLOC_MAX_CACHE_SIZE is 32-bit so is the largest kmalloc cache size.
Christoph said:
:
: Ok SLABs maximum allocation size is limited to 32M (see
: include/linux/slab.h:
:
: #define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
: (MAX_ORDER + PAGE_SHIFT - 1) : 25)
:
: And SLUB/SLOB pass all larger requests to the page allocator anyways.
Link: http://lkml.kernel.org/r/20180305200730.15812-4-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmalloc_size() derives size of kmalloc cache from internal index, which
can't be negative.
Propagate unsignedness a bit.
Link: http://lkml.kernel.org/r/20180305200730.15812-3-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmalloc caches aren't relocated after being set up neither does
"size_index" array.
Link: http://lkml.kernel.org/r/20180226203519.GA6886@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass all
hardened usercopy checks since these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over the
next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJabvleAAoJEIly9N/cbcAmO1kQAJnjVPutnLSbnUteZxtsv7W4
43Cggvokfxr6l08Yh3hUowNxZVKjhF9uwMVgRRg9Nl5WdYCN+vCQbHz+ZdzGJXKq
cGqdKWgexMKX+aBdNDrK7BphUeD46sH7JWR+a/lDV/BgPxBCm9i5ZZCgXbPP89AZ
NpLBji7gz49wMsnm/x135xtNlZ3dG0oKETzi7MiR+NtKtUGvoIszSKy5JdPZ4m8q
9fnXmHqmwM6uQFuzDJPt1o+D1fusTuYnjI7EgyrJRRhQ+BB3qEFZApXnKNDRS9Dm
uB7jtcwefJCjlZVCf2+PWTOEifH2WFZXLPFlC8f44jK6iRW2Nc+wVRisJ3vSNBG1
gaRUe/FSge68eyfQj5OFiwM/2099MNkKdZ0fSOjEBeubQpiFChjgWgcOXa5Bhlrr
C4CIhFV2qg/tOuHDAF+Q5S96oZkaTy5qcEEwhBSW15ySDUaRWFSrtboNt6ZVOhug
d8JJvDCQWoNu1IQozcbv6xW/Rk7miy8c0INZ4q33YUvIZpH862+vgDWfTJ73Zy9H
jR/8eG6t3kFHKS1vWdKZzOX1bEcnd02CGElFnFYUEewKoV7ZeeLsYX7zodyUAKyi
Yp5CImsDbWWTsptBg6h9nt2TseXTxYCt2bbmpJcqzsqSCUwOQNQ4/YpuzLeG0ihc
JgOmUnQNJWCTwUUw5AS1
=tzmJ
-----END PGP SIGNATURE-----
Merge tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardened usercopy whitelisting from Kees Cook:
"Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs.
To further restrict what memory is available for copying, this creates
a way to whitelist specific areas of a given slab cache object for
copying to/from userspace, allowing much finer granularity of access
control.
Slab caches that are never exposed to userspace can declare no
whitelist for their objects, thereby keeping them unavailable to
userspace via dynamic copy operations. (Note, an implicit form of
whitelisting is the use of constant sizes in usercopy operations and
get_user()/put_user(); these bypass all hardened usercopy checks since
these sizes cannot change at runtime.)
This new check is WARN-by-default, so any mistakes can be found over
the next several releases without breaking anyone's system.
The series has roughly the following sections:
- remove %p and improve reporting with offset
- prepare infrastructure and whitelist kmalloc
- update VFS subsystem with whitelists
- update SCSI subsystem with whitelists
- update network subsystem with whitelists
- update process memory with whitelists
- update per-architecture thread_struct with whitelists
- update KVM with whitelists and fix ioctl bug
- mark all other allocations as not whitelisted
- update lkdtm for more sensible test overage"
* tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits)
lkdtm: Update usercopy tests for whitelisting
usercopy: Restrict non-usercopy caches to size 0
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
kvm: whitelist struct kvm_vcpu_arch
arm: Implement thread_struct whitelist for hardened usercopy
arm64: Implement thread_struct whitelist for hardened usercopy
x86: Implement thread_struct whitelist for hardened usercopy
fork: Provide usercopy whitelisting for task_struct
fork: Define usercopy region in thread_stack slab caches
fork: Define usercopy region in mm_struct slab caches
net: Restrict unwhitelisted proto caches to size 0
sctp: Copy struct sctp_sock.autoclose to userspace using put_user()
sctp: Define usercopy region in SCTP proto slab cache
caif: Define usercopy region in caif proto slab cache
ip: Define usercopy region in IP proto slab cache
net: Define usercopy region in struct proto slab cache
scsi: Define usercopy region in scsi_sense_cache slab cache
cifs: Define usercopy region in cifs_request slab cache
vxfs: Define usercopy region in vxfs_inode slab cache
ufs: Define usercopy region in ufs_inode_cache slab cache
...
calculate_alignment() function is only used inside slab_common.c. So
make it static and let the compiler do more optimizations.
After this patch there's a small improvement in text and data size.
$ gcc --version
gcc (GCC) 7.2.1 20171128
Before:
text data bss dec hex filename
9890457 3828702 1212364 14931523 e3d643 vmlinux
After:
text data bss dec hex filename
9890437 3828670 1212364 14931471 e3d60f vmlinux
Also I fixed a style problem reported by checkpatch.
WARNING: Missing a blank line after declarations
#53: FILE: mm/slab_common.c:286:
+ unsigned long ralign = cache_line_size();
+ while (size <= ralign / 2)
Link: http://lkml.kernel.org/r/20171210080132.406-1-bhlee.kernel@gmail.com
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With all known usercopied cache whitelists now defined in the
kernel, switch the default usercopy region of kmem_cache_create()
to size 0. Any new caches with usercopy regions will now need to use
kmem_cache_create_usercopy() instead of kmem_cache_create().
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.
Cc: David Windsor <dave@nullcore.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Mark the kmalloc slab caches as entirely whitelisted. These caches
are frequently used to fulfill kernel allocations that contain data
to be copied to/from userspace. Internal-only uses are also common,
but are scattered in the kernel. For now, mark all the kmalloc caches
as whitelisted.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: merged in moved kmalloc hunks, adjust commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
This introduces CONFIG_HARDENED_USERCOPY_FALLBACK to control the
behavior of hardened usercopy whitelist violations. By default, whitelist
violations will continue to WARN() so that any bad or missing usercopy
whitelists can be discovered without being too disruptive.
If this config is disabled at build time or a system is booted with
"slab_common.usercopy_fallback=0", usercopy whitelists will BUG() instead
of WARN(). This is useful for admins that want to use usercopy whitelists
immediately.
Suggested-by: Matthew Garrett <mjg59@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
Convert all allocations that used a NOTRACK flag to stop using it.
Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON,
etc).
SLAB is bloated temporarily by switching to "unsigned long", but only
temporarily.
Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel may panic when an oom happens without killable process
sometimes it is caused by huge unreclaimable slabs used by kernel.
Although kdump could help debug such problem, however, kdump is not
available on all architectures and it might be malfunction sometime.
And, since kernel already panic it is worthy capturing such information
in dmesg to aid touble shooting.
Print out unreclaimable slab info (used size and total size) which
actual memory usage is not zero (num_objs * size != 0) when
unreclaimable slabs amount is greater than total user memory (LRU
pages).
The output looks like:
Unreclaimable slab info:
Name Used Total
rpc_buffers 31KB 31KB
rpc_tasks 7KB 7KB
ebitmap_node 1964KB 1964KB
avtab_node 5024KB 5024KB
xfs_buf 1402KB 1402KB
xfs_ili 134KB 134KB
xfs_efi_item 115KB 115KB
xfs_efd_item 115KB 115KB
xfs_buf_item 134KB 134KB
xfs_log_item_desc 342KB 342KB
xfs_trans 1412KB 1412KB
xfs_ifork 212KB 212KB
[yang.s@alibaba-inc.com: v11]
Link: http://lkml.kernel.org/r/1507656303-103845-4-git-send-email-yang.s@alibaba-inc.com
Link: http://lkml.kernel.org/r/1507152550-46205-4-git-send-email-yang.s@alibaba-inc.com
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For quick per-memcg indexing, slab caches and list_lru structures
maintain linear arrays of descriptors. As the number of concurrent
memory cgroups in the system goes up, this requires large contiguous
allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
existing slab cache and list_lru, which can easily fail on loaded
systems. E.g.:
mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
Call Trace:
? __alloc_pages_direct_compact+0x4c/0x110
__alloc_pages_nodemask+0xf50/0x1430
alloc_pages_current+0x60/0xc0
kmalloc_order_trace+0x29/0x1b0
__kmalloc+0x1f4/0x320
memcg_update_all_list_lrus+0xca/0x2e0
mem_cgroup_css_alloc+0x612/0x670
cgroup_apply_control_enable+0x19e/0x360
cgroup_mkdir+0x322/0x490
kernfs_iop_mkdir+0x55/0x80
vfs_mkdir+0xd0/0x120
SyS_mkdirat+0x6c/0xe0
SyS_mkdir+0x14/0x20
entry_SYSCALL_64_fastpath+0x18/0xad
Mem-Info:
active_anon:2965 inactive_anon:19 isolated_anon:0
active_file:100270 inactive_file:98846 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:7328 slab_unreclaimable:16402
mapped:771 shmem:52 pagetables:278 bounce:0
free:13718 free_pcp:0 free_cma:0
This output is from an artificial reproducer, but we have repeatedly
observed order-7 failures in production in the Facebook fleet. These
systems become useless as they cannot run more jobs, even though there
is plenty of memory to allocate 128 individual pages.
Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
prove too large for allocating them physically contiguous.
Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some hardened environments want to build kernels with slab_nomerge
already set (so that they do not depend on remembering to set the kernel
command line option). This is desired to reduce the risk of kernel heap
overflows being able to overwrite objects from merged caches and changes
the requirements for cache layout control, increasing the difficulty of
these attacks. By keeping caches unmerged, these kinds of exploits can
usually only damage objects in the same cache (though the risk to
metadata exploitation is unchanged).
Link: http://lkml.kernel.org/r/20170620230911.GA25238@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: David Windsor <dave@nullcore.net>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: David Windsor <dave@nullcore.net>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Daniel Mack <daniel@zonque.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section. Of course, that is not the
case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.
However, there is a phrase for this, namely "type safety". This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
[ paulmck: Add comments mentioning the old name, as requested by Eric
Dumazet, in order to help people familiar with the old name find
the new one. ]
Acked-by: David Rientjes <rientjes@google.com>
Per memcg slab accounting and kasan have a problem with kmem_cache
destruction.
- kmem_cache_create() allocates a kmem_cache, which is used for
allocations from processes running in root (top) memcg.
- Processes running in non root memcg and allocating with either
__GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per memcg
kmem_cache.
- Kasan catches use-after-free by having kfree() and kmem_cache_free()
defer freeing of objects. Objects are placed in a quarantine.
- kmem_cache_destroy() destroys root and non root kmem_caches. It takes
care to drain the quarantine of objects from the root memcg's
kmem_cache, but ignores objects associated with non root memcg. This
causes leaks because quarantined per memcg objects refer to per memcg
kmem cache being destroyed.
To see the problem:
1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,)
2) from non root memcg, allocate and free a few objects from cache
3) dispose of the cache with kmem_cache_destroy() kmem_cache_destroy()
will trigger a "Slab cache still has objects" warning indicating
that the per memcg kmem_cache structure was leaked.
Fix the leak by draining kasan quarantined objects allocated from non
root memcg.
Racing memcg deletion is tricky, but handled. kmem_cache_destroy() =>
shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache()
flushes per memcg quarantined objects, even if that memcg has been
rmdir'd and gone through memcg_deactivate_kmem_caches().
This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is
enabled. So I don't think it's worth patching stable kernels.
Link: http://lkml.kernel.org/r/1482257462-36948-1-git-send-email-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If there's contention on slab_mutex, queueing the per-cache destruction
work item on the system_wq can unnecessarily create and tie up a lot of
kworkers.
Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it
global and use that workqueue for the destruction work items too. While
at it, convert the workqueue from an unbound workqueue to a per-cpu one
with concurrency limited to 1. It's generally preferable to use per-cpu
workqueues and concurrency limit of 1 is safe enough.
This is suggested by Joonsoo Kim.
Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
slub uses synchronize_sched() to deactivate a memcg cache.
synchronize_sched() is an expensive and slow operation and doesn't scale
when a huge number of caches are destroyed back-to-back. While there
used to be a simple batching mechanism, the batching was too restricted
to be helpful.
This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub
can use to schedule sched RCU callback instead of performing
synchronize_sched() synchronously while holding cgroup_mutex. While
this adds online cpus, mems and slab_mutex operations, operating on
these locks back-to-back from the same kworker, which is what's gonna
happen when there are many to deactivate, isn't expensive at all and
this gets rid of the scalability problem completely.
Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__kmem_cache_shrink() is called with %true @deactivate only for memcg
caches. Remove @deactivate from __kmem_cache_shrink() and introduce
__kmemcg_cache_deactivate() instead. Each memcg-supporting allocator
should implement it and it should deactivate and drain the cache.
This is to allow memcg cache deactivation behavior to further deviate
from simple shrinking without messing up __kmem_cache_shrink().
This is pure reorganization and doesn't introduce any observable
behavior changes.
v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.
Link: http://lkml.kernel.org/r/20170117235411.9408-8-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
slab_caches currently lists all caches including root and memcg ones.
This is the only data structure which lists the root caches and
iterating root caches can only be done by walking the list while
skipping over memcg caches. As there can be a huge number of memcg
caches, this can become very expensive.
This also can make /proc/slabinfo behave very badly. seq_file processes
reads in 4k chunks and seeks to the previous Nth position on slab_caches
list to resume after each chunk. With a lot of memcg cache churns on
the list, reading /proc/slabinfo can become very slow and its content
often ends up with duplicate and/or missing entries.
This patch adds a new list slab_root_caches which lists only the root
caches. When memcg is not enabled, it becomes just an alias of
slab_caches. memcg specific list operations are collected into
memcg_[un]link_cache().
Link: http://lkml.kernel.org/r/20170117235411.9408-7-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
While a memcg kmem_cache is listed on its root cache's ->children list,
there is no direct way to iterate all kmem_caches which are assocaited
with a memory cgroup. The only way to iterate them is walking all
caches while filtering out caches which don't match, which would be most
of them.
This makes memcg destruction operations O(N^2) where N is the total
number of slab caches which can be huge. This combined with the
synchronous RCU operations can tie up a CPU and affect the whole machine
for many hours when memory reclaim triggers offlining and destruction of
the stale memcgs.
This patch adds mem_cgroup->kmem_caches list which goes through
memcg_cache_params->kmem_caches_node of all kmem_caches which are
associated with the memcg. All memcg specific iterations, including
stat file access, are updated to use the new list instead.
Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're going to change how memcg caches are iterated. In preparation,
clean up and reorganize memcg_cache_params.
* The shared ->list is replaced by ->children in root and
->children_node in children.
* ->is_root_cache is removed. Instead ->root_cache is moved out of
the child union and now used by both root and children. NULL
indicates root cache. Non-NULL a memcg one.
This patch doesn't cause any observable behavior changes.
Link: http://lkml.kernel.org/r/20170117235411.9408-5-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code. This is one of the patches to address the issue.
SLAB_DESTORY_BY_RCU caches need to flush all RCU operations before
destruction because slab pages are freed through RCU and they need to be
able to dereference the associated kmem_cache. Currently, it's done
synchronously with rcu_barrier(). As rcu_barrier() is expensive
time-wise, slab implements a batching mechanism so that rcu_barrier()
can be done for multiple caches at the same time.
Unfortunately, the rcu_barrier() is in synchronous path which is called
while holding cgroup_mutex and the batching is too limited to be
actually helpful.
This patch updates the cache release path so that the batching is
asynchronous and global. All SLAB_DESTORY_BY_RCU caches are queued
globally and a work item consumes the list. The work item calls
rcu_barrier() only once for all caches that are currently queued.
* release_caches() is removed and shutdown_cache() now either directly
release the cache or schedules a RCU callback to do that. This
makes the cache inaccessible once shutdown_cache() is called and
makes it impossible for shutdown_memcg_caches() to do memcg-specific
cleanups afterwards. Move memcg-specific part into a helper,
unlink_memcg_cache(), and make shutdown_cache() call it directly.
Link: http://lkml.kernel.org/r/20170117235411.9408-4-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Separate out slub sysfs removal and release, and call the former earlier
from __kmem_cache_shutdown(). There's no reason to defer sysfs removal
through RCU and this will later allow us to remove sysfs files way
earlier during memory cgroup offline instead of release.
Link: http://lkml.kernel.org/r/20170117235411.9408-3-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "slab: make memcg slab destruction scalable", v3.
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure. When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.
I've seen machines which end up with hundred thousands of caches and
many millions of kernfs_nodes. The current code is O(N^2) on the total
number of caches and has synchronous rcu_barrier() and
synchronize_sched() in cgroup offline / release path which is executed
while holding cgroup_mutex. Combined, this leads to very expensive and
slow cache destruction operations which can easily keep running for half
a day.
This also messes up /proc/slabinfo along with other cache iterating
operations. seq_file operates on 4k chunks and on each 4k boundary
tries to seek to the last position in the list. With a huge number of
caches on the list, this becomes very slow and very prone to the list
content changing underneath it leading to a lot of missing and/or
duplicate entries.
This patchset addresses the scalability problem.
* Add root and per-memcg lists. Update each user to use the
appropriate list.
* Make rcu_barrier() for SLAB_DESTROY_BY_RCU caches globally batched
and asynchronous.
* For dying empty slub caches, remove the sysfs files after
deactivation so that we don't end up with millions of sysfs files
without any useful information on them.
This patchset contains the following nine patches.
0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
0002-slub-separate-out-sysfs_slab_release-from-sysfs_slab.patch
0003-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
0004-slab-reorganize-memcg_cache_params.patch
0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
0006-slab-implement-slab_root_caches-list.patch
0007-slab-introduce-__kmemcg_cache_deactivate.patch
0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch
0010-slab-use-memcg_kmem_cache_wq-for-slab-destruction-op.patch
0001 reverts an existing optimization to prepare for the following
changes. 0002 is a prep patch. 0003 makes rcu_barrier() in release
path batched and asynchronous. 0004-0006 separate out the lists.
0007-0008 replace synchronize_sched() in slub destruction path with
call_rcu_sched(). 0009 removes sysfs files early for empty dying
caches. 0010 makes destruction work items use a workqueue with limited
concurrency.
This patch (of 10):
Revert 89e364db71 ("slub: move synchronize_sched out of slab_mutex on
shrink").
With kmem cgroup support enabled, kmem_caches can be created and destroyed
frequently and a great number of near empty kmem_caches can accumulate if
there are a lot of transient cgroups and the system is not under memory
pressure. When memory reclaim starts under such conditions, it can lead
to consecutive deactivation and destruction of many kmem_caches, easily
hundreds of thousands on moderately large systems, exposing scalability
issues in the current slab management code. This is one of the patches to
address the issue.
Moving synchronize_sched() out of slab_mutex isn't enough as it's still
inside cgroup_mutex. The whole deactivation / release path will be
updated to avoid all synchronous RCU operations. Revert this insufficient
optimization in preparation to ease future changes.
Link: http://lkml.kernel.org/r/20170117235411.9408-2-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>