License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0                                                11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". The results were:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                          930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                          270
GPL-2.0+ WITH Linux-syscall-note                         169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
LGPL-2.1+ WITH Linux-syscall-note                         15
GPL-1.0+ WITH Linux-syscall-note                          14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
LGPL-2.0+ WITH Linux-syscall-note                          4
LGPL-2.1 WITH Linux-syscall-note                           3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
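The final step described in the log above, emitting the tag in the comment style each file type expects, can be sketched as follows. This is only an illustration and not the script Thomas and Greg used: the helper name `spdx_line_for` is hypothetical, and it handles just `.c` and `.h` files, assuming the kernel convention of a `//` comment in C source files and a `/* */` comment in headers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the SPDX tag line for `path` into `buf`.
 * Returns 0 on success, -1 for an unhandled file type or a short buffer.
 */
static int spdx_line_for(const char *path, const char *license,
			 char *buf, size_t len)
{
	const char *ext;
	int n;

	assert(path && license && buf);

	ext = strrchr(path, '.');
	if (!ext)
		return -1;
	if (strcmp(ext, ".c") == 0)
		n = snprintf(buf, len, "// SPDX-License-Identifier: %s", license);
	else if (strcmp(ext, ".h") == 0)
		n = snprintf(buf, len, "/* SPDX-License-Identifier: %s */", license);
	else
		return -1;

	/* snprintf reports truncation by returning >= len */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For a header such as this one, the helper yields the same first line that the blame view shows for the 2017-11-01 commit.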
2017-11-01 22:07:57 +08:00
/* SPDX-License-Identifier: GPL-2.0 */
2005-04-17 06:20:36 +08:00
/*
2006-12-13 16:34:23 +08:00
 * Written by Mark Hemment, 1996 (markhe@nextd.demon.co.uk).
 *
2008-07-05 00:59:22 +08:00
 * (C) SGI 2006, Christoph Lameter
2006-12-13 16:34:23 +08:00
 *	Cleaned up and restructured to ease the addition of alternative
 *	implementations of SLAB allocators.
2013-09-05 00:35:34 +08:00
 * (C) Linux Foundation 2008-2013
 *	Unified interface for all slab allocators
2005-04-17 06:20:36 +08:00
 */

#ifndef _LINUX_SLAB_H
#define _LINUX_SLAB_H

2006-12-07 12:33:22 +08:00
#include <linux/gfp.h>
2018-05-09 03:52:32 +08:00
#include <linux/overflow.h>
2006-12-07 12:33:22 +08:00
#include <linux/types.h>
2012-12-19 06:22:50 +08:00
#include <linux/workqueue.h>

2005-04-17 06:20:36 +08:00

2006-12-13 16:34:23 +08:00
/*
 * Flags to pass to kmem_cache_create().
2015-04-15 06:44:28 +08:00
 * The ones marked DEBUG are only valid if CONFIG_DEBUG_SLAB is set.
2005-04-17 06:20:36 +08:00
 */
2017-11-16 09:32:18 +08:00
/* DEBUG: Perform (expensive) checks on alloc/free */
2017-11-16 09:32:21 +08:00
#define SLAB_CONSISTENCY_CHECKS ((slab_flags_t __force)0x00000100U)
2017-11-16 09:32:18 +08:00
/* DEBUG: Red zone objs in a cache */
2017-11-16 09:32:21 +08:00
#define SLAB_RED_ZONE ((slab_flags_t __force)0x00000400U)
2017-11-16 09:32:18 +08:00
/* DEBUG: Poison objects */
2017-11-16 09:32:21 +08:00
#define SLAB_POISON ((slab_flags_t __force)0x00000800U)
2017-11-16 09:32:18 +08:00
/* Align objs on cache lines */
2017-11-16 09:32:21 +08:00
#define SLAB_HWCACHE_ALIGN ((slab_flags_t __force)0x00002000U)
2017-11-16 09:32:18 +08:00
/* Use GFP_DMA memory */
2017-11-16 09:32:21 +08:00
#define SLAB_CACHE_DMA ((slab_flags_t __force)0x00004000U)
2017-11-16 09:32:18 +08:00
/* DEBUG: Store the last owner for bug hunting */
2017-11-16 09:32:21 +08:00
#define SLAB_STORE_USER ((slab_flags_t __force)0x00010000U)
2017-11-16 09:32:18 +08:00
/* Panic if kmem_cache_create() fails */
2017-11-16 09:32:21 +08:00
#define SLAB_PANIC ((slab_flags_t __force)0x00040000U)
2008-11-14 02:40:12 +08:00
/*
2017-01-18 18:53:44 +08:00
 * SLAB_TYPESAFE_BY_RCU - **WARNING** READ THIS!
2008-11-14 02:40:12 +08:00
 *
 * This delays freeing the SLAB page by a grace period, it does _NOT_
 * delay object freeing. This means that if you do kmem_cache_free()
 * that memory location is free to be reused at any time. Thus it may
 * be possible to see another object there in the same RCU grace period.
 *
 * This feature only ensures the memory location backing the object
 * stays valid, the trick to using this is relying on an independent
 * object validation pass. Something like:
 *
 *  rcu_read_lock()
 * again:
 *  obj = lockless_lookup(key);
 *  if (obj) {
 *    if (!try_get_ref(obj)) // might fail for free objects
 *      goto again;
 *
 *    if (obj->key != key) { // not the object we expected
 *      put_ref(obj);
 *      goto again;
 *    }
 *  }
 *  rcu_read_unlock();
 *
2013-10-24 09:07:42 +08:00
 * This is useful if we need to approach a kernel structure obliquely,
 * from its address obtained without the usual locking. We can lock
 * the structure to stabilize it and check it's still at the given address,
 * only if we can be sure that the memory has not been meanwhile reused
 * for some other kind of object (which our subsystem's lock might corrupt).
 *
 * rcu_read_lock before reading the address, then rcu_read_unlock after
 * taking the spinlock within the structure expected at that address.
2017-01-18 18:53:44 +08:00
 *
 * Note that SLAB_TYPESAFE_BY_RCU was originally named SLAB_DESTROY_BY_RCU.
2008-11-14 02:40:12 +08:00
 */
2017-11-16 09:32:18 +08:00
/* Defer freeing slabs to RCU */
2017-11-16 09:32:21 +08:00
#define SLAB_TYPESAFE_BY_RCU ((slab_flags_t __force)0x00080000U)
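The "independent object validation pass" from the SLAB_TYPESAFE_BY_RCU comment can be modelled in plain userspace C. The sketch below is a single-threaded stand-in and makes several assumptions: `struct obj`, `try_get_ref()`, `put_ref()` and `get_and_validate()` are hypothetical names, the reference count is a plain `int` rather than an atomic, and there is no real RCU read-side section. Kernel code would take the reference with an atomic helper such as `refcount_inc_not_zero()` while inside `rcu_read_lock()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object living in a type-stable cache. */
struct obj {
	int key;
	int refcount;	/* 0 means the slot is currently free */
};

/* Take a reference unless the object is free (non-atomic stand-in). */
static bool try_get_ref(struct obj *o)
{
	if (o->refcount == 0)
		return false;
	o->refcount++;
	return true;
}

static void put_ref(struct obj *o)
{
	assert(o->refcount > 0);
	o->refcount--;
}

/*
 * Validation pass: returns true with a reference held only if `o` is
 * live and still carries `key`. On false the caller retries the lookup,
 * which corresponds to the "goto again" branches in the comment.
 */
static bool get_and_validate(struct obj *o, int key)
{
	if (!try_get_ref(o))
		return false;		/* slot was freed */
	if (o->key != key) {
		put_ref(o);		/* slot was reused for another object */
		return false;
	}
	return true;
}
```

The key comparison has to come after the reference is taken: only then is the object pinned, so a match cannot be invalidated by a concurrent free and reuse.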
2017-11-16 09:32:18 +08:00
/* Spread some memory over cpuset */
2017-11-16 09:32:21 +08:00
#define SLAB_MEM_SPREAD ((slab_flags_t __force)0x00100000U)
2017-11-16 09:32:18 +08:00
/* Trace allocations and frees */
2017-11-16 09:32:21 +08:00
#define SLAB_TRACE ((slab_flags_t __force)0x00200000U)
2005-04-17 06:20:36 +08:00

2008-04-30 15:54:59 +08:00
/* Flag to prevent checks on free */
#ifdef CONFIG_DEBUG_OBJECTS
2017-11-16 09:32:21 +08:00
# define SLAB_DEBUG_OBJECTS ((slab_flags_t __force)0x00400000U)
2008-04-30 15:54:59 +08:00
#else
2017-11-16 09:32:21 +08:00
# define SLAB_DEBUG_OBJECTS 0
2008-04-30 15:54:59 +08:00
#endif

2017-11-16 09:32:18 +08:00
/* Avoid kmemleak tracing */
2017-11-16 09:32:21 +08:00
#define SLAB_NOLEAKTRACE ((slab_flags_t __force)0x00800000U)
2009-06-11 20:22:40 +08:00

2017-11-16 09:32:18 +08:00
/* Fault injection mark */
2010-02-26 14:36:12 +08:00
#ifdef CONFIG_FAILSLAB
2017-11-16 09:32:21 +08:00
# define SLAB_FAILSLAB ((slab_flags_t __force)0x02000000U)
2010-02-26 14:36:12 +08:00
#else
2017-11-16 09:32:21 +08:00
# define SLAB_FAILSLAB 0
2010-02-26 14:36:12 +08:00
#endif
2017-11-16 09:32:18 +08:00
/* Account to memcg */
2018-08-18 06:47:25 +08:00
#ifdef CONFIG_MEMCG_KMEM
2017-11-16 09:32:21 +08:00
# define SLAB_ACCOUNT ((slab_flags_t __force)0x04000000U)
2016-01-15 07:18:15 +08:00
#else
2017-11-16 09:32:21 +08:00
# define SLAB_ACCOUNT 0
2016-01-15 07:18:15 +08:00
#endif
kmemcheck: add mm functions
With kmemcheck enabled, the slab allocator needs to do this:
1. Tell kmemcheck to allocate the shadow memory which stores the status of
each byte in the allocation proper, e.g. whether it is initialized or
uninitialized.
2. Tell kmemcheck which parts of memory should be marked uninitialized.
There are actually a few more states, such as "not yet allocated" and
"recently freed".
If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
memory that can take page faults because of kmemcheck.
If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
request memory with the __GFP_NOTRACK flag. This does not prevent the page
faults from occurring, however, but marks the object in question as being
initialized so that no warnings will ever be produced for this object.
In addition to (and in contrast to) __GFP_NOTRACK, the
__GFP_NOTRACK_FALSE_POSITIVE flag indicates that the allocation should
not be tracked _because_ it would produce a false positive. Their values
are identical, but need not be so in the future (for example, we could now
enable/disable false positives with a config option).
Parts of this patch were contributed by Pekka Enberg but merged for
atomicity.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
2008-05-31 21:56:17 +08:00

2016-03-26 05:21:59 +08:00
#ifdef CONFIG_KASAN
2017-11-16 09:32:21 +08:00
#define SLAB_KASAN ((slab_flags_t __force)0x08000000U)
2016-03-26 05:21:59 +08:00
#else
2017-11-16 09:32:21 +08:00
#define SLAB_KASAN 0
2016-03-26 05:21:59 +08:00
#endif

2007-10-16 16:25:52 +08:00
/* The following flags affect the page allocator grouping pages by mobility */
2017-11-16 09:32:18 +08:00
/* Objects are reclaimable */
2017-11-16 09:32:21 +08:00
#define SLAB_RECLAIM_ACCOUNT ((slab_flags_t __force)0x00020000U)
2007-10-16 16:25:52 +08:00
#define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */
2007-07-17 19:03:22 +08:00
/*
 * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
 *
 * Dereferencing ZERO_SIZE_PTR will lead to a distinct access fault.
 *
 * ZERO_SIZE_PTR can be passed to kfree though in the same way that NULL can.
 * Both make kfree a no-op.
 */
#define ZERO_SIZE_PTR ((void *)16)

2007-07-21 03:13:20 +08:00
#define ZERO_OR_NULL_PTR(x) ((unsigned long)(x) <= \
2007-07-17 19:03:22 +08:00
				(unsigned long)ZERO_SIZE_PTR)
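A small userspace sketch shows why a single unsigned comparison covers both NULL and ZERO_SIZE_PTR. The two macros are copied from the header so the example builds outside the kernel; `is_usable_ptr()` and `zero_size_demo()` are hypothetical helpers added only for illustration, and the sketch assumes a platform where `unsigned long` is as wide as a pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same definitions as the header, copied so this builds in userspace. */
#define ZERO_SIZE_PTR ((void *)16)
#define ZERO_OR_NULL_PTR(x) ((unsigned long)(x) <= \
				(unsigned long)ZERO_SIZE_PTR)

/* Hypothetical helper: true if `p` points at a real object. */
static bool is_usable_ptr(const void *p)
{
	return !ZERO_OR_NULL_PTR(p);
}

/* Usage: NULL (0) and ZERO_SIZE_PTR (16) both fall at or below 16. */
static void zero_size_demo(void)
{
	int x = 0;

	assert(!is_usable_ptr(NULL));
	assert(!is_usable_ptr(ZERO_SIZE_PTR));
	assert(is_usable_ptr(&x));
}
```

Because real allocations never live in the first few bytes of the address space, the `<=` test is enough and no separate NULL check is needed.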

2015-02-14 06:39:42 +08:00
#include <linux/kasan.h>
2012-06-13 23:24:57 +08:00

2012-12-19 06:22:34 +08:00
struct mem_cgroup;
2006-12-13 16:34:23 +08:00
/*
 * struct kmem_cache related prototypes
 */
void __init kmem_cache_init(void);
2015-11-06 10:44:59 +08:00
bool slab_is_available(void);
2005-04-17 06:20:36 +08:00

2017-12-01 05:04:32 +08:00
extern bool usercopy_fallback;

2018-04-06 07:20:37 +08:00
struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
			unsigned int align, slab_flags_t flags,
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
			void (*ctor)(void *));
struct kmem_cache *kmem_cache_create_usercopy(const char *name,
2018-04-06 07:20:37 +08:00
			unsigned int size, unsigned int align,
			slab_flags_t flags,
2018-04-06 07:21:31 +08:00
			unsigned int useroffset, unsigned int usersize,
2017-06-11 10:50:28 +08:00
			void (*ctor)(void *));
2006-12-13 16:34:23 +08:00
void kmem_cache_destroy(struct kmem_cache *);
int kmem_cache_shrink(struct kmem_cache *);
2015-02-13 06:59:32 +08:00

void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
void memcg_deactivate_kmem_caches(struct mem_cgroup *);
void memcg_destroy_kmem_caches(struct mem_cgroup *);
2006-12-13 16:34:23 +08:00

2007-05-07 05:49:57 +08:00
/*
 * Please use this macro to create slab caches. Simply specify the
 * name of the structure and maybe some flags that are listed above.
 *
 * The alignment of the struct determines object alignment. If you
 * f.e. add ____cacheline_aligned_in_smp to the struct declaration
 * then the objects will be properly aligned in SMP configurations.
 */
2017-06-11 10:50:28 +08:00
#define KMEM_CACHE(__struct, __flags) \
		kmem_cache_create(#__struct, sizeof(struct __struct), \
			__alignof__(struct __struct), (__flags), NULL)

/*
 * To whitelist a single field for copying to/from userspace, use this
 * macro instead of KMEM_CACHE() above.
 */
#define KMEM_CACHE_USERCOPY(__struct, __flags, __field) \
		kmem_cache_create_usercopy(#__struct, \
			sizeof(struct __struct), \
			__alignof__(struct __struct), (__flags), \
			offsetof(struct __struct, __field), \
			sizeof_field(struct __struct, __field), NULL)
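To make the `offsetof()`/`sizeof_field()` pair in KMEM_CACHE_USERCOPY concrete, here is a userspace sketch of the region it would describe and of a bounds check against that region. Everything in it is an assumption for illustration: `struct session`, its fields, and `copy_allowed()` are hypothetical, `sizeof_field()` is re-defined locally, and the check is a simplified model of the idea from the usercopy commit log, not the kernel's hardened usercopy code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the kernel's sizeof_field() helper. */
#define sizeof_field(TYPE, MEMBER) sizeof(((TYPE *)0)->MEMBER)

/* Hypothetical object: only `name` is meant to reach userspace. */
struct session {
	void *private_state;
	char name[32];
	unsigned long secret;
};

/* Region that KMEM_CACHE_USERCOPY(session, flags, name) would describe. */
static const size_t useroffset = offsetof(struct session, name);
static const size_t usersize = sizeof_field(struct session, name);

/* True if copying `n` bytes at object offset `off` stays in the region. */
static bool copy_allowed(size_t off, size_t n)
{
	/* The whitelisted region must itself lie inside the object. */
	assert(usersize <= sizeof(struct session) - useroffset);

	return off >= useroffset &&
	       off - useroffset <= usersize &&
	       n <= usersize - (off - useroffset);
}
```

The comparison is written with subtractions on the right-hand side so that a large `n` cannot overflow `off + n` and slip past the check.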
2007-05-07 05:49:57 +08:00

2013-01-11 03:00:53 +08:00
/*
 * Common kmalloc functions provided by all allocators
 */
void * __must_check __krealloc(const void *, size_t, gfp_t);
void * __must_check krealloc(const void *, size_t, gfp_t);
void kfree(const void *);
void kzfree(const void *);
size_t ksize(const void *);

2016-06-08 02:05:33 +08:00
#ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
2018-01-11 06:48:22 +08:00
void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
			bool to_user);
2016-06-08 02:05:33 +08:00
#else
2018-01-11 06:48:22 +08:00
static inline void __check_heap_object(const void *ptr, unsigned long n,
				       struct page *page, bool to_user) { }
2016-06-08 02:05:33 +08:00
#endif

2013-02-06 00:36:47 +08:00
/*
 * Some archs want to perform DMA into kmalloc caches and need a guaranteed
 * alignment larger than the alignment of a 64-bit integer.
 * Setting ARCH_KMALLOC_MINALIGN in arch headers allows that.
 */
#if defined(ARCH_DMA_MINALIGN) && ARCH_DMA_MINALIGN > 8
#define ARCH_KMALLOC_MINALIGN ARCH_DMA_MINALIGN
#define KMALLOC_MIN_SIZE ARCH_DMA_MINALIGN
#define KMALLOC_SHIFT_LOW ilog2(ARCH_DMA_MINALIGN)
#else
#define ARCH_KMALLOC_MINALIGN __alignof__(unsigned long long)
#endif
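As a worked instance of the ARCH_DMA_MINALIGN block, the sketch below plugs in a made-up value of 128 and derives the minimum kmalloc size and its shift. The value 128 is an assumption (real values come from arch headers), and `ilog2_u()` is a hypothetical userspace stand-in for the kernel's `ilog2()`.

```c
#include <assert.h>

/* Hypothetical arch value; real ones come from arch headers. */
#define ARCH_DMA_MINALIGN 128

/* Userspace stand-in for the kernel's ilog2(): floor(log2(v)). */
static unsigned int ilog2_u(unsigned long v)
{
	unsigned int r = 0;

	assert(v != 0);
	while (v >>= 1)
		r++;
	return r;
}

/* Same shape as the header: only applies when the arch asks for > 8. */
#if defined(ARCH_DMA_MINALIGN) && ARCH_DMA_MINALIGN > 8
#define KMALLOC_MIN_SIZE ARCH_DMA_MINALIGN
#define KMALLOC_SHIFT_LOW ilog2_u(ARCH_DMA_MINALIGN)
#endif
```

With these definitions the smallest cache holds 128-byte objects, so the shift and the size stay consistent: shifting 1 left by KMALLOC_SHIFT_LOW gives KMALLOC_MIN_SIZE back.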
slab.h: sprinkle __assume_aligned attributes
The various allocators return aligned memory. Telling the compiler that
allows it to generate better code in many cases, for example when the
return value is immediately passed to memset().
Some code does become larger, but at least we win twice as much as we lose:
$ scripts/bloat-o-meter /tmp/vmlinux vmlinux
add/remove: 0/0 grow/shrink: 13/52 up/down: 995/-2140 (-1145)
An example of the different (and smaller) code can be seen in mm_alloc(). Before:
: 48 8d 78 08 lea 0x8(%rax),%rdi
: 48 89 c1 mov %rax,%rcx
: 48 89 c2 mov %rax,%rdx
: 48 c7 00 00 00 00 00 movq $0x0,(%rax)
: 48 c7 80 48 03 00 00 movq $0x0,0x348(%rax)
: 00 00 00 00
: 31 c0 xor %eax,%eax
: 48 83 e7 f8 and $0xfffffffffffffff8,%rdi
: 48 29 f9 sub %rdi,%rcx
: 81 c1 50 03 00 00 add $0x350,%ecx
: c1 e9 03 shr $0x3,%ecx
: f3 48 ab rep stos %rax,%es:(%rdi)
After:
: 48 89 c2 mov %rax,%rdx
: b9 6a 00 00 00 mov $0x6a,%ecx
: 31 c0 xor %eax,%eax
: 48 89 d7 mov %rdx,%rdi
: f3 48 ab rep stos %rax,%es:(%rdi)
So gcc's strategy is to do two possibly (but not really, of course)
unaligned stores to the first and last word, then do an aligned rep stos
covering the middle part with a little overlap. Maybe arches which do not
allow unaligned stores gain even more.
I don't know if gcc can actually make use of alignments greater than 8 for
anything, so one could probably drop the __assume_xyz_alignment macros and
just use __assume_aligned(8).
The increases in code size are mostly caused by gcc deciding to
opencode strlen() using the check-four-bytes-at-a-time trick when it
knows the buffer is sufficiently aligned (one function grew by 200
bytes). Now it turns out that many of these strlen() calls showing up
were in fact redundant, and they're gone from -next. Applying the two
patches to next-20151001 bloat-o-meter instead says
add/remove: 0/0 grow/shrink: 6/52 up/down: 244/-2140 (-1896)
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 07:56:48 +08:00

/*
 * Setting ARCH_SLAB_MINALIGN in arch headers allows a different alignment.
 * Intended for arches that get misalignment faults even for 64 bit integer
 * aligned buffers.
 */
#ifndef ARCH_SLAB_MINALIGN
#define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
#endif

/*
 * kmalloc and friends return ARCH_KMALLOC_MINALIGN aligned
 * pointers. kmem_cache_alloc and friends return ARCH_SLAB_MINALIGN
 * aligned pointers.
 */
#define __assume_kmalloc_alignment __assume_aligned(ARCH_KMALLOC_MINALIGN)
#define __assume_slab_alignment __assume_aligned(ARCH_SLAB_MINALIGN)
#define __assume_page_alignment __assume_aligned(PAGE_SIZE)
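
As a rough userspace illustration of what these annotations do, the sketch below marks an allocator wrapper with the GCC/Clang assume_aligned function attribute, so the compiler may treat the returned pointer as 8-byte aligned when it expands the following memset(). DEMO_MINALIGN, demo_assume_aligned(), demo_alloc() and demo_zalloc() are hypothetical names, and the 8-byte figure assumes a common ABI where malloc() returns at least that alignment.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ARCH_KMALLOC_MINALIGN on a common 64-bit ABI. */
#define DEMO_MINALIGN 8

/* GCC/Clang attribute: the returned pointer is at least 'a'-byte aligned. */
#define demo_assume_aligned(a) __attribute__((__assume_aligned__(a)))

static void *demo_alloc(size_t size) demo_assume_aligned(DEMO_MINALIGN);

static void *demo_alloc(size_t size)
{
	/* Assumption: malloc() returns memory aligned to at least 8 bytes. */
	return malloc(size);
}

/* Allocate and zero; the compiler may exploit the alignment promise here. */
static void *demo_zalloc(size_t size)
{
	void *p = demo_alloc(size);

	if (p)
		memset(p, 0, size);
	return p;
}
```

The attribute is only a promise to the compiler. If the allocator ever returned a less aligned pointer, the generated code could misbehave, which is why the kernel ties each macro to the alignment the corresponding allocator actually guarantees.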

/*
 * Kmalloc array related definitions
 */

#ifdef CONFIG_SLAB
/*
 * The largest kmalloc size supported by the SLAB allocators is
 * 32 megabytes (2^25) or the maximum allocatable page order if that is
 * less than 32 MB.
 *
 * WARNING: It's not easy to increase this value since the allocators have
 * to do various tricks to work around compiler limitations in order to
 * ensure proper constant folding.
 */
#define KMALLOC_SHIFT_HIGH	((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
				(MAX_ORDER + PAGE_SHIFT - 1) : 25)
#define KMALLOC_SHIFT_MAX	KMALLOC_SHIFT_HIGH
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW	5
#endif
#endif

#ifdef CONFIG_SLUB
/*
 * SLUB directly allocates requests fitting into an order-1 page
 * (PAGE_SIZE*2). Larger requests are passed to the page allocator.
 */
#define KMALLOC_SHIFT_HIGH	(PAGE_SHIFT + 1)
#define KMALLOC_SHIFT_MAX	(MAX_ORDER + PAGE_SHIFT - 1)
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW	3
#endif
#endif

#ifdef CONFIG_SLOB
/*
 * SLOB passes all requests larger than one page to the page allocator.
 * No kmalloc array is necessary since objects of different sizes can
 * be allocated from the same page.
 */
#define KMALLOC_SHIFT_HIGH	PAGE_SHIFT
#define KMALLOC_SHIFT_MAX	(MAX_ORDER + PAGE_SHIFT - 1)
#ifndef KMALLOC_SHIFT_LOW
#define KMALLOC_SHIFT_LOW	3
#endif
#endif

/* Maximum allocatable size */
#define KMALLOC_MAX_SIZE	(1UL << KMALLOC_SHIFT_MAX)
/* Maximum size for which we actually use a slab cache */
#define KMALLOC_MAX_CACHE_SIZE	(1UL << KMALLOC_SHIFT_HIGH)
/* Maximum order allocatable via the slab allocator */
#define KMALLOC_MAX_ORDER	(KMALLOC_SHIFT_MAX - PAGE_SHIFT)
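
To make the three sets of limits concrete, here is a small worked sketch that evaluates the same expressions in userspace. It assumes 4 KiB pages (PAGE_SHIFT of 12) and a MAX_ORDER of 11; both values are illustrative assumptions, and every demo_* name is hypothetical.

```c
/* Assumed values for illustration only: 4 KiB pages, MAX_ORDER of 11. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_MAX_ORDER  11

enum demo_allocator { DEMO_SLAB, DEMO_SLUB, DEMO_SLOB };

/* Mirrors KMALLOC_SHIFT_HIGH: largest size class backed by a slab cache. */
static int demo_shift_high(enum demo_allocator a)
{
	int page_limit = DEMO_MAX_ORDER + DEMO_PAGE_SHIFT - 1;

	switch (a) {
	case DEMO_SLAB:
		return page_limit <= 25 ? page_limit : 25;
	case DEMO_SLUB:
		return DEMO_PAGE_SHIFT + 1;
	default:
		return DEMO_PAGE_SHIFT;
	}
}

/* Mirrors KMALLOC_SHIFT_MAX: largest request kmalloc() can satisfy at all. */
static int demo_shift_max(enum demo_allocator a)
{
	if (a == DEMO_SLAB)
		return demo_shift_high(a);
	return DEMO_MAX_ORDER + DEMO_PAGE_SHIFT - 1;
}

static unsigned long demo_max_size(enum demo_allocator a)
{
	return 1UL << demo_shift_max(a);
}

static unsigned long demo_max_cache_size(enum demo_allocator a)
{
	return 1UL << demo_shift_high(a);
}
```

With these assumed values, all three allocators cap kmalloc() at 4 MiB, while the largest size served from a slab cache is 4 MiB for SLAB, 8 KiB for SLUB and 4 KiB for SLOB.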

/*
 * Kmalloc subsystem.
 */
#ifndef KMALLOC_MIN_SIZE
#define KMALLOC_MIN_SIZE (1 << KMALLOC_SHIFT_LOW)
#endif

/*
 * This restriction comes from the byte sized index implementation.
 * Page size is normally 2^12 bytes and, in this case, if we want to use
 * a byte sized index which can represent 2^8 entries, the size of the object
 * should be greater than or equal to 2^12 / 2^8 = 2^4 = 16.
 * If the minimum size of kmalloc is less than 16, we use it as the minimum
 * object size and give up using the byte sized index.
 */
#define SLAB_OBJ_MIN_SIZE      (KMALLOC_MIN_SIZE < 16 ? \
                               (KMALLOC_MIN_SIZE) : 16)
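
The arithmetic in the comment above can be checked with a short sketch. Both helper names are hypothetical, and the first helper assumes page_shift is at least 8.

```c
/*
 * Smallest object size at which a one-byte index (2^8 values) can still
 * address every object in a single page: page_size / 2^8.
 * Assumes page_shift >= 8.
 */
static unsigned int demo_byte_index_min_obj(unsigned int page_shift)
{
	return 1u << (page_shift - 8);
}

/* Mirrors SLAB_OBJ_MIN_SIZE: the kmalloc minimum, capped at 16 bytes. */
static unsigned int demo_slab_obj_min_size(unsigned int kmalloc_min_size)
{
	return kmalloc_min_size < 16 ? kmalloc_min_size : 16;
}
```

For a 4 KiB page the first helper yields 16, which is the constant used in the macro. A kmalloc minimum of 8 gives a minimum object size of 8, in which case a page can hold more objects than a byte index can address.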
mm, slab/slub: introduce kmalloc-reclaimable caches
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is also reflected in the
MemAvailable meminfo counter and in overcommit decisions. The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but sometimes
are used also for objects that can be reclaimed, which due to varying size
cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag. A
prominent example are dcache external names, which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit f1782c9bc547 ("dcache: account external names as indirectly
reclaimable memory").
To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
gfp flags. They are added to the kmalloc_caches array as a new type.
Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
cache.
This change only applies to SLAB and SLUB, not SLOB. This is fine, since
SLOB targets tiny systems and this patch does add some overhead of
kmem management objects.
Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:38 +08:00

/*
 * Whenever changing this, take care that kmalloc_type() and
 * create_kmalloc_caches() still work as intended.
 */
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is how /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently maintain a separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow adding kmalloc-reclaimable caches as a third type.
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:34 +08:00
enum kmalloc_cache_type {
	KMALLOC_NORMAL = 0,
	KMALLOC_RECLAIM,
#ifdef CONFIG_ZONE_DMA
	KMALLOC_DMA,
#endif
	NR_KMALLOC_TYPES
};
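
The commit messages above describe replacing a branch on __GFP_DMA with an index into a two-dimensional cache array, with DMA taking precedence when both DMA and reclaimable flags are set. The sketch below shows one possible branch-free encoding of that rule in userspace. The flag values, the demo_* names and the integer array standing in for cache pointers are all hypothetical, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical flag bits; the kernel's real __GFP_* values differ. */
#define DEMO_GFP_DMA         0x01u
#define DEMO_GFP_RECLAIMABLE 0x10u

enum demo_cache_type {
	DEMO_NORMAL = 0,
	DEMO_RECLAIM,
	DEMO_DMA,
	DEMO_NR_TYPES
};

#define DEMO_NR_SIZES 14

/* Stand-in for the per-type, per-size cache table. */
static int demo_caches[DEMO_NR_TYPES][DEMO_NR_SIZES];

/* Map allocation flags to a cache type without branching. */
static enum demo_cache_type demo_type(unsigned int flags)
{
	int is_dma = !!(flags & DEMO_GFP_DMA);
	int is_reclaimable = !!(flags & DEMO_GFP_RECLAIMABLE);

	/* If both flags are set, DMA wins and reclaimable is ignored. */
	return (enum demo_cache_type)(is_dma * DEMO_DMA +
				      (is_reclaimable & !is_dma) * DEMO_RECLAIM);
}

/* Select a slot using the type as the outer array dimension. */
static int *demo_lookup(unsigned int flags, size_t size_index)
{
	return &demo_caches[demo_type(flags)][size_index];
}
```

Because the type is computed arithmetically, the hot path needs no conditional jump to choose between normal, reclaimable and DMA caches.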

#ifndef CONFIG_SLOB
extern struct kmem_cache *
kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];

static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
{
	int is_dma = 0;
	int type_dma = 0;
	int is_reclaimable;

#ifdef CONFIG_ZONE_DMA
|
|
|
is_dma = !!(flags & __GFP_DMA);
|
mm, slab/slub: introduce kmalloc-reclaimable caches
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is also reflected in the
MemAvailable meminfo counter and in overcommit decisions. The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but sometimes
are used also for objects that can be reclaimed, which due to varying size
cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag. A
prominent example is dcache external names, which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit f1782c9bc547 ("dcache: account external names as indirectly
reclaimable memory").
To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
gfp flags. They are added to the kmalloc_caches array as a new type.
Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
cache.
This change only applies to SLAB and SLUB, not SLOB. This is fine, since
SLOB targets tiny systems and this patch does add some overhead of
kmem management objects.
Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 06:05:38 +08:00
|
|
|
type_dma = is_dma * KMALLOC_DMA;
|
2013-01-11 03:12:17 +08:00
|
|
|
#endif
|
|
|
|
|
2018-10-27 06:05:38 +08:00
	is_reclaimable = !!(flags & __GFP_RECLAIMABLE);

	/*
	 * If an allocation is both __GFP_DMA and __GFP_RECLAIMABLE, return
	 * KMALLOC_DMA and effectively ignore __GFP_RECLAIMABLE
	 */
	return type_dma + (is_reclaimable & !is_dma) * KMALLOC_RECLAIM;
2018-10-27 06:05:34 +08:00
|
|
|
}
|
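The branch-free type selection above can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel's code: the enum values, the flag bit values and every `sketch_`/`SKETCH_` name are assumptions made up for illustration. It only relies on the normal type being 0, so that the two products add up to a valid outer index of the two-dimensional cache array.

```c
#include <assert.h>

/* Assumed ordering for this sketch; only SKETCH_NORMAL == 0 matters. */
enum sketch_cache_type {
	SKETCH_NORMAL = 0,
	SKETCH_RECLAIM,
	SKETCH_DMA,
	SKETCH_NR_TYPES
};

/* Hypothetical flag bits, chosen only for this sketch. */
#define SKETCH_GFP_DMA         0x01u
#define SKETCH_GFP_RECLAIMABLE 0x10u

/* Pick the cache type with arithmetic instead of conditional branches. */
static unsigned int sketch_kmalloc_type(unsigned int flags)
{
	unsigned int is_dma = !!(flags & SKETCH_GFP_DMA);
	unsigned int is_reclaimable = !!(flags & SKETCH_GFP_RECLAIMABLE);
	/* DMA wins when both flags are set: the reclaimable term is zeroed. */
	unsigned int type = is_dma * SKETCH_DMA +
			    (is_reclaimable & !is_dma) * SKETCH_RECLAIM;

	/* The result must be usable as the outer array index. */
	assert(type < SKETCH_NR_TYPES);
	return type;
}
```

With these assumed values, a request carrying both flags maps to the DMA type, matching the comment in the function above.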
|
|
|
|
2013-01-11 03:14:19 +08:00
|
|
|
/*
|
|
|
|
* Figure out which kmalloc slab an allocation of a certain size
|
|
|
|
* belongs to.
|
|
|
|
* 0 = zero alloc
|
|
|
|
* 1 = 65 .. 96 bytes
|
2015-06-25 07:55:59 +08:00
|
|
|
* 2 = 129 .. 192 bytes
|
|
|
|
* n = 2^(n-1)+1 .. 2^n
|
2013-01-11 03:14:19 +08:00
|
|
|
*/
|
slab: make kmalloc_index() return "unsigned int"
kmalloc_index() returns an index into an array of kmalloc kmem caches,
and therefore should be unsigned.
Space savings with SLUB on trimmed down .config:
add/remove: 0/1 grow/shrink: 6/56 up/down: 85/-557 (-472)
Function old new delta
calculate_sizes 924 983 +59
on_freelist 589 604 +15
init_cache_random_seq 122 127 +5
ext4_mb_init 1206 1210 +4
slab_pad_check.part 270 271 +1
cpu_partial_store 112 113 +1
usersize_show 28 27 -1
...
new_slab 1871 1837 -34
slab_order 204 - -204
This patch starts a series converting SLUB (mostly) to "unsigned int".
1) Most integers in the code are in fact unsigned entities: array
indexes, lengths, buffer sizes, allocation orders. It is therefore
better to use unsigned variables.
2) Some integers in the code are either "size_t" or "unsigned long" for
no reason.
size_t usually comes from people trying to maintain type correctness
and figuring out that the "sizeof" operator returns size_t, or that
memset/memcpy take size_t, so everything passed to them should too.
However the number of 4GB+ objects in the kernel is very small. Most,
if not all, dynamically allocated objects with kmalloc() or
kmem_cache_create() aren't actually big. Maintaining wide types
doesn't do anything.
64-bit ops are bigger than 32-bit on our beloved x86_64,
so try to not use 64-bit where it isn't necessary
(read: everywhere where integers are integers not pointers)
3) in case of SLAB allocators, there are additional limitations
*) page->inuse, page->objects are only 16-/15-bit,
*) cache size was always 32-bit
*) slab orders are small, order 20 is needed to go 64-bit on x86_64
(PAGE_SIZE << order)
Basically everything is 32-bit except kmalloc(1ULL<<32) which gets
shortcut through page allocator.
Christoph said:
:
: That changes with large base page size on power and ARM64 f.e. but then
: we do not want to encourage larger allocations through slab anyways.
Link: http://lkml.kernel.org/r/20180305200730.15812-2-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-06 07:20:22 +08:00
|
|
|
static __always_inline unsigned int kmalloc_index(size_t size)
|
2013-01-11 03:14:19 +08:00
{
	if (!size)
		return 0;

	if (size <= KMALLOC_MIN_SIZE)
		return KMALLOC_SHIFT_LOW;

	if (KMALLOC_MIN_SIZE <= 32 && size > 64 && size <= 96)
		return 1;
	if (KMALLOC_MIN_SIZE <= 64 && size > 128 && size <= 192)
		return 2;
	if (size <= 8) return 3;
	if (size <= 16) return 4;
	if (size <= 32) return 5;
	if (size <= 64) return 6;
	if (size <= 128) return 7;
	if (size <= 256) return 8;
	if (size <= 512) return 9;
	if (size <= 1024) return 10;
	if (size <= 2 * 1024) return 11;
	if (size <= 4 * 1024) return 12;
	if (size <= 8 * 1024) return 13;
	if (size <= 16 * 1024) return 14;
	if (size <= 32 * 1024) return 15;
	if (size <= 64 * 1024) return 16;
	if (size <= 128 * 1024) return 17;
	if (size <= 256 * 1024) return 18;
	if (size <= 512 * 1024) return 19;
	if (size <= 1024 * 1024) return 20;
	if (size <= 2 * 1024 * 1024) return 21;
	if (size <= 4 * 1024 * 1024) return 22;
	if (size <= 8 * 1024 * 1024) return 23;
	if (size <= 16 * 1024 * 1024) return 24;
	if (size <= 32 * 1024 * 1024) return 25;
	if (size <= 64 * 1024 * 1024) return 26;
	BUG();

	/* Will never be reached. Needed because the compiler may complain */
	return -1;
}
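The mapping encoded by the chain of comparisons above can be sketched in userspace with a loop. This is only an illustration under stated assumptions: it fixes the minimum cache size at 8 bytes (so the lowest shift is 3), replaces BUG() with assert(), and the `sketch_`/`SKETCH_` names are hypothetical. Indexes 1 and 2 are the two non-power-of-two caches (96 and 192 bytes); every other index n stands for a cache of 2^n bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for this sketch only: 8-byte minimum, 64 MiB maximum. */
#define SKETCH_MIN_SIZE   8u
#define SKETCH_SHIFT_LOW  3u
#define SKETCH_SHIFT_MAX 26u

/* Map an allocation size to the index of the smallest cache that fits it. */
static unsigned int sketch_kmalloc_index(size_t size)
{
	unsigned int n;

	if (size == 0)
		return 0;
	if (size <= SKETCH_MIN_SIZE)
		return SKETCH_SHIFT_LOW;
	if (size > 64 && size <= 96)
		return 1;	/* 96-byte cache */
	if (size > 128 && size <= 192)
		return 2;	/* 192-byte cache */

	/* Otherwise the smallest power of two that is >= size. */
	for (n = SKETCH_SHIFT_LOW; n <= SKETCH_SHIFT_MAX; n++)
		if (size <= ((size_t)1 << n))
			return n;

	assert(!"size exceeds the largest cache in this sketch");
	return 0;
}

/* Inverse direction: object size served by a given cache index. */
static size_t sketch_index_to_size(unsigned int index)
{
	if (index == 0)
		return 0;
	if (index == 1)
		return 96;
	if (index == 2)
		return 192;
	return (size_t)1 << index;
}
```

For example, a 100-byte request skips the 96-byte cache and lands in index 7, the 128-byte cache.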
2013-06-15 03:55:13 +08:00
|
|
|
#endif /* !CONFIG_SLOB */
|
2013-01-11 03:14:19 +08:00
|
|
|
|
2016-05-20 08:10:55 +08:00
|
|
|
void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __malloc;
|
|
|
|
void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags) __assume_slab_alignment __malloc;
|
2015-02-13 06:59:32 +08:00
|
|
|
void kmem_cache_free(struct kmem_cache *, void *);
|
2013-09-05 00:35:34 +08:00
|
|
|
|
2015-09-05 06:45:34 +08:00
|
|
|
/*
|
2016-03-16 05:54:03 +08:00
|
|
|
* Bulk allocation and freeing operations. These are accelerated in an
|
2015-09-05 06:45:34 +08:00
|
|
|
* allocator specific way to avoid taking locks repeatedly or building
|
|
|
|
* metadata structures unnecessarily.
|
|
|
|
*
|
|
|
|
* Note that interrupts must be enabled when calling these functions.
|
|
|
|
*/
|
|
|
|
void kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
|
2015-11-21 07:57:58 +08:00
|
|
|
int kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
|
2015-09-05 06:45:34 +08:00
|
|
|
|
2016-03-16 05:54:00 +08:00
/*
 * Caller must not use kfree_bulk() on memory not originally allocated
 * by kmalloc(), because the SLOB allocator cannot handle this.
 */
static __always_inline void kfree_bulk(size_t size, void **p)
{
	kmem_cache_free_bulk(NULL, size, p);
}

2013-09-05 00:35:34 +08:00
|
|
|
#ifdef CONFIG_NUMA
|
2016-05-20 08:10:55 +08:00
|
|
|
void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment __malloc;
|
|
|
|
void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node) __assume_slab_alignment __malloc;
|
2013-09-05 00:35:34 +08:00
#else
static __always_inline void *__kmalloc_node(size_t size, gfp_t flags, int node)
{
	return __kmalloc(size, flags);
}

static __always_inline void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node)
{
	return kmem_cache_alloc(s, flags);
}
#endif

#ifdef CONFIG_TRACING
2016-05-20 08:10:55 +08:00
|
|
|
extern void *kmem_cache_alloc_trace(struct kmem_cache *, gfp_t, size_t) __assume_slab_alignment __malloc;
|
2013-09-05 00:35:34 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
extern void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
|
|
|
|
gfp_t gfpflags,
|
2016-05-20 08:10:55 +08:00
|
|
|
int node, size_t size) __assume_slab_alignment __malloc;
|
2013-09-05 00:35:34 +08:00
|
|
|
#else
|
|
|
|
static __always_inline void *
|
|
|
|
kmem_cache_alloc_node_trace(struct kmem_cache *s,
|
|
|
|
gfp_t gfpflags,
|
|
|
|
int node, size_t size)
|
|
|
|
{
|
|
|
|
return kmem_cache_alloc_trace(s, gfpflags, size);
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_NUMA */
|
|
|
|
|
|
|
|
#else /* CONFIG_TRACING */
|
|
|
|
static __always_inline void *kmem_cache_alloc_trace(struct kmem_cache *s,
|
|
|
|
gfp_t flags, size_t size)
|
|
|
|
{
|
2015-02-14 06:39:42 +08:00
|
|
|
void *ret = kmem_cache_alloc(s, flags);
|
|
|
|
|
2016-03-26 05:22:02 +08:00
|
|
|
kasan_kmalloc(s, ret, size, flags);
|
2015-02-14 06:39:42 +08:00
|
|
|
return ret;
|
2013-09-05 00:35:34 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static __always_inline void *
|
|
|
|
kmem_cache_alloc_node_trace(struct kmem_cache *s,
|
|
|
|
gfp_t gfpflags,
|
|
|
|
int node, size_t size)
|
|
|
|
{
|
2015-02-14 06:39:42 +08:00
|
|
|
void *ret = kmem_cache_alloc_node(s, gfpflags, node);
|
|
|
|
|
2016-03-26 05:22:02 +08:00
|
|
|
kasan_kmalloc(s, ret, size, gfpflags);
|
2015-02-14 06:39:42 +08:00
|
|
|
return ret;
|
2013-09-05 00:35:34 +08:00
|
|
|
}
|
|
|
|
#endif /* CONFIG_TRACING */
|
|
|
|
|
2016-05-20 08:10:55 +08:00
|
|
|
extern void *kmalloc_order(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment __malloc;
|
2013-09-05 00:35:34 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_TRACING
|
2016-05-20 08:10:55 +08:00
|
|
|
extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment __malloc;
|
2013-09-05 00:35:34 +08:00
#else
static __always_inline void *
kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
{
	return kmalloc_order(size, flags, order);
}
2013-01-11 03:14:19 +08:00
|
|
|
#endif
|
|
|
|
|
2013-09-05 00:35:34 +08:00
static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
{
	unsigned int order = get_order(size);
	return kmalloc_order_trace(size, flags, order);
}
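kmalloc_large() above turns a byte count into a page order before handing the request on. As a rough userspace sketch of that conversion, assuming 4 KiB pages (a page shift of 12), the order is the smallest n such that 2^n pages cover the request. The `sketch_`/`SKETCH_` names are hypothetical and this is not the kernel's get_order() implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12u

/* Smallest order such that (page size << order) >= size. */
static unsigned int sketch_get_order(size_t size)
{
	unsigned int order = 0;
	size_t pages;

	assert(size > 0);
	/* Index of the last page touched by the request, counted from 0. */
	pages = (size - 1) >> SKETCH_PAGE_SHIFT;
	/* Count the bits needed to represent that index. */
	while (pages) {
		order++;
		pages >>= 1;
	}
	return order;
}

/* Bytes actually covered by an allocation of the given order. */
static size_t sketch_order_bytes(unsigned int order)
{
	return ((size_t)1 << SKETCH_PAGE_SHIFT) << order;
}
```

Because orders are powers of two, a three-page request is rounded up to order 2, i.e. four pages.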
|
|
|
|
|
|
|
/**
|
|
|
|
* kmalloc - allocate memory
|
|
|
|
* @size: how many bytes of memory are required.
|
2013-11-23 10:14:38 +08:00
|
|
|
* @flags: the type of memory to allocate.
|
2013-09-05 00:35:34 +08:00
|
|
|
*
|
|
|
|
* kmalloc is the normal method of allocating memory
|
|
|
|
* for objects smaller than page size in the kernel.
|
2013-11-23 10:14:38 +08:00
 *
 * The @flags argument may be one of:
 *
 * %GFP_USER - Allocate memory on behalf of user. May sleep.
 *
 * %GFP_KERNEL - Allocate normal kernel ram. May sleep.
 *
 * %GFP_ATOMIC - Allocation will not sleep. May use emergency pools.
 *   For example, use this inside interrupt handlers.
 *
 * %GFP_HIGHUSER - Allocate pages from high memory.
 *
 * %GFP_NOIO - Do not do any I/O at all while trying to get memory.
 *
 * %GFP_NOFS - Do not make any fs calls while trying to get memory.
 *
 * %GFP_NOWAIT - Allocation will not sleep.
 *
2014-03-11 06:49:43 +08:00
|
|
|
* %__GFP_THISNODE - Allocate node-local memory only.
|
2013-11-23 10:14:38 +08:00
 *
 * %GFP_DMA - Allocation suitable for DMA.
 *   Should only be used for kmalloc() caches. Otherwise, use a
 *   slab created with SLAB_DMA.
 *
 * Also it is possible to set different flags by OR'ing
 * in one or more of the following additional @flags:
 *
 * %__GFP_HIGH - This allocation has high priority and may use emergency pools.
 *
 * %__GFP_NOFAIL - Indicate that this allocation is in no way allowed to fail
 *   (think twice before using).
 *
 * %__GFP_NORETRY - If memory is not immediately available,
 *   then give up at once.
 *
 * %__GFP_NOWARN - If allocation fails, don't issue any warnings.
 *
2017-07-13 05:36:45 +08:00
|
|
|
* %__GFP_RETRY_MAYFAIL - Try really hard to succeed the allocation but fail
|
|
|
|
* eventually.
|
2013-11-23 10:14:38 +08:00
|
|
|
*
|
|
|
|
* There are other flags available as well, but these are not intended
|
|
|
|
* for general use, and so are not documented here. For a full list of
|
|
|
|
* potential flags, always refer to linux/gfp.h.
|
2013-09-05 00:35:34 +08:00
|
|
|
*/
|
|
|
|
static __always_inline void *kmalloc(size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
if (__builtin_constant_p(size)) {
|
2018-10-27 06:05:34 +08:00
|
|
|
#ifndef CONFIG_SLOB
|
|
|
|
unsigned int index;
|
|
|
|
#endif
|
2013-09-05 00:35:34 +08:00
|
|
|
if (size > KMALLOC_MAX_CACHE_SIZE)
|
|
|
|
return kmalloc_large(size, flags);
|
|
|
|
#ifndef CONFIG_SLOB
|
2018-10-27 06:05:34 +08:00
|
|
|
index = kmalloc_index(size);
|
2013-09-05 00:35:34 +08:00
|
|
|
|
2018-10-27 06:05:34 +08:00
|
|
|
if (!index)
|
|
|
|
return ZERO_SIZE_PTR;
|
2013-09-05 00:35:34 +08:00
|
|
|
|
2018-10-27 06:05:34 +08:00
|
|
|
return kmem_cache_alloc_trace(
|
|
|
|
kmalloc_caches[kmalloc_type(flags)][index],
|
|
|
|
flags, size);
|
2013-09-05 00:35:34 +08:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
return __kmalloc(size, flags);
|
|
|
|
}
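The lookup `kmalloc_caches[kmalloc_type(flags)][index]` above can be illustrated with a minimal userspace sketch. Everything prefixed `demo_` or `DEMO_` is hypothetical and exists only for this example: the flag bit value, the table size, and the cache names are invented, and only two cache types are modelled. The point is that the outer index is computed arithmetically from the flags, so the caller needs no `if (flags & DMA)` branch to choose between two separate arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
enum demo_cache_type { DEMO_NORMAL = 0, DEMO_DMA = 1, DEMO_NR_TYPES };

#define DEMO_GFP_DMA  0x01u  /* invented flag bit */
#define DEMO_NR_SIZES 4      /* invented number of size classes */

struct demo_cache {
	const char *name;
};

/* One table, outer dimension is the cache "type", inner is the size index. */
static struct demo_cache demo_caches[DEMO_NR_TYPES][DEMO_NR_SIZES] = {
	[DEMO_NORMAL] = { {"kmalloc-8"}, {"kmalloc-16"}, {"kmalloc-32"}, {"kmalloc-64"} },
	[DEMO_DMA]    = { {"dma-kmalloc-8"}, {"dma-kmalloc-16"},
			  {"dma-kmalloc-32"}, {"dma-kmalloc-64"} },
};

/* The !! below yields exactly 0 or 1, so the enum values must match. */
static_assert(DEMO_NORMAL == 0 && DEMO_DMA == 1,
	      "type index must equal the result of !!(flags & DEMO_GFP_DMA)");

/* Map allocation flags to the outer array index without an if/else. */
static unsigned int demo_kmalloc_type(unsigned int flags)
{
	return !!(flags & DEMO_GFP_DMA);
}

/* Select a cache by type and size index with a single indexed load. */
static const struct demo_cache *demo_lookup(unsigned int flags, unsigned int index)
{
	return &demo_caches[demo_kmalloc_type(flags)][index];
}
```

Adding a third type in this sketch would mean one more enum value and one more row in the table, while `demo_lookup()` stays unchanged.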
|
|
|
|
|
2013-01-11 03:14:19 +08:00
|
|
|
/*
|
|
|
|
* Determine size used for the nth kmalloc cache.
|
|
|
|
* return size or 0 if a kmalloc cache for that
|
|
|
|
* size does not exist
|
|
|
|
*/
|
2018-04-06 07:20:26 +08:00
|
|
|
static __always_inline unsigned int kmalloc_size(unsigned int n)
|
2013-01-11 03:14:19 +08:00
|
|
|
{
|
2013-06-15 03:55:13 +08:00
|
|
|
#ifndef CONFIG_SLOB
|
2013-01-11 03:14:19 +08:00
|
|
|
if (n > 2)
|
2018-04-06 07:20:26 +08:00
|
|
|
return 1U << n;
|
2013-01-11 03:14:19 +08:00
|
|
|
|
|
|
|
if (n == 1 && KMALLOC_MIN_SIZE <= 32)
|
|
|
|
return 96;
|
|
|
|
|
|
|
|
if (n == 2 && KMALLOC_MIN_SIZE <= 64)
|
|
|
|
return 192;
|
2013-06-15 03:55:13 +08:00
|
|
|
#endif
|
2013-01-11 03:14:19 +08:00
|
|
|
return 0;
|
|
|
|
}
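As a worked example of the index-to-size mapping in kmalloc_size() above, the following userspace sketch mirrors the same three cases for the non-SLOB configuration. The names `demo_kmalloc_size` and `DEMO_KMALLOC_MIN_SIZE` are hypothetical, and the minimum size of 8 is an assumption chosen so that both odd-sized caches (96 and 192 bytes) exist.

```c
#include <assert.h>

/* Assumed minimum cache size for this sketch only. */
#define DEMO_KMALLOC_MIN_SIZE 8

static_assert(DEMO_KMALLOC_MIN_SIZE <= 32,
	      "this sketch assumes both the 96- and 192-byte caches exist");

/*
 * Return the object size of the nth cache, or 0 if no cache has that index.
 * Indexes 1 and 2 are the two non-power-of-two caches; every index above 2
 * is a power of two.
 */
static unsigned int demo_kmalloc_size(unsigned int n)
{
	if (n > 2)
		return 1U << n;
	if (n == 1 && DEMO_KMALLOC_MIN_SIZE <= 32)
		return 96;
	if (n == 2 && DEMO_KMALLOC_MIN_SIZE <= 64)
		return 192;
	return 0;
}
```

With these assumptions, index 0 maps to no cache, index 3 to 8 bytes, index 6 to 64 bytes, and index 22 to 4194304 bytes, which is the object size of the 4M caches.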
|
|
|
|
|
2013-09-05 00:35:34 +08:00
|
|
|
static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
|
|
|
|
{
|
|
|
|
#ifndef CONFIG_SLOB
|
|
|
|
if (__builtin_constant_p(size) &&
|
2018-10-27 06:05:34 +08:00
|
|
|
size <= KMALLOC_MAX_CACHE_SIZE) {
|
slab: make kmalloc_index() return "unsigned int"
kmalloc_index() returns an index into an array of kmalloc kmem caches,
therefore it should be unsigned.
Space savings with SLUB on trimmed down .config:
add/remove: 0/1 grow/shrink: 6/56 up/down: 85/-557 (-472)
Function old new delta
calculate_sizes 924 983 +59
on_freelist 589 604 +15
init_cache_random_seq 122 127 +5
ext4_mb_init 1206 1210 +4
slab_pad_check.part 270 271 +1
cpu_partial_store 112 113 +1
usersize_show 28 27 -1
...
new_slab 1871 1837 -34
slab_order 204 - -204
This patch starts a series converting SLUB (mostly) to "unsigned int".
1) Most integers in the code are in fact unsigned entities: array
indexes, lengths, buffer sizes, allocation orders. It is therefore
better to use unsigned variables.
2) Some integers in the code are either "size_t" or "unsigned long" for
no reason.
size_t usually comes from people trying to maintain type correctness
and figuring out that "sizeof" operator returns size_t or
memset/memcpy takes size_t so should everything passed to it.
However the number of 4GB+ objects in the kernel is very small. Most,
if not all, dynamically allocated objects with kmalloc() or
kmem_cache_create() aren't actually big. Maintaining wide types
doesn't do anything.
64-bit ops are bigger than 32-bit on our beloved x86_64,
so try to not use 64-bit where it isn't necessary
(read: everywhere where integers are integers not pointers)
3) in case of SLAB allocators, there are additional limitations
*) page->inuse, page->objects are only 16-/15-bit,
*) cache size was always 32-bit
*) slab orders are small, order 20 is needed to go 64-bit on x86_64
(PAGE_SIZE << order)
Basically everything is 32-bit except kmalloc(1ULL<<32) which gets
shortcut through page allocator.
Christoph said:
:
: That changes with large base page size on power and ARM64 f.e. but then
: we do not want to encourage larger allocations through slab anyways.
Link: http://lkml.kernel.org/r/20180305200730.15812-2-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-06 07:20:22 +08:00
|
|
|
unsigned int i = kmalloc_index(size);
|
2013-09-05 00:35:34 +08:00
|
|
|
|
|
|
|
if (!i)
|
|
|
|
return ZERO_SIZE_PTR;
|
|
|
|
|
2018-10-27 06:05:34 +08:00
|
|
|
return kmem_cache_alloc_node_trace(
|
|
|
|
kmalloc_caches[kmalloc_type(flags)][i],
|
2013-09-05 00:35:34 +08:00
|
|
|
flags, node, size);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
return __kmalloc_node(size, flags, node);
|
|
|
|
}
|
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
struct memcg_cache_array {
|
|
|
|
struct rcu_head rcu;
|
|
|
|
struct kmem_cache *entries[0];
|
|
|
|
};
|
|
|
|
|
2012-12-19 06:22:27 +08:00
|
|
|
/*
|
|
|
|
* This is the main placeholder for memcg-related information in kmem caches.
|
|
|
|
* Both the root cache and the child caches will have it. For the root cache,
|
|
|
|
* this will hold a dynamically allocated array large enough to hold
|
2014-01-24 07:53:06 +08:00
|
|
|
* information about the currently limited memcgs in the system. To allow the
|
|
|
|
* array to be accessed without taking any locks, on relocation we free the old
|
|
|
|
* version only after a grace period.
|
2012-12-19 06:22:27 +08:00
|
|
|
*
|
2017-02-23 07:41:17 +08:00
|
|
|
* Root and child caches hold different metadata.
|
2012-12-19 06:22:27 +08:00
|
|
|
*
|
2017-02-23 07:41:17 +08:00
|
|
|
* @root_cache: Common to root and child caches. NULL for root, pointer to
|
|
|
|
* the root cache for children.
|
2015-02-13 06:59:23 +08:00
|
|
|
*
|
2017-02-23 07:41:17 +08:00
|
|
|
* The following fields are specific to root caches.
|
|
|
|
*
|
|
|
|
* @memcg_caches: kmemcg ID indexed table of child caches. This table is
|
|
|
|
* used to index child caches during allocation and cleared
|
|
|
|
* early during shutdown.
|
|
|
|
*
|
2017-02-23 07:41:24 +08:00
|
|
|
* @root_caches_node: List node for slab_root_caches list.
|
|
|
|
*
|
2017-02-23 07:41:17 +08:00
|
|
|
* @children: List of all child caches. While the child caches are also
|
|
|
|
* reachable through @memcg_caches, a child cache remains on
|
|
|
|
* this list until it is actually destroyed.
|
|
|
|
*
|
|
|
|
* The following fields are specific to child caches.
|
|
|
|
*
|
|
|
|
* @memcg: Pointer to the memcg this cache belongs to.
|
|
|
|
*
|
|
|
|
* @children_node: List node for @root_cache->children list.
|
2017-02-23 07:41:21 +08:00
|
|
|
*
|
|
|
|
* @kmem_caches_node: List node for @memcg->kmem_caches list.
|
2012-12-19 06:22:27 +08:00
|
|
|
*/
|
|
|
|
struct memcg_cache_params {
|
2017-02-23 07:41:17 +08:00
|
|
|
struct kmem_cache *root_cache;
|
2012-12-19 06:22:27 +08:00
|
|
|
union {
|
2017-02-23 07:41:17 +08:00
|
|
|
struct {
|
|
|
|
struct memcg_cache_array __rcu *memcg_caches;
|
2017-02-23 07:41:24 +08:00
|
|
|
struct list_head __root_caches_node;
|
2017-02-23 07:41:17 +08:00
|
|
|
struct list_head children;
|
2018-06-15 06:26:27 +08:00
|
|
|
bool dying;
|
2017-02-23 07:41:17 +08:00
|
|
|
};
|
2012-12-19 06:22:34 +08:00
|
|
|
struct {
|
|
|
|
struct mem_cgroup *memcg;
|
2017-02-23 07:41:17 +08:00
|
|
|
struct list_head children_node;
|
2017-02-23 07:41:21 +08:00
|
|
|
struct list_head kmem_caches_node;
|
2017-02-23 07:41:30 +08:00
|
|
|
|
|
|
|
void (*deact_fn)(struct kmem_cache *);
|
|
|
|
union {
|
|
|
|
struct rcu_head deact_rcu_head;
|
|
|
|
struct work_struct deact_work;
|
|
|
|
};
|
2012-12-19 06:22:34 +08:00
|
|
|
};
|
2012-12-19 06:22:27 +08:00
|
|
|
};
|
|
|
|
};
|
|
|
|
|
2012-12-19 06:22:34 +08:00
|
|
|
int memcg_update_all_caches(int num_memcgs);
|
|
|
|
|
2013-06-26 00:16:55 +08:00
|
|
|
/**
|
|
|
|
* kmalloc_array - allocate memory for an array.
|
|
|
|
* @n: number of elements.
|
|
|
|
* @size: element size.
|
|
|
|
* @flags: the type of memory to allocate (see kmalloc).
|
2006-06-23 17:03:48 +08:00
|
|
|
*/
|
2012-03-06 07:14:41 +08:00
|
|
|
static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2018-05-09 03:52:32 +08:00
|
|
|
size_t bytes;
|
|
|
|
|
|
|
|
if (unlikely(check_mul_overflow(n, size, &bytes)))
|
slob: initial NUMA support
This adds preliminary NUMA support to SLOB, primarily aimed at systems with
small nodes (tested all the way down to a 128kB SRAM block), whether
asymmetric or otherwise.
We follow the same conventions as SLAB/SLUB, preferring current node
placement for new pages, or with explicit placement, if a node has been
specified. Presently on UP NUMA this has the side-effect of preferring
node#0 allocations (since numa_node_id() == 0, though this could be
reworked if we could hand off a pfn to determine node placement), so
single-CPU NUMA systems will want to place smaller nodes further out in
terms of node id. Once a page has been bound to a node (via explicit node
id typing), we only do block allocations from partial free pages that have
a matching node id in the page flags.
The current implementation does have some scalability problems, in that all
partial free pages are tracked in the global freelist (with contention due
to the single spinlock). However, these are things that are being reworked
for SMP scalability first, while things like per-node freelists can easily
be built on top of this sort of functionality once it's been added.
More background can be found in:
http://marc.info/?l=linux-mm&m=118117916022379&w=2
http://marc.info/?l=linux-mm&m=118170446306199&w=2
http://marc.info/?l=linux-mm&m=118187859420048&w=2
and subsequent threads.
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 14:38:22 +08:00
|
|
|
return NULL;
|
2016-07-27 06:22:08 +08:00
|
|
|
if (__builtin_constant_p(n) && __builtin_constant_p(size))
|
2018-05-09 03:52:32 +08:00
|
|
|
return kmalloc(bytes, flags);
|
|
|
|
return __kmalloc(bytes, flags);
|
2012-03-06 07:14:41 +08:00
|
|
|
}
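The overflow guard in kmalloc_array() above can be tried out in userspace with the sketch below. It is an analogue under stated assumptions, not the kernel code: `demo_alloc_array` and `demo_selfcheck` are hypothetical names, malloc() stands in for kmalloc(), and the GCC/Clang builtin `__builtin_mul_overflow` is used in place of check_mul_overflow().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Userspace analogue of kmalloc_array(): compute n * size once, and refuse
 * the request if the product does not fit in size_t.
 */
static void *demo_alloc_array(size_t n, size_t size)
{
	size_t bytes;

	/* Returns true when the mathematical product overflows 'bytes'. */
	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;

	return malloc(bytes);
}

/* Usage: exercise the overflowing and the non-overflowing path. */
static void demo_selfcheck(void)
{
	void *p;

	/* SIZE_MAX * 2 cannot be represented, so no allocation is attempted. */
	assert(demo_alloc_array(SIZE_MAX, 2) == NULL);

	/* 4 * 8 = 32 bytes is an ordinary request. */
	p = demo_alloc_array(4, 8);
	assert(p != NULL);
	free(p);
}
```

Without the check, a wrapped product could yield a small allocation that the caller then treats as holding n elements, which is the class of bug the guard is meant to rule out.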
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kcalloc - allocate memory for an array. The memory is set to zero.
|
|
|
|
* @n: number of elements.
|
|
|
|
* @size: element size.
|
|
|
|
* @flags: the type of memory to allocate (see kmalloc).
|
|
|
|
*/
|
|
|
|
static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
return kmalloc_array(n, size, flags | __GFP_ZERO);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2006-10-04 17:15:25 +08:00
|
|
|
/*
|
|
|
|
* kmalloc_track_caller is a special version of kmalloc that records the
|
|
|
|
* calling function of the routine calling it for slab leak tracking instead
|
|
|
|
* of just the calling function (confusing, eh?).
|
|
|
|
* It's useful when the call to kmalloc comes from a widely-used standard
|
|
|
|
* allocator where we care about the real place the memory allocation
|
|
|
|
* request comes from.
|
|
|
|
*/
|
2008-08-20 01:43:25 +08:00
|
|
|
extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
|
2006-10-04 17:15:25 +08:00
|
|
|
#define kmalloc_track_caller(size, flags) \
|
2008-08-20 01:43:25 +08:00
|
|
|
__kmalloc_track_caller(size, flags, _RET_IP_)
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2017-11-16 09:32:29 +08:00
|
|
|
static inline void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
|
|
|
|
int node)
|
|
|
|
{
|
2018-05-09 03:52:32 +08:00
|
|
|
size_t bytes;
|
|
|
|
|
|
|
|
if (unlikely(check_mul_overflow(n, size, &bytes)))
|
2017-11-16 09:32:29 +08:00
|
|
|
return NULL;
|
|
|
|
if (__builtin_constant_p(n) && __builtin_constant_p(size))
|
2018-05-09 03:52:32 +08:00
|
|
|
return kmalloc_node(bytes, flags, node);
|
|
|
|
return __kmalloc_node(bytes, flags, node);
|
2017-11-16 09:32:29 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
|
|
|
|
{
|
|
|
|
return kmalloc_array_node(n, size, flags | __GFP_ZERO, node);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2005-05-01 23:58:38 +08:00
|
|
|
#ifdef CONFIG_NUMA
|
2008-08-20 01:43:25 +08:00
|
|
|
extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
|
2006-12-07 12:32:30 +08:00
|
|
|
#define kmalloc_node_track_caller(size, flags, node) \
|
|
|
|
__kmalloc_node_track_caller(size, flags, node, \
|
2008-08-20 01:43:25 +08:00
|
|
|
_RET_IP_)
|
2006-12-13 16:34:23 +08:00
|
|
|
|
2006-12-07 12:32:30 +08:00
|
|
|
#else /* CONFIG_NUMA */
|
|
|
|
|
|
|
|
#define kmalloc_node_track_caller(size, flags, node) \
|
|
|
|
kmalloc_track_caller(size, flags)
|
2005-05-01 23:58:38 +08:00
|
|
|
|
2008-11-25 22:08:19 +08:00
|
|
|
#endif /* CONFIG_NUMA */
|
2006-01-08 17:01:45 +08:00
|
|
|
|
2007-07-17 19:03:29 +08:00
|
|
|
/*
|
|
|
|
* Shortcuts
|
|
|
|
*/
|
|
|
|
static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
|
|
|
|
{
|
|
|
|
return kmem_cache_alloc(k, flags | __GFP_ZERO);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kzalloc - allocate memory. The memory is set to zero.
|
|
|
|
* @size: how many bytes of memory are required.
|
|
|
|
* @flags: the type of memory to allocate (see kmalloc).
|
|
|
|
*/
|
|
|
|
static inline void *kzalloc(size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
return kmalloc(size, flags | __GFP_ZERO);
|
|
|
|
}
|
|
|
|
|
2008-06-06 13:47:00 +08:00
|
|
|
/**
|
|
|
|
* kzalloc_node - allocate zeroed memory from a particular memory node.
|
|
|
|
* @size: how many bytes of memory are required.
|
|
|
|
* @flags: the type of memory to allocate (see kmalloc).
|
|
|
|
* @node: memory node from which to allocate
|
|
|
|
*/
|
|
|
|
static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
|
|
|
|
{
|
|
|
|
return kmalloc_node(size, flags | __GFP_ZERO, node);
|
|
|
|
}
|
|
|
|
|
2014-10-10 06:26:00 +08:00
|
|
|
unsigned int kmem_cache_size(struct kmem_cache *s);
|
2009-06-12 19:03:06 +08:00
|
|
|
void __init kmem_cache_init_late(void);
|
|
|
|
|
2016-08-23 20:53:19 +08:00
|
|
|
#if defined(CONFIG_SMP) && defined(CONFIG_SLAB)
|
|
|
|
int slab_prepare_cpu(unsigned int cpu);
|
|
|
|
int slab_dead_cpu(unsigned int cpu);
|
|
|
|
#else
|
|
|
|
#define slab_prepare_cpu NULL
|
|
|
|
#define slab_dead_cpu NULL
|
|
|
|
#endif
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif /* _LINUX_SLAB_H */
|