License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2012-07-07 04:25:10 +08:00
|
|
|
/*
|
|
|
|
* Slab allocator functions that are independent of the allocator strategy
|
|
|
|
*
|
|
|
|
* (C) 2012 Christoph Lameter <cl@linux.com>
|
|
|
|
*/
|
|
|
|
#include <linux/slab.h>
|
|
|
|
|
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/poison.h>
|
|
|
|
#include <linux/interrupt.h>
|
|
|
|
#include <linux/memory.h>
|
2018-04-06 07:20:11 +08:00
|
|
|
#include <linux/cache.h>
|
2012-07-07 04:25:10 +08:00
|
|
|
#include <linux/compiler.h>
|
|
|
|
#include <linux/module.h>
|
2012-07-07 04:25:13 +08:00
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/uaccess.h>
|
2012-10-19 22:20:25 +08:00
|
|
|
#include <linux/seq_file.h>
|
|
|
|
#include <linux/proc_fs.h>
|
2012-07-07 04:25:10 +08:00
|
|
|
#include <asm/cacheflush.h>
|
|
|
|
#include <asm/tlbflush.h>
|
|
|
|
#include <asm/page.h>
|
2012-12-19 06:22:34 +08:00
|
|
|
#include <linux/memcontrol.h>
|
2014-08-07 07:04:44 +08:00
|
|
|
|
|
|
|
#define CREATE_TRACE_POINTS
|
2013-09-05 00:35:34 +08:00
|
|
|
#include <trace/events/kmem.h>
|
2012-07-07 04:25:10 +08:00
|
|
|
|
2012-07-07 04:25:11 +08:00
|
|
|
#include "slab.h"
|
|
|
|
|
|
|
|
enum slab_state slab_state;
|
2012-07-07 04:25:12 +08:00
|
|
|
LIST_HEAD(slab_caches);
|
|
|
|
DEFINE_MUTEX(slab_mutex);
|
2012-09-05 08:20:33 +08:00
|
|
|
struct kmem_cache *kmem_cache;
|
2012-07-07 04:25:11 +08:00
|
|
|
|
2017-12-01 05:04:32 +08:00
|
|
|
#ifdef CONFIG_HARDENED_USERCOPY
|
|
|
|
bool usercopy_fallback __ro_after_init =
|
|
|
|
IS_ENABLED(CONFIG_HARDENED_USERCOPY_FALLBACK);
|
|
|
|
module_param(usercopy_fallback, bool, 0400);
|
|
|
|
MODULE_PARM_DESC(usercopy_fallback,
|
|
|
|
"WARN instead of reject usercopy whitelist violations");
|
|
|
|
#endif
|
|
|
|
|
2017-02-23 07:41:14 +08:00
|
|
|
static LIST_HEAD(slab_caches_to_rcu_destroy);
|
|
|
|
static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work);
|
|
|
|
static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
|
|
|
|
slab_caches_to_rcu_destroy_workfn);
|
|
|
|
|
2014-10-10 06:26:22 +08:00
|
|
|
/*
|
|
|
|
* Set of flags that will prevent slab merging
|
|
|
|
*/
|
|
|
|
#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
|
2017-01-18 18:53:44 +08:00
|
|
|
SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
|
2016-03-26 05:21:59 +08:00
|
|
|
SLAB_FAILSLAB | SLAB_KASAN)
|
2014-10-10 06:26:22 +08:00
|
|
|
|
2016-01-15 07:18:15 +08:00
|
|
|
#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
|
2017-11-16 09:35:54 +08:00
|
|
|
SLAB_ACCOUNT)
|
2014-10-10 06:26:22 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Merge control. If this is set then no merging of slab caches will occur.
|
|
|
|
*/
|
2017-07-07 06:36:40 +08:00
|
|
|
static bool slab_nomerge = !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT);
|
2014-10-10 06:26:22 +08:00
|
|
|
|
|
|
|
static int __init setup_slab_nomerge(char *str)
|
|
|
|
{
|
2017-07-07 06:36:40 +08:00
|
|
|
slab_nomerge = true;
|
2014-10-10 06:26:22 +08:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_SLUB
|
|
|
|
__setup_param("slub_nomerge", slub_nomerge, setup_slab_nomerge, 0);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
__setup("slab_nomerge", setup_slab_nomerge);
|
|
|
|
|
2014-10-10 06:26:00 +08:00
|
|
|
/*
|
|
|
|
* Determine the size of a slab object
|
|
|
|
*/
|
|
|
|
unsigned int kmem_cache_size(struct kmem_cache *s)
|
|
|
|
{
|
|
|
|
return s->object_size;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmem_cache_size);
|
|
|
|
|
2012-08-16 15:09:46 +08:00
|
|
|
#ifdef CONFIG_DEBUG_VM
|
2018-04-06 07:20:37 +08:00
|
|
|
static int kmem_cache_sanity_check(const char *name, unsigned int size)
|
2012-07-07 04:25:10 +08:00
|
|
|
{
|
|
|
|
if (!name || in_interrupt() || size < sizeof(void *) ||
|
|
|
|
size > KMALLOC_MAX_SIZE) {
|
2012-08-16 15:09:46 +08:00
|
|
|
pr_err("kmem_cache_create(%s) integrity check failed\n", name);
|
|
|
|
return -EINVAL;
|
2012-07-07 04:25:10 +08:00
|
|
|
}
|
2012-08-16 15:12:18 +08:00
|
|
|
|
2012-07-07 04:25:13 +08:00
|
|
|
WARN_ON(strchr(name, ' ')); /* It confuses parsers */
|
2012-08-16 15:09:46 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#else
|
2018-04-06 07:20:37 +08:00
|
|
|
static inline int kmem_cache_sanity_check(const char *name, unsigned int size)
|
2012-08-16 15:09:46 +08:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2012-07-07 04:25:13 +08:00
|
|
|
#endif
|
|
|
|
|
2015-09-05 06:45:34 +08:00
|
|
|
void __kmem_cache_free_bulk(struct kmem_cache *s, size_t nr, void **p)
|
|
|
|
{
|
|
|
|
size_t i;
|
|
|
|
|
2016-03-16 05:54:00 +08:00
|
|
|
for (i = 0; i < nr; i++) {
|
|
|
|
if (s)
|
|
|
|
kmem_cache_free(s, p[i]);
|
|
|
|
else
|
|
|
|
kfree(p[i]);
|
|
|
|
}
|
2015-09-05 06:45:34 +08:00
|
|
|
}
|
|
|
|
|
2015-11-21 07:57:58 +08:00
|
|
|
int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
|
2015-09-05 06:45:34 +08:00
|
|
|
void **p)
|
|
|
|
{
|
|
|
|
size_t i;
|
|
|
|
|
|
|
|
for (i = 0; i < nr; i++) {
|
|
|
|
void *x = p[i] = kmem_cache_alloc(s, flags);
|
|
|
|
if (!x) {
|
|
|
|
__kmem_cache_free_bulk(s, i, p);
|
2015-11-21 07:57:58 +08:00
|
|
|
return 0;
|
2015-09-05 06:45:34 +08:00
|
|
|
}
|
|
|
|
}
|
2015-11-21 07:57:58 +08:00
|
|
|
return i;
|
2015-09-05 06:45:34 +08:00
|
|
|
}
|
|
|
|
|
2018-08-18 06:47:25 +08:00
|
|
|
#ifdef CONFIG_MEMCG_KMEM
|
2017-02-23 07:41:24 +08:00
|
|
|
|
|
|
|
LIST_HEAD(slab_root_caches);
|
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
void slab_init_memcg_params(struct kmem_cache *s)
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
{
|
2017-02-23 07:41:17 +08:00
|
|
|
s->memcg_params.root_cache = NULL;
|
2015-02-13 06:59:20 +08:00
|
|
|
RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
|
2017-02-23 07:41:17 +08:00
|
|
|
INIT_LIST_HEAD(&s->memcg_params.children);
|
2018-06-15 06:26:27 +08:00
|
|
|
s->memcg_params.dying = false;
|
2015-02-13 06:59:20 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static int init_memcg_params(struct kmem_cache *s,
|
|
|
|
struct mem_cgroup *memcg, struct kmem_cache *root_cache)
|
|
|
|
{
|
|
|
|
struct memcg_cache_array *arr;
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
|
2017-02-23 07:41:17 +08:00
|
|
|
if (root_cache) {
|
2015-02-13 06:59:20 +08:00
|
|
|
s->memcg_params.root_cache = root_cache;
|
2017-02-23 07:41:17 +08:00
|
|
|
s->memcg_params.memcg = memcg;
|
|
|
|
INIT_LIST_HEAD(&s->memcg_params.children_node);
|
2017-02-23 07:41:21 +08:00
|
|
|
INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
return 0;
|
2015-02-13 06:59:20 +08:00
|
|
|
}
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
slab_init_memcg_params(s);
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
if (!memcg_nr_cache_ids)
|
|
|
|
return 0;
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
|
mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
For quick per-memcg indexing, slab caches and list_lru structures
maintain linear arrays of descriptors. As the number of concurrent
memory cgroups in the system goes up, this requires large contiguous
allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
existing slab cache and list_lru, which can easily fail on loaded
systems. E.g.:
mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
Call Trace:
? __alloc_pages_direct_compact+0x4c/0x110
__alloc_pages_nodemask+0xf50/0x1430
alloc_pages_current+0x60/0xc0
kmalloc_order_trace+0x29/0x1b0
__kmalloc+0x1f4/0x320
memcg_update_all_list_lrus+0xca/0x2e0
mem_cgroup_css_alloc+0x612/0x670
cgroup_apply_control_enable+0x19e/0x360
cgroup_mkdir+0x322/0x490
kernfs_iop_mkdir+0x55/0x80
vfs_mkdir+0xd0/0x120
SyS_mkdirat+0x6c/0xe0
SyS_mkdir+0x14/0x20
entry_SYSCALL_64_fastpath+0x18/0xad
Mem-Info:
active_anon:2965 inactive_anon:19 isolated_anon:0
active_file:100270 inactive_file:98846 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:7328 slab_unreclaimable:16402
mapped:771 shmem:52 pagetables:278 bounce:0
free:13718 free_pcp:0 free_cma:0
This output is from an artificial reproducer, but we have repeatedly
observed order-7 failures in production in the Facebook fleet. These
systems become useless as they cannot run more jobs, even though there
is plenty of memory to allocate 128 individual pages.
Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
prove too large for allocating them physically contiguous.
Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-04 07:16:10 +08:00
|
|
|
arr = kvzalloc(sizeof(struct memcg_cache_array) +
|
|
|
|
memcg_nr_cache_ids * sizeof(void *),
|
|
|
|
GFP_KERNEL);
|
2015-02-13 06:59:20 +08:00
|
|
|
if (!arr)
|
|
|
|
return -ENOMEM;
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
RCU_INIT_POINTER(s->memcg_params.memcg_caches, arr);
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
static void destroy_memcg_params(struct kmem_cache *s)
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
{
|
2015-02-13 06:59:20 +08:00
|
|
|
if (is_root_cache(s))
|
mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
For quick per-memcg indexing, slab caches and list_lru structures
maintain linear arrays of descriptors. As the number of concurrent
memory cgroups in the system goes up, this requires large contiguous
allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
existing slab cache and list_lru, which can easily fail on loaded
systems. E.g.:
mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
Call Trace:
? __alloc_pages_direct_compact+0x4c/0x110
__alloc_pages_nodemask+0xf50/0x1430
alloc_pages_current+0x60/0xc0
kmalloc_order_trace+0x29/0x1b0
__kmalloc+0x1f4/0x320
memcg_update_all_list_lrus+0xca/0x2e0
mem_cgroup_css_alloc+0x612/0x670
cgroup_apply_control_enable+0x19e/0x360
cgroup_mkdir+0x322/0x490
kernfs_iop_mkdir+0x55/0x80
vfs_mkdir+0xd0/0x120
SyS_mkdirat+0x6c/0xe0
SyS_mkdir+0x14/0x20
entry_SYSCALL_64_fastpath+0x18/0xad
Mem-Info:
active_anon:2965 inactive_anon:19 isolated_anon:0
active_file:100270 inactive_file:98846 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:7328 slab_unreclaimable:16402
mapped:771 shmem:52 pagetables:278 bounce:0
free:13718 free_pcp:0 free_cma:0
This output is from an artificial reproducer, but we have repeatedly
observed order-7 failures in production in the Facebook fleet. These
systems become useless as they cannot run more jobs, even though there
is plenty of memory to allocate 128 individual pages.
Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
prove too large for allocating them physically contiguous.
Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-04 07:16:10 +08:00
|
|
|
kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void free_memcg_params(struct rcu_head *rcu)
|
|
|
|
{
|
|
|
|
struct memcg_cache_array *old;
|
|
|
|
|
|
|
|
old = container_of(rcu, struct memcg_cache_array, rcu);
|
|
|
|
kvfree(old);
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
}
|
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
static int update_memcg_params(struct kmem_cache *s, int new_array_size)
|
2014-10-10 06:28:47 +08:00
|
|
|
{
|
2015-02-13 06:59:20 +08:00
|
|
|
struct memcg_cache_array *old, *new;
|
2014-10-10 06:28:47 +08:00
|
|
|
|
mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
For quick per-memcg indexing, slab caches and list_lru structures
maintain linear arrays of descriptors. As the number of concurrent
memory cgroups in the system goes up, this requires large contiguous
allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
existing slab cache and list_lru, which can easily fail on loaded
systems. E.g.:
mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
Call Trace:
? __alloc_pages_direct_compact+0x4c/0x110
__alloc_pages_nodemask+0xf50/0x1430
alloc_pages_current+0x60/0xc0
kmalloc_order_trace+0x29/0x1b0
__kmalloc+0x1f4/0x320
memcg_update_all_list_lrus+0xca/0x2e0
mem_cgroup_css_alloc+0x612/0x670
cgroup_apply_control_enable+0x19e/0x360
cgroup_mkdir+0x322/0x490
kernfs_iop_mkdir+0x55/0x80
vfs_mkdir+0xd0/0x120
SyS_mkdirat+0x6c/0xe0
SyS_mkdir+0x14/0x20
entry_SYSCALL_64_fastpath+0x18/0xad
Mem-Info:
active_anon:2965 inactive_anon:19 isolated_anon:0
active_file:100270 inactive_file:98846 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:7328 slab_unreclaimable:16402
mapped:771 shmem:52 pagetables:278 bounce:0
free:13718 free_pcp:0 free_cma:0
This output is from an artificial reproducer, but we have repeatedly
observed order-7 failures in production in the Facebook fleet. These
systems become useless as they cannot run more jobs, even though there
is plenty of memory to allocate 128 individual pages.
Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
prove too large for allocating them physically contiguous.
Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-04 07:16:10 +08:00
|
|
|
new = kvzalloc(sizeof(struct memcg_cache_array) +
|
|
|
|
new_array_size * sizeof(void *), GFP_KERNEL);
|
2015-02-13 06:59:20 +08:00
|
|
|
if (!new)
|
2014-10-10 06:28:47 +08:00
|
|
|
return -ENOMEM;
|
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
old = rcu_dereference_protected(s->memcg_params.memcg_caches,
|
|
|
|
lockdep_is_held(&slab_mutex));
|
|
|
|
if (old)
|
|
|
|
memcpy(new->entries, old->entries,
|
|
|
|
memcg_nr_cache_ids * sizeof(void *));
|
2014-10-10 06:28:47 +08:00
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
rcu_assign_pointer(s->memcg_params.memcg_caches, new);
|
|
|
|
if (old)
|
mm: memcontrol: use vmalloc fallback for large kmem memcg arrays
For quick per-memcg indexing, slab caches and list_lru structures
maintain linear arrays of descriptors. As the number of concurrent
memory cgroups in the system goes up, this requires large contiguous
allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for every
existing slab cache and list_lru, which can easily fail on loaded
systems. E.g.:
mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty #481
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
Call Trace:
? __alloc_pages_direct_compact+0x4c/0x110
__alloc_pages_nodemask+0xf50/0x1430
alloc_pages_current+0x60/0xc0
kmalloc_order_trace+0x29/0x1b0
__kmalloc+0x1f4/0x320
memcg_update_all_list_lrus+0xca/0x2e0
mem_cgroup_css_alloc+0x612/0x670
cgroup_apply_control_enable+0x19e/0x360
cgroup_mkdir+0x322/0x490
kernfs_iop_mkdir+0x55/0x80
vfs_mkdir+0xd0/0x120
SyS_mkdirat+0x6c/0xe0
SyS_mkdir+0x14/0x20
entry_SYSCALL_64_fastpath+0x18/0xad
Mem-Info:
active_anon:2965 inactive_anon:19 isolated_anon:0
active_file:100270 inactive_file:98846 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:7328 slab_unreclaimable:16402
mapped:771 shmem:52 pagetables:278 bounce:0
free:13718 free_pcp:0 free_cma:0
This output is from an artificial reproducer, but we have repeatedly
observed order-7 failures in production in the Facebook fleet. These
systems become useless as they cannot run more jobs, even though there
is plenty of memory to allocate 128 individual pages.
Use kvmalloc and kvzalloc to fall back to vmalloc space if these arrays
prove too large for allocating them physically contiguous.
Link: http://lkml.kernel.org/r/20170918184919.20644-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-04 07:16:10 +08:00
|
|
|
call_rcu(&old->rcu, free_memcg_params);
|
2014-10-10 06:28:47 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
memcg: allocate memory for memcg caches whenever a new memcg appears
Every cache that is considered a root cache (basically the "original"
caches, tied to the root memcg/no-memcg) will have an array that should be
large enough to store a cache pointer per each memcg in the system.
Theoreticaly, this is as high as 1 << sizeof(css_id), which is currently
in the 64k pointers range. Most of the time, we won't be using that much.
What goes in this patch, is a simple scheme to dynamically allocate such
an array, in order to minimize memory usage for memcg caches. Because we
would also like to avoid allocations all the time, at least for now, the
array will only grow. It will tend to be big enough to hold the maximum
number of kmem-limited memcgs ever achieved.
We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have
more than that, we'll start doubling the size of this array every time the
limit is reached.
Because we are only considering kmem limited memcgs, a natural point for
this to happen is when we write to the limit. At that point, we already
have set_limit_mutex held, so that will become our natural synchronization
mechanism.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:22:38 +08:00
|
|
|
int memcg_update_all_caches(int num_memcgs)
|
|
|
|
{
|
|
|
|
struct kmem_cache *s;
|
|
|
|
int ret = 0;
|
|
|
|
|
2015-02-13 06:59:01 +08:00
|
|
|
mutex_lock(&slab_mutex);
|
2017-02-23 07:41:24 +08:00
|
|
|
list_for_each_entry(s, &slab_root_caches, root_caches_node) {
|
2015-02-13 06:59:20 +08:00
|
|
|
ret = update_memcg_params(s, num_memcgs);
|
memcg: allocate memory for memcg caches whenever a new memcg appears
Every cache that is considered a root cache (basically the "original"
caches, tied to the root memcg/no-memcg) will have an array that should be
large enough to store a cache pointer per each memcg in the system.
Theoreticaly, this is as high as 1 << sizeof(css_id), which is currently
in the 64k pointers range. Most of the time, we won't be using that much.
What goes in this patch, is a simple scheme to dynamically allocate such
an array, in order to minimize memory usage for memcg caches. Because we
would also like to avoid allocations all the time, at least for now, the
array will only grow. It will tend to be big enough to hold the maximum
number of kmem-limited memcgs ever achieved.
We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have
more than that, we'll start doubling the size of this array every time the
limit is reached.
Because we are only considering kmem limited memcgs, a natural point for
this to happen is when we write to the limit. At that point, we already
have set_limit_mutex held, so that will become our natural synchronization
mechanism.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:22:38 +08:00
|
|
|
/*
|
|
|
|
* Instead of freeing the memory, we'll just leave the caches
|
|
|
|
* up to this point in an updated state.
|
|
|
|
*/
|
|
|
|
if (ret)
|
2015-02-13 06:59:01 +08:00
|
|
|
break;
|
memcg: allocate memory for memcg caches whenever a new memcg appears
Every cache that is considered a root cache (basically the "original"
caches, tied to the root memcg/no-memcg) will have an array that should be
large enough to store a cache pointer per each memcg in the system.
Theoreticaly, this is as high as 1 << sizeof(css_id), which is currently
in the 64k pointers range. Most of the time, we won't be using that much.
What goes in this patch, is a simple scheme to dynamically allocate such
an array, in order to minimize memory usage for memcg caches. Because we
would also like to avoid allocations all the time, at least for now, the
array will only grow. It will tend to be big enough to hold the maximum
number of kmem-limited memcgs ever achieved.
We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have
more than that, we'll start doubling the size of this array every time the
limit is reached.
Because we are only considering kmem limited memcgs, a natural point for
this to happen is when we write to the limit. At that point, we already
have set_limit_mutex held, so that will become our natural synchronization
mechanism.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:22:38 +08:00
|
|
|
}
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
return ret;
|
|
|
|
}
|
2017-02-23 07:41:14 +08:00
|
|
|
|
2017-02-23 07:41:24 +08:00
|
|
|
void memcg_link_cache(struct kmem_cache *s)
|
2017-02-23 07:41:14 +08:00
|
|
|
{
|
2017-02-23 07:41:24 +08:00
|
|
|
if (is_root_cache(s)) {
|
|
|
|
list_add(&s->root_caches_node, &slab_root_caches);
|
|
|
|
} else {
|
|
|
|
list_add(&s->memcg_params.children_node,
|
|
|
|
&s->memcg_params.root_cache->memcg_params.children);
|
|
|
|
list_add(&s->memcg_params.kmem_caches_node,
|
|
|
|
&s->memcg_params.memcg->kmem_caches);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void memcg_unlink_cache(struct kmem_cache *s)
|
|
|
|
{
|
|
|
|
if (is_root_cache(s)) {
|
|
|
|
list_del(&s->root_caches_node);
|
|
|
|
} else {
|
|
|
|
list_del(&s->memcg_params.children_node);
|
|
|
|
list_del(&s->memcg_params.kmem_caches_node);
|
|
|
|
}
|
2017-02-23 07:41:14 +08:00
|
|
|
}
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
#else
|
2015-02-13 06:59:20 +08:00
|
|
|
static inline int init_memcg_params(struct kmem_cache *s,
|
|
|
|
struct mem_cgroup *memcg, struct kmem_cache *root_cache)
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
static inline void destroy_memcg_params(struct kmem_cache *s)
|
memcg: move memcg_{alloc,free}_cache_params to slab_common.c
The only reason why they live in memcontrol.c is that we get/put css
reference to the owner memory cgroup in them. However, we can do that in
memcg_{un,}register_cache. OTOH, there are several reasons to move them
to slab_common.c.
First, I think that the less public interface functions we have in
memcontrol.h the better. Since the functions I move don't depend on
memcontrol, I think it's worth making them private to slab, especially
taking into account that the arrays are defined on the slab's side too.
Second, the way how per-memcg arrays are updated looks rather awkward: it
proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
(memcg_update_all_caches) and back to memcontrol.c again
(memcg_update_array_size). In the following patches I move the function
relocating the arrays (memcg_update_array_size) to slab_common.c and
therefore get rid this circular call path. I think we should have the
cache allocation stuff in the same place where we have relocation, because
it's easier to follow the code then. So I move arrays alloc/free
functions to slab_common.c too.
The third point isn't obvious. I'm going to make the list_lru structure
per-memcg to allow targeted kmem reclaim. That means we will have
per-memcg arrays in list_lrus too. It turns out that it's much easier to
update these arrays in list_lru.c rather than in memcontrol.c, because all
the stuff we need is defined there. This patch makes memcg caches arrays
allocation path conform that of the upcoming list_lru.
So let's move these functions to slab_common.c and make them static.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-10 06:28:43 +08:00
|
|
|
{
|
|
|
|
}
|
2017-02-23 07:41:14 +08:00
|
|
|
|
2017-02-23 07:41:24 +08:00
|
|
|
static inline void memcg_unlink_cache(struct kmem_cache *s)
|
2017-02-23 07:41:14 +08:00
|
|
|
{
|
|
|
|
}
|
2018-08-18 06:47:25 +08:00
|
|
|
#endif /* CONFIG_MEMCG_KMEM */
|
memcg: allocate memory for memcg caches whenever a new memcg appears
Every cache that is considered a root cache (basically the "original"
caches, tied to the root memcg/no-memcg) will have an array that should be
large enough to store a cache pointer per each memcg in the system.
Theoreticaly, this is as high as 1 << sizeof(css_id), which is currently
in the 64k pointers range. Most of the time, we won't be using that much.
What goes in this patch, is a simple scheme to dynamically allocate such
an array, in order to minimize memory usage for memcg caches. Because we
would also like to avoid allocations all the time, at least for now, the
array will only grow. It will tend to be big enough to hold the maximum
number of kmem-limited memcgs ever achieved.
We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have
more than that, we'll start doubling the size of this array every time the
limit is reached.
Because we are only considering kmem limited memcgs, a natural point for
this to happen is when we write to the limit. At that point, we already
have set_limit_mutex held, so that will become our natural synchronization
mechanism.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 06:22:38 +08:00
|
|
|
|
2018-02-01 08:15:36 +08:00
|
|
|
/*
|
|
|
|
* Figure out what the alignment of the objects will be given a set of
|
|
|
|
* flags, a user specified alignment and the size of the objects.
|
|
|
|
*/
|
2018-04-06 07:20:37 +08:00
|
|
|
static unsigned int calculate_alignment(slab_flags_t flags,
|
|
|
|
unsigned int align, unsigned int size)
|
2018-02-01 08:15:36 +08:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If the user wants hardware cache aligned objects then follow that
|
|
|
|
* suggestion if the object is sufficiently large.
|
|
|
|
*
|
|
|
|
* The hardware cache alignment cannot override the specified
|
|
|
|
* alignment though. If that is greater then use it.
|
|
|
|
*/
|
|
|
|
if (flags & SLAB_HWCACHE_ALIGN) {
|
2018-04-06 07:20:37 +08:00
|
|
|
unsigned int ralign;
|
2018-02-01 08:15:36 +08:00
|
|
|
|
|
|
|
ralign = cache_line_size();
|
|
|
|
while (size <= ralign / 2)
|
|
|
|
ralign /= 2;
|
|
|
|
align = max(align, ralign);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (align < ARCH_SLAB_MINALIGN)
|
|
|
|
align = ARCH_SLAB_MINALIGN;
|
|
|
|
|
|
|
|
return ALIGN(align, sizeof(void *));
|
|
|
|
}
|
|
|
|
|
2014-10-10 06:26:22 +08:00
|
|
|
/*
|
|
|
|
* Find a mergeable slab cache
|
|
|
|
*/
|
|
|
|
int slab_unmergeable(struct kmem_cache *s)
|
|
|
|
{
|
|
|
|
if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
if (!is_root_cache(s))
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
if (s->ctor)
|
|
|
|
return 1;
|
|
|
|
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
if (s->usersize)
|
|
|
|
return 1;
|
|
|
|
|
2014-10-10 06:26:22 +08:00
|
|
|
/*
|
|
|
|
* We may have set a slab to be unmergeable during bootstrap.
|
|
|
|
*/
|
|
|
|
if (s->refcount < 0)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-04-06 07:20:37 +08:00
|
|
|
struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
|
2017-11-16 09:32:18 +08:00
|
|
|
slab_flags_t flags, const char *name, void (*ctor)(void *))
|
2014-10-10 06:26:22 +08:00
|
|
|
{
|
|
|
|
struct kmem_cache *s;
|
|
|
|
|
2017-02-23 07:40:59 +08:00
|
|
|
if (slab_nomerge)
|
2014-10-10 06:26:22 +08:00
|
|
|
return NULL;
|
|
|
|
|
|
|
|
if (ctor)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
size = ALIGN(size, sizeof(void *));
|
|
|
|
align = calculate_alignment(flags, align, size);
|
|
|
|
size = ALIGN(size, align);
|
|
|
|
flags = kmem_cache_flags(size, flags, name, NULL);
|
|
|
|
|
2017-02-23 07:40:59 +08:00
|
|
|
if (flags & SLAB_NEVER_MERGE)
|
|
|
|
return NULL;
|
|
|
|
|
2017-02-23 07:41:24 +08:00
|
|
|
list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) {
|
2014-10-10 06:26:22 +08:00
|
|
|
if (slab_unmergeable(s))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (size > s->size)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* Check if alignment is compatible.
|
|
|
|
* Courtesy of Adrian Drzewiecki
|
|
|
|
*/
|
|
|
|
if ((s->size & ~(align - 1)) != s->size)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (s->size - size >= sizeof(void *))
|
|
|
|
continue;
|
|
|
|
|
mm/slab: fix unalignment problem on Malta with EVA due to slab merge
Unlike SLUB, sometimes, object isn't started at the beginning of the
slab in SLAB. This causes the unalignment problem after slab merging is
supported by commit 12220dea07f1 ("mm/slab: support slab merge").
Following is the report from Markos that fail to boot on Malta with EVA.
Calibrating delay loop... 19.86 BogoMIPS (lpj=99328)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 4096 (order: 0, 16384 bytes)
Mountpoint-cache hash table entries: 4096 (order: 0, 16384 bytes)
Kernel bug detected[#1]:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.0-05639-g12220dea07f1 #1631
task: 1f04f5d8 ti: 1f050000 task.ti: 1f050000
epc : 80141190 alloc_unbound_pwq+0x234/0x304
Not tainted
ra : 80141184 alloc_unbound_pwq+0x228/0x304
Process swapper/0 (pid: 1, threadinfo=1f050000, task=1f04f5d8, tls=00000000)
Call Trace:
alloc_unbound_pwq+0x234/0x304
apply_workqueue_attrs+0x11c/0x294
__alloc_workqueue_key+0x23c/0x470
init_workqueues+0x320/0x400
do_one_initcall+0xe8/0x23c
kernel_init_freeable+0x9c/0x224
kernel_init+0x10/0x100
ret_from_kernel_thread+0x14/0x1c
[ end trace cb88537fdc8fa200 ]
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
alloc_unbound_pwq() allocates slab object from pool_workqueue. This
kmem_cache requires 256 bytes alignment, but, current merging code
doesn't honor that, and merge it with kmalloc-256. kmalloc-256 requires
only cacheline size alignment so that above failure occurs. However, in
x86, kmalloc-256 is luckily aligned in 256 bytes, so the problem didn't
happen on it.
To fix this problem, this patch introduces alignment mismatch check in
find_mergeable(). This will fix the problem.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Markos Chandras <Markos.Chandras@imgtec.com>
Tested-by: Markos Chandras <Markos.Chandras@imgtec.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-14 07:19:25 +08:00
|
|
|
if (IS_ENABLED(CONFIG_SLAB) && align &&
|
|
|
|
(align > s->align || s->align % align))
|
|
|
|
continue;
|
|
|
|
|
2014-10-10 06:26:22 +08:00
|
|
|
return s;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2015-11-06 10:45:08 +08:00
|
|
|
static struct kmem_cache *create_cache(const char *name,
|
2018-04-06 07:21:50 +08:00
|
|
|
unsigned int object_size, unsigned int align,
|
2018-04-06 07:21:31 +08:00
|
|
|
slab_flags_t flags, unsigned int useroffset,
|
|
|
|
unsigned int usersize, void (*ctor)(void *),
|
2015-11-06 10:45:08 +08:00
|
|
|
struct mem_cgroup *memcg, struct kmem_cache *root_cache)
|
2014-04-08 06:39:26 +08:00
|
|
|
{
|
|
|
|
struct kmem_cache *s;
|
|
|
|
int err;
|
|
|
|
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
if (WARN_ON(useroffset + usersize > object_size))
|
|
|
|
useroffset = usersize = 0;
|
|
|
|
|
2014-04-08 06:39:26 +08:00
|
|
|
err = -ENOMEM;
|
|
|
|
s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
|
|
|
|
if (!s)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
s->name = name;
|
2018-04-06 07:21:50 +08:00
|
|
|
s->size = s->object_size = object_size;
|
2014-04-08 06:39:26 +08:00
|
|
|
s->align = align;
|
|
|
|
s->ctor = ctor;
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
s->useroffset = useroffset;
|
|
|
|
s->usersize = usersize;
|
2014-04-08 06:39:26 +08:00
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
err = init_memcg_params(s, memcg, root_cache);
|
2014-04-08 06:39:26 +08:00
|
|
|
if (err)
|
|
|
|
goto out_free_cache;
|
|
|
|
|
|
|
|
err = __kmem_cache_create(s, flags);
|
|
|
|
if (err)
|
|
|
|
goto out_free_cache;
|
|
|
|
|
|
|
|
s->refcount = 1;
|
|
|
|
list_add(&s->list, &slab_caches);
|
2017-02-23 07:41:24 +08:00
|
|
|
memcg_link_cache(s);
|
2014-04-08 06:39:26 +08:00
|
|
|
out:
|
|
|
|
if (err)
|
|
|
|
return ERR_PTR(err);
|
|
|
|
return s;
|
|
|
|
|
|
|
|
out_free_cache:
|
2015-02-13 06:59:20 +08:00
|
|
|
destroy_memcg_params(s);
|
2015-02-11 06:09:40 +08:00
|
|
|
kmem_cache_free(kmem_cache, s);
|
2014-04-08 06:39:26 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2012-11-29 00:23:16 +08:00
|
|
|
|
2012-08-16 15:09:46 +08:00
|
|
|
/*
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
* kmem_cache_create_usercopy - Create a cache.
|
2012-08-16 15:09:46 +08:00
|
|
|
* @name: A string which is used in /proc/slabinfo to identify this cache.
|
|
|
|
* @size: The size of objects to be created in this cache.
|
|
|
|
* @align: The required alignment for the objects.
|
|
|
|
* @flags: SLAB flags
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
* @useroffset: Usercopy region offset
|
|
|
|
* @usersize: Usercopy region size
|
2012-08-16 15:09:46 +08:00
|
|
|
* @ctor: A constructor for the objects.
|
|
|
|
*
|
|
|
|
* Returns a ptr to the cache on success, NULL on failure.
|
|
|
|
* Cannot be called within a interrupt, but can be interrupted.
|
|
|
|
* The @ctor is run when new pages are allocated by the cache.
|
|
|
|
*
|
|
|
|
* The flags are
|
|
|
|
*
|
|
|
|
* %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
|
|
|
|
* to catch references to uninitialised memory.
|
|
|
|
*
|
|
|
|
* %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to check
|
|
|
|
* for buffer overruns.
|
|
|
|
*
|
|
|
|
* %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
|
|
|
|
* cacheline. This can be beneficial if you're counting cycles as closely
|
|
|
|
* as davem.
|
|
|
|
*/
|
2012-12-19 06:22:34 +08:00
|
|
|
struct kmem_cache *
|
2018-04-06 07:20:37 +08:00
|
|
|
kmem_cache_create_usercopy(const char *name,
|
|
|
|
unsigned int size, unsigned int align,
|
2018-04-06 07:21:31 +08:00
|
|
|
slab_flags_t flags,
|
|
|
|
unsigned int useroffset, unsigned int usersize,
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
void (*ctor)(void *))
|
2012-08-16 15:09:46 +08:00
|
|
|
{
|
2015-11-06 10:45:43 +08:00
|
|
|
struct kmem_cache *s = NULL;
|
2015-02-14 06:36:38 +08:00
|
|
|
const char *cache_name;
|
2014-01-24 07:52:55 +08:00
|
|
|
int err;
|
2012-07-07 04:25:10 +08:00
|
|
|
|
2012-08-16 15:09:46 +08:00
|
|
|
get_online_cpus();
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:20 +08:00
|
|
|
get_online_mems();
|
2015-02-13 06:59:01 +08:00
|
|
|
memcg_get_cache_ids();
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:20 +08:00
|
|
|
|
2012-08-16 15:09:46 +08:00
|
|
|
mutex_lock(&slab_mutex);
|
2012-09-05 08:20:33 +08:00
|
|
|
|
2014-04-08 06:39:26 +08:00
|
|
|
err = kmem_cache_sanity_check(name, size);
|
2014-10-10 06:25:58 +08:00
|
|
|
if (err) {
|
2014-01-24 07:52:55 +08:00
|
|
|
goto out_unlock;
|
2014-10-10 06:25:58 +08:00
|
|
|
}
|
2012-09-05 08:20:33 +08:00
|
|
|
|
2016-12-13 08:41:38 +08:00
|
|
|
/* Refuse requests with allocator specific flags */
|
|
|
|
if (flags & ~SLAB_FLAGS_PERMITTED) {
|
|
|
|
err = -EINVAL;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2012-10-17 19:36:51 +08:00
|
|
|
/*
|
|
|
|
* Some allocators will constraint the set of valid flags to a subset
|
|
|
|
* of all flags. We expect them to define CACHE_CREATE_MASK in this
|
|
|
|
* case, and we'll just provide them with a sanitized version of the
|
|
|
|
* passed flags.
|
|
|
|
*/
|
|
|
|
flags &= CACHE_CREATE_MASK;
|
2012-09-05 08:20:33 +08:00
|
|
|
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
/* Fail closed on bad usersize of useroffset values. */
|
|
|
|
if (WARN_ON(!usersize && useroffset) ||
|
|
|
|
WARN_ON(size < usersize || size - usersize < useroffset))
|
|
|
|
usersize = useroffset = 0;
|
|
|
|
|
|
|
|
if (!usersize)
|
|
|
|
s = __kmem_cache_alias(name, size, align, flags, ctor);
|
2014-04-08 06:39:26 +08:00
|
|
|
if (s)
|
2014-01-24 07:52:55 +08:00
|
|
|
goto out_unlock;
|
2012-12-19 06:22:34 +08:00
|
|
|
|
2015-02-14 06:36:38 +08:00
|
|
|
cache_name = kstrdup_const(name, GFP_KERNEL);
|
2014-04-08 06:39:26 +08:00
|
|
|
if (!cache_name) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
2012-09-05 07:38:33 +08:00
|
|
|
|
2018-04-06 07:21:50 +08:00
|
|
|
s = create_cache(cache_name, size,
|
2015-11-06 10:45:08 +08:00
|
|
|
calculate_alignment(flags, align, size),
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
flags, useroffset, usersize, ctor, NULL, NULL);
|
2014-04-08 06:39:26 +08:00
|
|
|
if (IS_ERR(s)) {
|
|
|
|
err = PTR_ERR(s);
|
2015-02-14 06:36:38 +08:00
|
|
|
kfree_const(cache_name);
|
2014-04-08 06:39:26 +08:00
|
|
|
}
|
2014-01-24 07:52:55 +08:00
|
|
|
|
|
|
|
out_unlock:
|
2012-07-07 04:25:13 +08:00
|
|
|
mutex_unlock(&slab_mutex);
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:20 +08:00
|
|
|
|
2015-02-13 06:59:01 +08:00
|
|
|
memcg_put_cache_ids();
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:20 +08:00
|
|
|
put_online_mems();
|
2012-07-07 04:25:13 +08:00
|
|
|
put_online_cpus();
|
|
|
|
|
slab: fix wrong retval on kmem_cache_create_memcg error path
On kmem_cache_create_memcg() error path we set 'err', but leave 's' (the
new cache ptr) undefined. The latter can be NULL if we could not
allocate the cache, or pointing to a freed area if we failed somewhere
later while trying to initialize it. Initially we checked 'err'
immediately before exiting the function and returned NULL if it was set
ignoring the value of 's':
out_unlock:
...
if (err) {
/* report error */
return NULL;
}
return s;
Recently this check was, in fact, broken by commit f717eb3abb5e ("slab:
do not panic if we fail to create memcg cache"), which turned it to:
out_unlock:
...
if (err && !memcg) {
/* report error */
return NULL;
}
return s;
As a result, if we are failing creating a cache for a memcg, we will
skip the check and return 's' that can contain crap. Obviously, commit
f717eb3abb5e intended not to return crap on error allocating a cache for
a memcg, but only to remove the error reporting in this case, so the
check should look like this:
out_unlock:
...
if (err) {
if (!memcg)
return NULL;
/* report error */
return NULL;
}
return s;
[rientjes@google.com: despaghettification]
[vdavydov@parallels.com: patch monkeying]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-30 06:05:48 +08:00
|
|
|
if (err) {
|
2012-09-05 08:20:33 +08:00
|
|
|
if (flags & SLAB_PANIC)
|
|
|
|
panic("kmem_cache_create: Failed to create slab '%s'. Error %d\n",
|
|
|
|
name, err);
|
|
|
|
else {
|
2016-03-18 05:19:50 +08:00
|
|
|
pr_warn("kmem_cache_create(%s) failed with error %d\n",
|
2012-09-05 08:20:33 +08:00
|
|
|
name, err);
|
|
|
|
dump_stack();
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
2012-07-07 04:25:10 +08:00
|
|
|
return s;
|
|
|
|
}
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
EXPORT_SYMBOL(kmem_cache_create_usercopy);
|
|
|
|
|
|
|
|
struct kmem_cache *
|
2018-04-06 07:20:37 +08:00
|
|
|
kmem_cache_create(const char *name, unsigned int size, unsigned int align,
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
slab_flags_t flags, void (*ctor)(void *))
|
|
|
|
{
|
2017-06-15 07:12:04 +08:00
|
|
|
return kmem_cache_create_usercopy(name, size, align, flags, 0, 0,
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
ctor);
|
|
|
|
}
|
2014-04-08 06:39:26 +08:00
|
|
|
EXPORT_SYMBOL(kmem_cache_create);
|
2012-12-19 06:22:34 +08:00
|
|
|
|
2017-02-23 07:41:14 +08:00
|
|
|
static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
{
|
2017-02-23 07:41:14 +08:00
|
|
|
LIST_HEAD(to_destroy);
|
|
|
|
struct kmem_cache *s, *s2;
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
|
2017-02-23 07:41:14 +08:00
|
|
|
/*
|
2017-01-18 18:53:44 +08:00
|
|
|
* On destruction, SLAB_TYPESAFE_BY_RCU kmem_caches are put on the
|
2017-02-23 07:41:14 +08:00
|
|
|
* @slab_caches_to_rcu_destroy list. The slab pages are freed
|
|
|
|
* through RCU and and the associated kmem_cache are dereferenced
|
|
|
|
* while freeing the pages, so the kmem_caches should be freed only
|
|
|
|
* after the pending RCU operations are finished. As rcu_barrier()
|
|
|
|
* is a pretty slow operation, we batch all pending destructions
|
|
|
|
* asynchronously.
|
|
|
|
*/
|
|
|
|
mutex_lock(&slab_mutex);
|
|
|
|
list_splice_init(&slab_caches_to_rcu_destroy, &to_destroy);
|
|
|
|
mutex_unlock(&slab_mutex);
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
|
2017-02-23 07:41:14 +08:00
|
|
|
if (list_empty(&to_destroy))
|
|
|
|
return;
|
|
|
|
|
|
|
|
rcu_barrier();
|
|
|
|
|
|
|
|
list_for_each_entry_safe(s, s2, &to_destroy, list) {
|
|
|
|
#ifdef SLAB_SUPPORTS_SYSFS
|
|
|
|
sysfs_slab_release(s);
|
|
|
|
#else
|
|
|
|
slab_kmem_cache_release(s);
|
|
|
|
#endif
|
|
|
|
}
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
}
|
|
|
|
|
2017-02-23 07:41:14 +08:00
|
|
|
static int shutdown_cache(struct kmem_cache *s)
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
{
|
2017-02-25 07:00:05 +08:00
|
|
|
/* free asan quarantined objects */
|
|
|
|
kasan_cache_shutdown(s);
|
|
|
|
|
2017-02-23 07:41:14 +08:00
|
|
|
if (__kmem_cache_shutdown(s) != 0)
|
|
|
|
return -EBUSY;
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
|
2017-02-23 07:41:24 +08:00
|
|
|
memcg_unlink_cache(s);
|
2017-02-23 07:41:14 +08:00
|
|
|
list_del(&s->list);
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
|
2017-01-18 18:53:44 +08:00
|
|
|
if (s->flags & SLAB_TYPESAFE_BY_RCU) {
|
2018-06-28 14:26:09 +08:00
|
|
|
#ifdef SLAB_SUPPORTS_SYSFS
|
|
|
|
sysfs_slab_unlink(s);
|
|
|
|
#endif
|
2017-02-23 07:41:14 +08:00
|
|
|
list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
|
|
|
|
schedule_work(&slab_caches_to_rcu_destroy_work);
|
|
|
|
} else {
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
#ifdef SLAB_SUPPORTS_SYSFS
|
2018-06-28 14:26:09 +08:00
|
|
|
sysfs_slab_unlink(s);
|
2017-02-23 07:41:11 +08:00
|
|
|
sysfs_slab_release(s);
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
#else
|
|
|
|
slab_kmem_cache_release(s);
|
|
|
|
#endif
|
|
|
|
}
|
2017-02-23 07:41:14 +08:00
|
|
|
|
|
|
|
return 0;
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
}
|
|
|
|
|
2018-08-18 06:47:25 +08:00
|
|
|
#ifdef CONFIG_MEMCG_KMEM
|
2014-04-08 06:39:26 +08:00
|
|
|
/*
|
2014-06-05 07:10:02 +08:00
|
|
|
* memcg_create_kmem_cache - Create a cache for a memory cgroup.
|
2014-04-08 06:39:26 +08:00
|
|
|
* @memcg: The memory cgroup the new cache is for.
|
|
|
|
* @root_cache: The parent of the new cache.
|
|
|
|
*
|
|
|
|
* This function attempts to create a kmem cache that will serve allocation
|
|
|
|
* requests going from @memcg to @root_cache. The new cache inherits properties
|
|
|
|
* from its parent.
|
|
|
|
*/
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
void memcg_create_kmem_cache(struct mem_cgroup *memcg,
|
|
|
|
struct kmem_cache *root_cache)
|
2012-12-19 06:22:34 +08:00
|
|
|
{
|
2015-02-11 06:11:44 +08:00
|
|
|
static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */
|
2015-09-09 06:01:02 +08:00
|
|
|
struct cgroup_subsys_state *css = &memcg->css;
|
2015-02-13 06:59:20 +08:00
|
|
|
struct memcg_cache_array *arr;
|
memcg, slab: simplify synchronization scheme
At present, we have the following mutexes protecting data related to per
memcg kmem caches:
- slab_mutex. This one is held during the whole kmem cache creation
and destruction paths. We also take it when updating per root cache
memcg_caches arrays (see memcg_update_all_caches). As a result, taking
it guarantees there will be no changes to any kmem cache (including per
memcg). Why do we need something else then? The point is it is
private to slab implementation and has some internal dependencies with
other mutexes (get_online_cpus). So we just don't want to rely upon it
and prefer to introduce additional mutexes instead.
- activate_kmem_mutex. Initially it was added to synchronize
initializing kmem limit (memcg_activate_kmem). However, since we can
grow per root cache memcg_caches arrays only on kmem limit
initialization (see memcg_update_all_caches), we also employ it to
protect against memcg_caches arrays relocation (e.g. see
__kmem_cache_destroy_memcg_children).
- We have a convention not to take slab_mutex in memcontrol.c, but we
want to walk over per memcg memcg_slab_caches lists there (e.g. for
destroying all memcg caches on offline). So we have per memcg
slab_caches_mutex's protecting those lists.
The mutexes are taken in the following order:
activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex
Such a syncrhonization scheme has a number of flaws, for instance:
- We can't call kmem_cache_{destroy,shrink} while walking over a
memcg::memcg_slab_caches list due to locking order. As a result, in
mem_cgroup_destroy_all_caches we schedule the
memcg_cache_params::destroy work shrinking and destroying the cache.
- We don't have a mutex to synchronize per memcg caches destruction
between memcg offline (mem_cgroup_destroy_all_caches) and root cache
destruction (__kmem_cache_destroy_memcg_children). Currently we just
don't bother about it.
This patch simplifies it by substituting per memcg slab_caches_mutex's
with the global memcg_slab_mutex. It will be held whenever a new per
memcg cache is created or destroyed, so it protects per root cache
memcg_caches arrays and per memcg memcg_slab_caches lists. The locking
order is following:
activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex
This allows us to call kmem_cache_{create,shrink,destroy} under the
memcg_slab_mutex. As a result, we don't need memcg_cache_params::destroy
work any more - we can simply destroy caches while iterating over a per
memcg slab caches list.
Also using the global mutex simplifies synchronization between concurrent
per memcg caches creation/destruction, e.g. mem_cgroup_destroy_all_caches
vs __kmem_cache_destroy_memcg_children.
The downside of this is that we substitute per-memcg slab_caches_mutex's
with a hummer-like global mutex, but since we already take either the
slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
shouldn't hurt concurrency a lot.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:40 +08:00
|
|
|
struct kmem_cache *s = NULL;
|
2014-04-08 06:39:26 +08:00
|
|
|
char *cache_name;
|
2015-02-13 06:59:20 +08:00
|
|
|
int idx;
|
2014-04-08 06:39:26 +08:00
|
|
|
|
|
|
|
get_online_cpus();
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:20 +08:00
|
|
|
get_online_mems();
|
|
|
|
|
2014-04-08 06:39:26 +08:00
|
|
|
mutex_lock(&slab_mutex);
|
|
|
|
|
2015-02-13 06:59:32 +08:00
|
|
|
/*
|
2016-01-21 07:02:24 +08:00
|
|
|
* The memory cgroup could have been offlined while the cache
|
2015-02-13 06:59:32 +08:00
|
|
|
* creation work was pending.
|
|
|
|
*/
|
2018-06-15 06:26:27 +08:00
|
|
|
if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying)
|
2015-02-13 06:59:32 +08:00
|
|
|
goto out_unlock;
|
|
|
|
|
2015-02-13 06:59:20 +08:00
|
|
|
idx = memcg_cache_id(memcg);
|
|
|
|
arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches,
|
|
|
|
lockdep_is_held(&slab_mutex));
|
|
|
|
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
/*
|
|
|
|
* Since per-memcg caches are created asynchronously on first
|
|
|
|
* allocation (see memcg_kmem_get_cache()), several threads can try to
|
|
|
|
* create the same cache, but only one of them may succeed.
|
|
|
|
*/
|
2015-02-13 06:59:20 +08:00
|
|
|
if (arr->entries[idx])
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
goto out_unlock;
|
|
|
|
|
2015-02-13 06:59:29 +08:00
|
|
|
cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
|
2016-07-21 06:44:57 +08:00
|
|
|
cache_name = kasprintf(GFP_KERNEL, "%s(%llu:%s)", root_cache->name,
|
|
|
|
css->serial_nr, memcg_name_buf);
|
2014-04-08 06:39:26 +08:00
|
|
|
if (!cache_name)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2015-11-06 10:45:08 +08:00
|
|
|
s = create_cache(cache_name, root_cache->object_size,
|
2018-04-06 07:21:50 +08:00
|
|
|
root_cache->align,
|
2016-11-11 02:46:41 +08:00
|
|
|
root_cache->flags & CACHE_CREATE_MASK,
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
root_cache->useroffset, root_cache->usersize,
|
2016-11-11 02:46:41 +08:00
|
|
|
root_cache->ctor, memcg, root_cache);
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
/*
|
|
|
|
* If we could not create a memcg cache, do not complain, because
|
|
|
|
* that's not critical at all as we can always proceed with the root
|
|
|
|
* cache.
|
|
|
|
*/
|
memcg, slab: simplify synchronization scheme
At present, we have the following mutexes protecting data related to per
memcg kmem caches:
- slab_mutex. This one is held during the whole kmem cache creation
and destruction paths. We also take it when updating per root cache
memcg_caches arrays (see memcg_update_all_caches). As a result, taking
it guarantees there will be no changes to any kmem cache (including per
memcg). Why do we need something else then? The point is it is
private to slab implementation and has some internal dependencies with
other mutexes (get_online_cpus). So we just don't want to rely upon it
and prefer to introduce additional mutexes instead.
- activate_kmem_mutex. Initially it was added to synchronize
initializing kmem limit (memcg_activate_kmem). However, since we can
grow per root cache memcg_caches arrays only on kmem limit
initialization (see memcg_update_all_caches), we also employ it to
protect against memcg_caches arrays relocation (e.g. see
__kmem_cache_destroy_memcg_children).
- We have a convention not to take slab_mutex in memcontrol.c, but we
want to walk over per memcg memcg_slab_caches lists there (e.g. for
destroying all memcg caches on offline). So we have per memcg
slab_caches_mutex's protecting those lists.
The mutexes are taken in the following order:
activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex
Such a syncrhonization scheme has a number of flaws, for instance:
- We can't call kmem_cache_{destroy,shrink} while walking over a
memcg::memcg_slab_caches list due to locking order. As a result, in
mem_cgroup_destroy_all_caches we schedule the
memcg_cache_params::destroy work shrinking and destroying the cache.
- We don't have a mutex to synchronize per memcg caches destruction
between memcg offline (mem_cgroup_destroy_all_caches) and root cache
destruction (__kmem_cache_destroy_memcg_children). Currently we just
don't bother about it.
This patch simplifies it by substituting per memcg slab_caches_mutex's
with the global memcg_slab_mutex. It will be held whenever a new per
memcg cache is created or destroyed, so it protects per root cache
memcg_caches arrays and per memcg memcg_slab_caches lists. The locking
order is following:
activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex
This allows us to call kmem_cache_{create,shrink,destroy} under the
memcg_slab_mutex. As a result, we don't need memcg_cache_params::destroy
work any more - we can simply destroy caches while iterating over a per
memcg slab caches list.
Also using the global mutex simplifies synchronization between concurrent
per memcg caches creation/destruction, e.g. mem_cgroup_destroy_all_caches
vs __kmem_cache_destroy_memcg_children.
The downside of this is that we substitute per-memcg slab_caches_mutex's
with a hummer-like global mutex, but since we already take either the
slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
shouldn't hurt concurrency a lot.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:40 +08:00
|
|
|
if (IS_ERR(s)) {
|
2014-04-08 06:39:26 +08:00
|
|
|
kfree(cache_name);
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
goto out_unlock;
|
memcg, slab: simplify synchronization scheme
At present, we have the following mutexes protecting data related to per
memcg kmem caches:
- slab_mutex. This one is held during the whole kmem cache creation
and destruction paths. We also take it when updating per root cache
memcg_caches arrays (see memcg_update_all_caches). As a result, taking
it guarantees there will be no changes to any kmem cache (including per
memcg). Why do we need something else then? The point is it is
private to slab implementation and has some internal dependencies with
other mutexes (get_online_cpus). So we just don't want to rely upon it
and prefer to introduce additional mutexes instead.
- activate_kmem_mutex. Initially it was added to synchronize
initializing kmem limit (memcg_activate_kmem). However, since we can
grow per root cache memcg_caches arrays only on kmem limit
initialization (see memcg_update_all_caches), we also employ it to
protect against memcg_caches arrays relocation (e.g. see
__kmem_cache_destroy_memcg_children).
- We have a convention not to take slab_mutex in memcontrol.c, but we
want to walk over per memcg memcg_slab_caches lists there (e.g. for
destroying all memcg caches on offline). So we have per memcg
slab_caches_mutex's protecting those lists.
The mutexes are taken in the following order:
activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex
Such a syncrhonization scheme has a number of flaws, for instance:
- We can't call kmem_cache_{destroy,shrink} while walking over a
memcg::memcg_slab_caches list due to locking order. As a result, in
mem_cgroup_destroy_all_caches we schedule the
memcg_cache_params::destroy work shrinking and destroying the cache.
- We don't have a mutex to synchronize per memcg caches destruction
between memcg offline (mem_cgroup_destroy_all_caches) and root cache
destruction (__kmem_cache_destroy_memcg_children). Currently we just
don't bother about it.
This patch simplifies it by substituting per memcg slab_caches_mutex's
with the global memcg_slab_mutex. It will be held whenever a new per
memcg cache is created or destroyed, so it protects per root cache
memcg_caches arrays and per memcg memcg_slab_caches lists. The locking
order is following:
activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex
This allows us to call kmem_cache_{create,shrink,destroy} under the
memcg_slab_mutex. As a result, we don't need memcg_cache_params::destroy
work any more - we can simply destroy caches while iterating over a per
memcg slab caches list.
Also using the global mutex simplifies synchronization between concurrent
per memcg caches creation/destruction, e.g. mem_cgroup_destroy_all_caches
vs __kmem_cache_destroy_memcg_children.
The downside of this is that we substitute per-memcg slab_caches_mutex's
with a hummer-like global mutex, but since we already take either the
slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
shouldn't hurt concurrency a lot.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:40 +08:00
|
|
|
}
|
2014-04-08 06:39:26 +08:00
|
|
|
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
/*
|
|
|
|
* Since readers won't lock (see cache_from_memcg_idx()), we need a
|
|
|
|
* barrier here to ensure nobody will see the kmem_cache partially
|
|
|
|
* initialized.
|
|
|
|
*/
|
|
|
|
smp_wmb();
|
2015-02-13 06:59:20 +08:00
|
|
|
arr->entries[idx] = s;
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
|
2014-04-08 06:39:26 +08:00
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&slab_mutex);
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:20 +08:00
|
|
|
|
|
|
|
put_online_mems();
|
2014-04-08 06:39:26 +08:00
|
|
|
put_online_cpus();
|
2012-12-19 06:22:34 +08:00
|
|
|
}
|
2014-04-08 06:39:28 +08:00
|
|
|
|
2017-02-23 07:41:30 +08:00
|
|
|
static void kmemcg_deactivate_workfn(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct kmem_cache *s = container_of(work, struct kmem_cache,
|
|
|
|
memcg_params.deact_work);
|
|
|
|
|
|
|
|
get_online_cpus();
|
|
|
|
get_online_mems();
|
|
|
|
|
|
|
|
mutex_lock(&slab_mutex);
|
|
|
|
|
|
|
|
s->memcg_params.deact_fn(s);
|
|
|
|
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
|
|
|
|
put_online_mems();
|
|
|
|
put_online_cpus();
|
|
|
|
|
|
|
|
/* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
|
|
|
|
css_put(&s->memcg_params.memcg->css);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void kmemcg_deactivate_rcufn(struct rcu_head *head)
|
|
|
|
{
|
|
|
|
struct kmem_cache *s = container_of(head, struct kmem_cache,
|
|
|
|
memcg_params.deact_rcu_head);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to grab blocking locks. Bounce to ->deact_work. The
|
|
|
|
* work item shares the space with the RCU head and can't be
|
|
|
|
* initialized eariler.
|
|
|
|
*/
|
|
|
|
INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
|
2017-02-23 07:41:36 +08:00
|
|
|
queue_work(memcg_kmem_cache_wq, &s->memcg_params.deact_work);
|
2017-02-23 07:41:30 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
|
|
|
|
* sched RCU grace period
|
|
|
|
* @s: target kmem_cache
|
|
|
|
* @deact_fn: deactivation function to call
|
|
|
|
*
|
|
|
|
* Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
|
|
|
|
* held after a sched RCU grace period. The slab is guaranteed to stay
|
|
|
|
* alive until @deact_fn is finished. This is to be used from
|
|
|
|
* __kmemcg_cache_deactivate().
|
|
|
|
*/
|
|
|
|
void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
|
|
|
|
void (*deact_fn)(struct kmem_cache *))
|
|
|
|
{
|
|
|
|
if (WARN_ON_ONCE(is_root_cache(s)) ||
|
|
|
|
WARN_ON_ONCE(s->memcg_params.deact_fn))
|
|
|
|
return;
|
|
|
|
|
2018-06-15 06:26:27 +08:00
|
|
|
if (s->memcg_params.root_cache->memcg_params.dying)
|
|
|
|
return;
|
|
|
|
|
2017-02-23 07:41:30 +08:00
|
|
|
/* pin memcg so that @s doesn't get destroyed in the middle */
|
|
|
|
css_get(&s->memcg_params.memcg->css);
|
|
|
|
|
|
|
|
s->memcg_params.deact_fn = deact_fn;
|
|
|
|
call_rcu_sched(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn);
|
|
|
|
}
|
|
|
|
|
2015-02-13 06:59:32 +08:00
|
|
|
void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
|
|
|
|
{
|
|
|
|
int idx;
|
|
|
|
struct memcg_cache_array *arr;
|
slub: make dead caches discard free slabs immediately
To speed up further allocations SLUB may store empty slabs in per cpu/node
partial lists instead of freeing them immediately. This prevents per
memcg caches destruction, because kmem caches created for a memory cgroup
are only destroyed after the last page charged to the cgroup is freed.
To fix this issue, this patch resurrects approach first proposed in [1].
It forbids SLUB to cache empty slabs after the memory cgroup that the
cache belongs to was destroyed. It is achieved by setting kmem_cache's
cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
that it would drop frozen empty slabs immediately if cpu_partial = 0.
The runtime overhead is minimal. From all the hot functions, we only
touch relatively cold put_cpu_partial(): we make it call
unfreeze_partials() after freezing a slab that belongs to an offline
memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
partial list on free/alloc, and there can't be allocations from dead
caches, it shouldn't cause any overhead. We do have to disable preemption
for put_cpu_partial() to achieve that though.
The original patch was accepted well and even merged to the mm tree.
However, I decided to withdraw it due to changes happening to the memcg
core at that time. I had an idea of introducing per-memcg shrinkers for
kmem caches, but now, as memcg has finally settled down, I do not see it
as an option, because SLUB shrinker would be too costly to call since SLUB
does not keep free slabs on a separate list. Besides, we currently do not
even call per-memcg shrinkers for offline memcgs. Overall, it would
introduce much more complexity to both SLUB and memcg than this small
patch.
Regarding to SLAB, there's no problem with it, because it shrinks
per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
longer keep entries for offline cgroups in per-memcg arrays (such as
memcg_cache_params->memcg_caches), so we do not have to bother if a
per-memcg cache will be shrunk a bit later than it could be.
[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 06:59:47 +08:00
|
|
|
struct kmem_cache *s, *c;
|
2015-02-13 06:59:32 +08:00
|
|
|
|
|
|
|
idx = memcg_cache_id(memcg);
|
|
|
|
|
slub: make dead caches discard free slabs immediately
To speed up further allocations SLUB may store empty slabs in per cpu/node
partial lists instead of freeing them immediately. This prevents per
memcg caches destruction, because kmem caches created for a memory cgroup
are only destroyed after the last page charged to the cgroup is freed.
To fix this issue, this patch resurrects approach first proposed in [1].
It forbids SLUB to cache empty slabs after the memory cgroup that the
cache belongs to was destroyed. It is achieved by setting kmem_cache's
cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
that it would drop frozen empty slabs immediately if cpu_partial = 0.
The runtime overhead is minimal. From all the hot functions, we only
touch relatively cold put_cpu_partial(): we make it call
unfreeze_partials() after freezing a slab that belongs to an offline
memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
partial list on free/alloc, and there can't be allocations from dead
caches, it shouldn't cause any overhead. We do have to disable preemption
for put_cpu_partial() to achieve that though.
The original patch was accepted well and even merged to the mm tree.
However, I decided to withdraw it due to changes happening to the memcg
core at that time. I had an idea of introducing per-memcg shrinkers for
kmem caches, but now, as memcg has finally settled down, I do not see it
as an option, because SLUB shrinker would be too costly to call since SLUB
does not keep free slabs on a separate list. Besides, we currently do not
even call per-memcg shrinkers for offline memcgs. Overall, it would
introduce much more complexity to both SLUB and memcg than this small
patch.
Regarding to SLAB, there's no problem with it, because it shrinks
per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
longer keep entries for offline cgroups in per-memcg arrays (such as
memcg_cache_params->memcg_caches), so we do not have to bother if a
per-memcg cache will be shrunk a bit later than it could be.
[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 06:59:47 +08:00
|
|
|
get_online_cpus();
|
|
|
|
get_online_mems();
|
|
|
|
|
2015-02-13 06:59:32 +08:00
|
|
|
mutex_lock(&slab_mutex);
|
2017-02-23 07:41:24 +08:00
|
|
|
list_for_each_entry(s, &slab_root_caches, root_caches_node) {
|
2015-02-13 06:59:32 +08:00
|
|
|
arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
|
|
|
|
lockdep_is_held(&slab_mutex));
|
slub: make dead caches discard free slabs immediately
To speed up further allocations SLUB may store empty slabs in per cpu/node
partial lists instead of freeing them immediately. This prevents per
memcg caches destruction, because kmem caches created for a memory cgroup
are only destroyed after the last page charged to the cgroup is freed.
To fix this issue, this patch resurrects approach first proposed in [1].
It forbids SLUB to cache empty slabs after the memory cgroup that the
cache belongs to was destroyed. It is achieved by setting kmem_cache's
cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
that it would drop frozen empty slabs immediately if cpu_partial = 0.
The runtime overhead is minimal. From all the hot functions, we only
touch relatively cold put_cpu_partial(): we make it call
unfreeze_partials() after freezing a slab that belongs to an offline
memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
partial list on free/alloc, and there can't be allocations from dead
caches, it shouldn't cause any overhead. We do have to disable preemption
for put_cpu_partial() to achieve that though.
The original patch was accepted well and even merged to the mm tree.
However, I decided to withdraw it due to changes happening to the memcg
core at that time. I had an idea of introducing per-memcg shrinkers for
kmem caches, but now, as memcg has finally settled down, I do not see it
as an option, because SLUB shrinker would be too costly to call since SLUB
does not keep free slabs on a separate list. Besides, we currently do not
even call per-memcg shrinkers for offline memcgs. Overall, it would
introduce much more complexity to both SLUB and memcg than this small
patch.
Regarding to SLAB, there's no problem with it, because it shrinks
per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
longer keep entries for offline cgroups in per-memcg arrays (such as
memcg_cache_params->memcg_caches), so we do not have to bother if a
per-memcg cache will be shrunk a bit later than it could be.
[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 06:59:47 +08:00
|
|
|
c = arr->entries[idx];
|
|
|
|
if (!c)
|
|
|
|
continue;
|
|
|
|
|
2017-02-23 07:41:27 +08:00
|
|
|
__kmemcg_cache_deactivate(c);
|
2015-02-13 06:59:32 +08:00
|
|
|
arr->entries[idx] = NULL;
|
|
|
|
}
|
|
|
|
mutex_unlock(&slab_mutex);
|
slub: make dead caches discard free slabs immediately
To speed up further allocations SLUB may store empty slabs in per cpu/node
partial lists instead of freeing them immediately. This prevents per
memcg caches destruction, because kmem caches created for a memory cgroup
are only destroyed after the last page charged to the cgroup is freed.
To fix this issue, this patch resurrects approach first proposed in [1].
It forbids SLUB to cache empty slabs after the memory cgroup that the
cache belongs to was destroyed. It is achieved by setting kmem_cache's
cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
that it would drop frozen empty slabs immediately if cpu_partial = 0.
The runtime overhead is minimal. From all the hot functions, we only
touch relatively cold put_cpu_partial(): we make it call
unfreeze_partials() after freezing a slab that belongs to an offline
memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
partial list on free/alloc, and there can't be allocations from dead
caches, it shouldn't cause any overhead. We do have to disable preemption
for put_cpu_partial() to achieve that though.
The original patch was accepted well and even merged to the mm tree.
However, I decided to withdraw it due to changes happening to the memcg
core at that time. I had an idea of introducing per-memcg shrinkers for
kmem caches, but now, as memcg has finally settled down, I do not see it
as an option, because SLUB shrinker would be too costly to call since SLUB
does not keep free slabs on a separate list. Besides, we currently do not
even call per-memcg shrinkers for offline memcgs. Overall, it would
introduce much more complexity to both SLUB and memcg than this small
patch.
Regarding to SLAB, there's no problem with it, because it shrinks
per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
longer keep entries for offline cgroups in per-memcg arrays (such as
memcg_cache_params->memcg_caches), so we do not have to bother if a
per-memcg cache will be shrunk a bit later than it could be.
[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 06:59:47 +08:00
|
|
|
|
|
|
|
put_online_mems();
|
|
|
|
put_online_cpus();
|
2015-02-13 06:59:32 +08:00
|
|
|
}
|
|
|
|
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
|
2014-04-08 06:39:28 +08:00
|
|
|
{
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
struct kmem_cache *s, *s2;
|
2014-04-08 06:39:28 +08:00
|
|
|
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
get_online_cpus();
|
|
|
|
get_online_mems();
|
2014-04-08 06:39:28 +08:00
|
|
|
|
|
|
|
mutex_lock(&slab_mutex);
|
2017-02-23 07:41:21 +08:00
|
|
|
list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
|
|
|
|
memcg_params.kmem_caches_node) {
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
/*
|
|
|
|
* The cgroup is about to be freed and therefore has no charges
|
|
|
|
* left. Hence, all its caches must be empty by now.
|
|
|
|
*/
|
2017-02-23 07:41:14 +08:00
|
|
|
BUG_ON(shutdown_cache(s));
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
}
|
|
|
|
mutex_unlock(&slab_mutex);
|
2014-04-08 06:39:28 +08:00
|
|
|
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
put_online_mems();
|
|
|
|
put_online_cpus();
|
2014-04-08 06:39:28 +08:00
|
|
|
}
|
2015-11-06 10:45:11 +08:00
|
|
|
|
2017-02-23 07:41:14 +08:00
|
|
|
static int shutdown_memcg_caches(struct kmem_cache *s)
|
2015-11-06 10:45:11 +08:00
|
|
|
{
|
|
|
|
struct memcg_cache_array *arr;
|
|
|
|
struct kmem_cache *c, *c2;
|
|
|
|
LIST_HEAD(busy);
|
|
|
|
int i;
|
|
|
|
|
|
|
|
BUG_ON(!is_root_cache(s));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* First, shutdown active caches, i.e. caches that belong to online
|
|
|
|
* memory cgroups.
|
|
|
|
*/
|
|
|
|
arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
|
|
|
|
lockdep_is_held(&slab_mutex));
|
|
|
|
for_each_memcg_cache_index(i) {
|
|
|
|
c = arr->entries[i];
|
|
|
|
if (!c)
|
|
|
|
continue;
|
2017-02-23 07:41:14 +08:00
|
|
|
if (shutdown_cache(c))
|
2015-11-06 10:45:11 +08:00
|
|
|
/*
|
|
|
|
* The cache still has objects. Move it to a temporary
|
|
|
|
* list so as not to try to destroy it for a second
|
|
|
|
* time while iterating over inactive caches below.
|
|
|
|
*/
|
2017-02-23 07:41:17 +08:00
|
|
|
list_move(&c->memcg_params.children_node, &busy);
|
2015-11-06 10:45:11 +08:00
|
|
|
else
|
|
|
|
/*
|
|
|
|
* The cache is empty and will be destroyed soon. Clear
|
|
|
|
* the pointer to it in the memcg_caches array so that
|
|
|
|
* it will never be accessed even if the root cache
|
|
|
|
* stays alive.
|
|
|
|
*/
|
|
|
|
arr->entries[i] = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Second, shutdown all caches left from memory cgroups that are now
|
|
|
|
* offline.
|
|
|
|
*/
|
2017-02-23 07:41:17 +08:00
|
|
|
list_for_each_entry_safe(c, c2, &s->memcg_params.children,
|
|
|
|
memcg_params.children_node)
|
2017-02-23 07:41:14 +08:00
|
|
|
shutdown_cache(c);
|
2015-11-06 10:45:11 +08:00
|
|
|
|
2017-02-23 07:41:17 +08:00
|
|
|
list_splice(&busy, &s->memcg_params.children);
|
2015-11-06 10:45:11 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* A cache being destroyed must be empty. In particular, this means
|
|
|
|
* that all per memcg caches attached to it must be empty too.
|
|
|
|
*/
|
2017-02-23 07:41:17 +08:00
|
|
|
if (!list_empty(&s->memcg_params.children))
|
2015-11-06 10:45:11 +08:00
|
|
|
return -EBUSY;
|
|
|
|
return 0;
|
|
|
|
}
|
2018-06-15 06:26:27 +08:00
|
|
|
|
|
|
|
static void flush_memcg_workqueue(struct kmem_cache *s)
|
|
|
|
{
|
|
|
|
mutex_lock(&slab_mutex);
|
|
|
|
s->memcg_params.dying = true;
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* SLUB deactivates the kmem_caches through call_rcu_sched. Make
|
|
|
|
* sure all registered rcu callbacks have been invoked.
|
|
|
|
*/
|
|
|
|
if (IS_ENABLED(CONFIG_SLUB))
|
|
|
|
rcu_barrier_sched();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
|
|
|
|
* deactivates the memcg kmem_caches through workqueue. Make sure all
|
|
|
|
* previous workitems on workqueue are processed.
|
|
|
|
*/
|
|
|
|
flush_workqueue(memcg_kmem_cache_wq);
|
|
|
|
}
|
2015-11-06 10:45:11 +08:00
|
|
|
#else
|
2017-02-23 07:41:14 +08:00
|
|
|
static inline int shutdown_memcg_caches(struct kmem_cache *s)
|
2015-11-06 10:45:11 +08:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2018-06-15 06:26:27 +08:00
|
|
|
|
|
|
|
static inline void flush_memcg_workqueue(struct kmem_cache *s)
|
|
|
|
{
|
|
|
|
}
|
2018-08-18 06:47:25 +08:00
|
|
|
#endif /* CONFIG_MEMCG_KMEM */
|
2012-07-07 04:25:11 +08:00
|
|
|
|
2014-05-07 03:50:08 +08:00
|
|
|
void slab_kmem_cache_release(struct kmem_cache *s)
|
|
|
|
{
|
2016-02-18 05:11:37 +08:00
|
|
|
__kmem_cache_release(s);
|
2015-02-13 06:59:20 +08:00
|
|
|
destroy_memcg_params(s);
|
2015-02-14 06:36:38 +08:00
|
|
|
kfree_const(s->name);
|
2014-05-07 03:50:08 +08:00
|
|
|
kmem_cache_free(kmem_cache, s);
|
|
|
|
}
|
|
|
|
|
2012-09-05 07:18:33 +08:00
|
|
|
void kmem_cache_destroy(struct kmem_cache *s)
|
|
|
|
{
|
2015-11-06 10:45:11 +08:00
|
|
|
int err;
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
|
2015-09-09 06:00:50 +08:00
|
|
|
if (unlikely(!s))
|
|
|
|
return;
|
|
|
|
|
2018-06-15 06:26:27 +08:00
|
|
|
flush_memcg_workqueue(s);
|
|
|
|
|
2012-09-05 07:18:33 +08:00
|
|
|
get_online_cpus();
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:20 +08:00
|
|
|
get_online_mems();
|
|
|
|
|
2012-09-05 07:18:33 +08:00
|
|
|
mutex_lock(&slab_mutex);
|
2014-04-08 06:39:28 +08:00
|
|
|
|
2012-09-05 07:18:33 +08:00
|
|
|
s->refcount--;
|
2014-04-08 06:39:28 +08:00
|
|
|
if (s->refcount)
|
|
|
|
goto out_unlock;
|
|
|
|
|
2017-02-23 07:41:14 +08:00
|
|
|
err = shutdown_memcg_caches(s);
|
2015-11-06 10:45:11 +08:00
|
|
|
if (!err)
|
2017-02-23 07:41:14 +08:00
|
|
|
err = shutdown_cache(s);
|
2014-04-08 06:39:28 +08:00
|
|
|
|
2015-11-06 10:45:14 +08:00
|
|
|
if (err) {
|
2016-03-18 05:19:47 +08:00
|
|
|
pr_err("kmem_cache_destroy %s: Slab cache still has objects\n",
|
|
|
|
s->name);
|
2015-11-06 10:45:14 +08:00
|
|
|
dump_stack();
|
|
|
|
}
|
2014-04-08 06:39:28 +08:00
|
|
|
out_unlock:
|
|
|
|
mutex_unlock(&slab_mutex);
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 06:11:47 +08:00
|
|
|
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:20 +08:00
|
|
|
put_online_mems();
|
2012-09-05 07:18:33 +08:00
|
|
|
put_online_cpus();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmem_cache_destroy);
|
|
|
|
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:20 +08:00
|
|
|
/**
|
|
|
|
* kmem_cache_shrink - Shrink a cache.
|
|
|
|
* @cachep: The cache to shrink.
|
|
|
|
*
|
|
|
|
* Releases as many slabs as possible for a cache.
|
|
|
|
* To help debugging, a zero exit status indicates all slabs were released.
|
|
|
|
*/
|
|
|
|
int kmem_cache_shrink(struct kmem_cache *cachep)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
get_online_cpus();
|
|
|
|
get_online_mems();
|
mm: kasan: initial memory quarantine implementation
Quarantine isolates freed objects in a separate queue. The objects are
returned to the allocator later, which helps to detect use-after-free
errors.
When the object is freed, its state changes from KASAN_STATE_ALLOC to
KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
instead of being returned to the allocator, therefore every subsequent
access to that object triggers a KASAN error, and the error handler is
able to say where the object has been allocated and deallocated.
When it's time for the object to leave quarantine, its state becomes
KASAN_STATE_FREE and it's returned to the allocator. From now on the
allocator may reuse it for another allocation. Before that happens,
it's still possible to detect a use-after free on that object (it
retains the allocation/deallocation stacks).
When the allocator reuses this object, the shadow is unpoisoned and old
allocation/deallocation stacks are wiped. Therefore a use of this
object, even an incorrect one, won't trigger ASan warning.
Without the quarantine, it's not guaranteed that the objects aren't
reused immediately, that's why the probability of catching a
use-after-free is lower than with quarantine in place.
Quarantine isolates freed objects in a separate queue. The objects are
returned to the allocator later, which helps to detect use-after-free
errors.
Freed objects are first added to per-cpu quarantine queues. When a
cache is destroyed or memory shrinking is requested, the objects are
moved into the global quarantine queue. Whenever a kmalloc call allows
memory reclaiming, the oldest objects are popped out of the global queue
until the total size of objects in quarantine is less than 3/4 of the
maximum quarantine size (which is a fraction of installed physical
memory).
As long as an object remains in the quarantine, KASAN is able to report
accesses to it, so the chance of reporting a use-after-free is
increased. Once the object leaves quarantine, the allocator may reuse
it, in which case the object is unpoisoned and KASAN can't detect
incorrect accesses to it.
Right now quarantine support is only enabled in SLAB allocator.
Unification of KASAN features in SLAB and SLUB will be done later.
This patch is based on the "mm: kasan: quarantine" patch originally
prepared by Dmitry Chernenkov. A number of improvements have been
suggested by Andrey Ryabinin.
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-21 07:59:11 +08:00
|
|
|
kasan_cache_shrink(cachep);
|
2017-02-23 07:41:27 +08:00
|
|
|
ret = __kmem_cache_shrink(cachep);
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:20 +08:00
|
|
|
put_online_mems();
|
|
|
|
put_online_cpus();
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmem_cache_shrink);
|
|
|
|
|
2015-11-06 10:44:59 +08:00
|
|
|
bool slab_is_available(void)
|
2012-07-07 04:25:11 +08:00
|
|
|
{
|
|
|
|
return slab_state >= UP;
|
|
|
|
}
|
2012-10-19 22:20:25 +08:00
|
|
|
|
2012-11-29 00:23:07 +08:00
|
|
|
#ifndef CONFIG_SLOB
|
|
|
|
/* Create a cache during boot when no slab services are available yet */
|
2018-04-06 07:20:33 +08:00
|
|
|
void __init create_boot_cache(struct kmem_cache *s, const char *name,
|
|
|
|
unsigned int size, slab_flags_t flags,
|
|
|
|
unsigned int useroffset, unsigned int usersize)
|
2012-11-29 00:23:07 +08:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
|
|
|
s->name = name;
|
|
|
|
s->size = s->object_size = size;
|
2012-11-29 00:23:16 +08:00
|
|
|
s->align = calculate_alignment(flags, ARCH_KMALLOC_MINALIGN, size);
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 10:50:28 +08:00
|
|
|
s->useroffset = useroffset;
|
|
|
|
s->usersize = usersize;
|
2015-02-13 06:59:20 +08:00
|
|
|
|
|
|
|
slab_init_memcg_params(s);
|
|
|
|
|
2012-11-29 00:23:07 +08:00
|
|
|
err = __kmem_cache_create(s, flags);
|
|
|
|
|
|
|
|
if (err)
|
2018-04-06 07:20:33 +08:00
|
|
|
panic("Creation of kmalloc slab %s size=%u failed. Reason %d\n",
|
2012-11-29 00:23:07 +08:00
|
|
|
name, size, err);
|
|
|
|
|
|
|
|
s->refcount = -1; /* Exempt from merging for now */
|
|
|
|
}
|
|
|
|
|
2018-04-06 07:20:29 +08:00
|
|
|
struct kmem_cache *__init create_kmalloc_cache(const char *name,
|
|
|
|
unsigned int size, slab_flags_t flags,
|
|
|
|
unsigned int useroffset, unsigned int usersize)
|
2012-11-29 00:23:07 +08:00
|
|
|
{
|
|
|
|
struct kmem_cache *s = kmem_cache_zalloc(kmem_cache, GFP_NOWAIT);
|
|
|
|
|
|
|
|
if (!s)
|
|
|
|
panic("Out of memory when creating slab %s\n", name);
|
|
|
|
|
2017-06-11 10:50:47 +08:00
|
|
|
create_boot_cache(s, name, size, flags, useroffset, usersize);
|
2012-11-29 00:23:07 +08:00
|
|
|
list_add(&s->list, &slab_caches);
|
2017-02-23 07:41:24 +08:00
|
|
|
memcg_link_cache(s);
|
2012-11-29 00:23:07 +08:00
|
|
|
s->refcount = 1;
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2018-04-06 07:20:11 +08:00
|
|
|
struct kmem_cache *kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __ro_after_init;
|
2013-01-11 03:12:17 +08:00
|
|
|
EXPORT_SYMBOL(kmalloc_caches);
|
|
|
|
|
|
|
|
#ifdef CONFIG_ZONE_DMA
|
2018-04-06 07:20:11 +08:00
|
|
|
struct kmem_cache *kmalloc_dma_caches[KMALLOC_SHIFT_HIGH + 1] __ro_after_init;
|
2013-01-11 03:12:17 +08:00
|
|
|
EXPORT_SYMBOL(kmalloc_dma_caches);
|
|
|
|
#endif
|
|
|
|
|
2013-01-11 03:14:19 +08:00
|
|
|
/*
|
|
|
|
* Conversion table for small slabs sizes / 8 to the index in the
|
|
|
|
* kmalloc array. This is necessary for slabs < 192 since we have non power
|
|
|
|
* of two cache sizes there. The size of larger slabs can be determined using
|
|
|
|
* fls.
|
|
|
|
*/
|
2018-04-06 07:20:40 +08:00
|
|
|
static u8 size_index[24] __ro_after_init = {
|
2013-01-11 03:14:19 +08:00
|
|
|
3, /* 8 */
|
|
|
|
4, /* 16 */
|
|
|
|
5, /* 24 */
|
|
|
|
5, /* 32 */
|
|
|
|
6, /* 40 */
|
|
|
|
6, /* 48 */
|
|
|
|
6, /* 56 */
|
|
|
|
6, /* 64 */
|
|
|
|
1, /* 72 */
|
|
|
|
1, /* 80 */
|
|
|
|
1, /* 88 */
|
|
|
|
1, /* 96 */
|
|
|
|
7, /* 104 */
|
|
|
|
7, /* 112 */
|
|
|
|
7, /* 120 */
|
|
|
|
7, /* 128 */
|
|
|
|
2, /* 136 */
|
|
|
|
2, /* 144 */
|
|
|
|
2, /* 152 */
|
|
|
|
2, /* 160 */
|
|
|
|
2, /* 168 */
|
|
|
|
2, /* 176 */
|
|
|
|
2, /* 184 */
|
|
|
|
2 /* 192 */
|
|
|
|
};
|
|
|
|
|
2018-04-06 07:20:44 +08:00
|
|
|
static inline unsigned int size_index_elem(unsigned int bytes)
|
2013-01-11 03:14:19 +08:00
|
|
|
{
|
|
|
|
return (bytes - 1) / 8;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the kmem_cache structure that serves a given size of
|
|
|
|
* allocation
|
|
|
|
*/
|
|
|
|
struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
|
|
|
|
{
|
2018-04-06 07:20:40 +08:00
|
|
|
unsigned int index;
|
2013-01-11 03:14:19 +08:00
|
|
|
|
2013-08-02 10:02:42 +08:00
|
|
|
if (unlikely(size > KMALLOC_MAX_SIZE)) {
|
2013-06-11 03:18:00 +08:00
|
|
|
WARN_ON_ONCE(!(flags & __GFP_NOWARN));
|
2013-05-03 23:43:18 +08:00
|
|
|
return NULL;
|
2013-06-11 03:18:00 +08:00
|
|
|
}
|
2013-05-03 23:43:18 +08:00
|
|
|
|
2013-01-11 03:14:19 +08:00
|
|
|
if (size <= 192) {
|
|
|
|
if (!size)
|
|
|
|
return ZERO_SIZE_PTR;
|
|
|
|
|
|
|
|
index = size_index[size_index_elem(size)];
|
|
|
|
} else
|
|
|
|
index = fls(size - 1);
|
|
|
|
|
|
|
|
#ifdef CONFIG_ZONE_DMA
|
2013-02-04 22:46:46 +08:00
|
|
|
if (unlikely((flags & GFP_DMA)))
|
2013-01-11 03:14:19 +08:00
|
|
|
return kmalloc_dma_caches[index];
|
|
|
|
|
|
|
|
#endif
|
|
|
|
return kmalloc_caches[index];
|
|
|
|
}
|
|
|
|
|
2015-06-25 07:55:54 +08:00
|
|
|
/*
|
|
|
|
* kmalloc_info[] is to make slub_debug=,kmalloc-xx option work at boot time.
|
|
|
|
* kmalloc_index() supports up to 2^26=64MB, so the final entry of the table is
|
|
|
|
* kmalloc-67108864.
|
|
|
|
*/
|
2017-02-23 07:41:05 +08:00
|
|
|
const struct kmalloc_info_struct kmalloc_info[] __initconst = {
|
2015-06-25 07:55:54 +08:00
|
|
|
{NULL, 0}, {"kmalloc-96", 96},
|
|
|
|
{"kmalloc-192", 192}, {"kmalloc-8", 8},
|
|
|
|
{"kmalloc-16", 16}, {"kmalloc-32", 32},
|
|
|
|
{"kmalloc-64", 64}, {"kmalloc-128", 128},
|
|
|
|
{"kmalloc-256", 256}, {"kmalloc-512", 512},
|
|
|
|
{"kmalloc-1024", 1024}, {"kmalloc-2048", 2048},
|
|
|
|
{"kmalloc-4096", 4096}, {"kmalloc-8192", 8192},
|
|
|
|
{"kmalloc-16384", 16384}, {"kmalloc-32768", 32768},
|
|
|
|
{"kmalloc-65536", 65536}, {"kmalloc-131072", 131072},
|
|
|
|
{"kmalloc-262144", 262144}, {"kmalloc-524288", 524288},
|
|
|
|
{"kmalloc-1048576", 1048576}, {"kmalloc-2097152", 2097152},
|
|
|
|
{"kmalloc-4194304", 4194304}, {"kmalloc-8388608", 8388608},
|
|
|
|
{"kmalloc-16777216", 16777216}, {"kmalloc-33554432", 33554432},
|
|
|
|
{"kmalloc-67108864", 67108864}
|
|
|
|
};
|
|
|
|
|
2013-01-11 03:12:17 +08:00
|
|
|
/*
|
2015-06-25 07:55:57 +08:00
|
|
|
* Patch up the size_index table if we have strange large alignment
|
|
|
|
* requirements for the kmalloc array. This is only the case for
|
|
|
|
* MIPS it seems. The standard arches will not generate any code here.
|
|
|
|
*
|
|
|
|
* Largest permitted alignment is 256 bytes due to the way we
|
|
|
|
* handle the index determination for the smaller caches.
|
|
|
|
*
|
|
|
|
* Make sure that nothing crazy happens if someone starts tinkering
|
|
|
|
* around with ARCH_KMALLOC_MINALIGN
|
2013-01-11 03:12:17 +08:00
|
|
|
*/
|
2015-06-25 07:55:57 +08:00
|
|
|
void __init setup_kmalloc_cache_index_table(void)
|
2013-01-11 03:12:17 +08:00
|
|
|
{
|
2018-04-06 07:20:44 +08:00
|
|
|
unsigned int i;
|
2013-01-11 03:12:17 +08:00
|
|
|
|
2013-01-11 03:14:19 +08:00
|
|
|
BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
|
|
|
|
(KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1)));
|
|
|
|
|
|
|
|
for (i = 8; i < KMALLOC_MIN_SIZE; i += 8) {
|
2018-04-06 07:20:44 +08:00
|
|
|
unsigned int elem = size_index_elem(i);
|
2013-01-11 03:14:19 +08:00
|
|
|
|
|
|
|
if (elem >= ARRAY_SIZE(size_index))
|
|
|
|
break;
|
|
|
|
size_index[elem] = KMALLOC_SHIFT_LOW;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (KMALLOC_MIN_SIZE >= 64) {
|
|
|
|
/*
|
|
|
|
* The 96 byte size cache is not used if the alignment
|
|
|
|
* is 64 byte.
|
|
|
|
*/
|
|
|
|
for (i = 64 + 8; i <= 96; i += 8)
|
|
|
|
size_index[size_index_elem(i)] = 7;
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
if (KMALLOC_MIN_SIZE >= 128) {
|
|
|
|
/*
|
|
|
|
* The 192 byte sized cache is not used if the alignment
|
|
|
|
* is 128 byte. Redirect kmalloc to use the 256 byte cache
|
|
|
|
* instead.
|
|
|
|
*/
|
|
|
|
for (i = 128 + 8; i <= 192; i += 8)
|
|
|
|
size_index[size_index_elem(i)] = 8;
|
|
|
|
}
|
2015-06-25 07:55:57 +08:00
|
|
|
}
|
|
|
|
|
2017-11-16 09:32:18 +08:00
|
|
|
static void __init new_kmalloc_cache(int idx, slab_flags_t flags)
|
2015-06-29 22:28:08 +08:00
|
|
|
{
|
|
|
|
kmalloc_caches[idx] = create_kmalloc_cache(kmalloc_info[idx].name,
|
2017-06-11 10:50:47 +08:00
|
|
|
kmalloc_info[idx].size, flags, 0,
|
|
|
|
kmalloc_info[idx].size);
|
2015-06-29 22:28:08 +08:00
|
|
|
}
|
|
|
|
|
2015-06-25 07:55:57 +08:00
|
|
|
/*
|
|
|
|
* Create the kmalloc array. Some of the regular kmalloc arrays
|
|
|
|
* may already have been created because they were needed to
|
|
|
|
* enable allocations for slab creation.
|
|
|
|
*/
|
2017-11-16 09:32:18 +08:00
|
|
|
void __init create_kmalloc_caches(slab_flags_t flags)
|
2015-06-25 07:55:57 +08:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2015-06-29 22:28:08 +08:00
|
|
|
for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
|
|
|
|
if (!kmalloc_caches[i])
|
|
|
|
new_kmalloc_cache(i, flags);
|
2013-01-11 03:12:17 +08:00
|
|
|
|
2013-05-09 03:56:28 +08:00
|
|
|
/*
|
2015-06-29 22:28:08 +08:00
|
|
|
* Caches that are not of the two-to-the-power-of size.
|
|
|
|
* These have to be created immediately after the
|
|
|
|
* earlier power of two caches
|
2013-05-09 03:56:28 +08:00
|
|
|
*/
|
2015-06-29 22:28:08 +08:00
|
|
|
if (KMALLOC_MIN_SIZE <= 32 && !kmalloc_caches[1] && i == 6)
|
|
|
|
new_kmalloc_cache(1, flags);
|
|
|
|
if (KMALLOC_MIN_SIZE <= 64 && !kmalloc_caches[2] && i == 7)
|
|
|
|
new_kmalloc_cache(2, flags);
|
2013-05-04 02:04:18 +08:00
|
|
|
}
|
|
|
|
|
2013-01-11 03:12:17 +08:00
|
|
|
/* Kmalloc array is now usable */
|
|
|
|
slab_state = UP;
|
|
|
|
|
|
|
|
#ifdef CONFIG_ZONE_DMA
|
|
|
|
for (i = 0; i <= KMALLOC_SHIFT_HIGH; i++) {
|
|
|
|
struct kmem_cache *s = kmalloc_caches[i];
|
|
|
|
|
|
|
|
if (s) {
|
2018-04-06 07:20:26 +08:00
|
|
|
unsigned int size = kmalloc_size(i);
|
2013-01-11 03:12:17 +08:00
|
|
|
char *n = kasprintf(GFP_NOWAIT,
|
2018-04-06 07:20:26 +08:00
|
|
|
"dma-kmalloc-%u", size);
|
2013-01-11 03:12:17 +08:00
|
|
|
|
|
|
|
BUG_ON(!n);
|
|
|
|
kmalloc_dma_caches[i] = create_kmalloc_cache(n,
|
2017-06-11 10:50:47 +08:00
|
|
|
size, SLAB_CACHE_DMA | flags, 0, 0);
|
2013-01-11 03:12:17 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
}
|
2012-11-29 00:23:07 +08:00
|
|
|
#endif /* !CONFIG_SLOB */
|
|
|
|
|
2014-06-05 07:07:04 +08:00
|
|
|
/*
|
|
|
|
* To avoid unnecessary overhead, we pass through large allocation requests
|
|
|
|
* directly to the page allocator. We use __GFP_COMP, because we will need to
|
|
|
|
* know the allocation order to free the pages properly in kfree.
|
|
|
|
*/
|
2014-06-05 07:06:39 +08:00
|
|
|
void *kmalloc_order(size_t size, gfp_t flags, unsigned int order)
|
|
|
|
{
|
|
|
|
void *ret;
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
flags |= __GFP_COMP;
|
mm: charge/uncharge kmemcg from generic page allocator paths
Currently, to charge a non-slab allocation to kmemcg one has to use
alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
this helper should finally be freed using free_kmem_pages, otherwise it
won't be uncharged.
This API suits its current users fine, but it turns out to be impossible
to use along with page reference counting, i.e. when an allocation is
supposed to be freed with put_page, as it is the case with pipe or unix
socket buffers.
To overcome this limitation, this patch moves charging/uncharging to
generic page allocator paths, i.e. to __alloc_pages_nodemask and
free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
one can use any of the available page allocation functions to get the
allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
just like in case of kmalloc and friends. A charged page will be
automatically uncharged on free.
To make it possible, we need to mark pages charged to kmemcg somehow.
To avoid introducing a new page flag, we make use of page->_mapcount for
marking such pages. Since pages charged to kmemcg are not supposed to
be mapped to userspace, it should work just fine. There are other
(ab)users of page->_mapcount - buddy and balloon pages - but we don't
conflict with them.
In case kmemcg is compiled out or not used at runtime, this patch
introduces no overhead to generic page allocator paths. If kmemcg is
used, it will be plus one gfp flags check on alloc and plus one
page->_mapcount check on free, which shouldn't hurt performance, because
the data accessed are hot.
Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-27 06:24:24 +08:00
|
|
|
page = alloc_pages(flags, order);
|
2014-06-05 07:06:39 +08:00
|
|
|
ret = page ? page_address(page) : NULL;
|
|
|
|
kmemleak_alloc(ret, size, 1, flags);
|
2016-03-26 05:22:02 +08:00
|
|
|
kasan_kmalloc_large(ret, size, flags);
|
2014-06-05 07:06:39 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmalloc_order);
|
|
|
|
|
2013-09-05 00:35:34 +08:00
|
|
|
#ifdef CONFIG_TRACING
|
|
|
|
void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
|
|
|
|
{
|
|
|
|
void *ret = kmalloc_order(size, flags, order);
|
|
|
|
trace_kmalloc(_RET_IP_, ret, size, PAGE_SIZE << order, flags);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmalloc_order_trace);
|
|
|
|
#endif
|
2012-11-29 00:23:07 +08:00
|
|
|
|
2016-07-27 06:21:56 +08:00
|
|
|
#ifdef CONFIG_SLAB_FREELIST_RANDOM
|
|
|
|
/* Randomize a generic freelist */
|
|
|
|
static void freelist_randomize(struct rnd_state *state, unsigned int *list,
|
2018-04-06 07:21:46 +08:00
|
|
|
unsigned int count)
|
2016-07-27 06:21:56 +08:00
|
|
|
{
|
|
|
|
unsigned int rand;
|
2018-04-06 07:21:46 +08:00
|
|
|
unsigned int i;
|
2016-07-27 06:21:56 +08:00
|
|
|
|
|
|
|
for (i = 0; i < count; i++)
|
|
|
|
list[i] = i;
|
|
|
|
|
|
|
|
/* Fisher-Yates shuffle */
|
|
|
|
for (i = count - 1; i > 0; i--) {
|
|
|
|
rand = prandom_u32_state(state);
|
|
|
|
rand %= (i + 1);
|
|
|
|
swap(list[i], list[rand]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Create a random sequence per cache */
|
|
|
|
int cache_random_seq_create(struct kmem_cache *cachep, unsigned int count,
|
|
|
|
gfp_t gfp)
|
|
|
|
{
|
|
|
|
struct rnd_state state;
|
|
|
|
|
|
|
|
if (count < 2 || cachep->random_seq)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
cachep->random_seq = kcalloc(count, sizeof(unsigned int), gfp);
|
|
|
|
if (!cachep->random_seq)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
/* Get best entropy at this stage of boot */
|
|
|
|
prandom_seed_state(&state, get_random_long());
|
|
|
|
|
|
|
|
freelist_randomize(&state, cachep->random_seq, count);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Destroy the per-cache random freelist sequence */
|
|
|
|
void cache_random_seq_destroy(struct kmem_cache *cachep)
|
|
|
|
{
|
|
|
|
kfree(cachep->random_seq);
|
|
|
|
cachep->random_seq = NULL;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_SLAB_FREELIST_RANDOM */
|
|
|
|
|
2017-11-16 09:32:03 +08:00
|
|
|
#if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
|
2013-07-04 08:33:24 +08:00
|
|
|
#ifdef CONFIG_SLAB
|
2018-06-15 06:27:58 +08:00
|
|
|
#define SLABINFO_RIGHTS (0600)
|
2013-07-04 08:33:24 +08:00
|
|
|
#else
|
2018-06-15 06:27:58 +08:00
|
|
|
#define SLABINFO_RIGHTS (0400)
|
2013-07-04 08:33:24 +08:00
|
|
|
#endif
|
|
|
|
|
2014-12-11 07:44:19 +08:00
|
|
|
static void print_slabinfo_header(struct seq_file *m)
|
2012-10-19 22:20:26 +08:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Output format version, so at least we can change it
|
|
|
|
* without _too_ many complaints.
|
|
|
|
*/
|
|
|
|
#ifdef CONFIG_DEBUG_SLAB
|
|
|
|
seq_puts(m, "slabinfo - version: 2.1 (statistics)\n");
|
|
|
|
#else
|
|
|
|
seq_puts(m, "slabinfo - version: 2.1\n");
|
|
|
|
#endif
|
2016-03-18 05:19:47 +08:00
|
|
|
seq_puts(m, "# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>");
|
2012-10-19 22:20:26 +08:00
|
|
|
seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
|
|
|
|
seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
|
|
|
|
#ifdef CONFIG_DEBUG_SLAB
|
2016-03-18 05:19:47 +08:00
|
|
|
seq_puts(m, " : globalstat <listallocs> <maxobjs> <grown> <reaped> <error> <maxfreeable> <nodeallocs> <remotefrees> <alienoverflow>");
|
2012-10-19 22:20:26 +08:00
|
|
|
seq_puts(m, " : cpustat <allochit> <allocmiss> <freehit> <freemiss>");
|
|
|
|
#endif
|
|
|
|
seq_putc(m, '\n');
|
|
|
|
}
|
|
|
|
|
2014-12-11 07:42:16 +08:00
|
|
|
void *slab_start(struct seq_file *m, loff_t *pos)
|
2012-10-19 22:20:25 +08:00
|
|
|
{
|
|
|
|
mutex_lock(&slab_mutex);
|
2017-02-23 07:41:24 +08:00
|
|
|
return seq_list_start(&slab_root_caches, *pos);
|
2012-10-19 22:20:25 +08:00
|
|
|
}
|
|
|
|
|
2013-07-08 08:08:28 +08:00
|
|
|
void *slab_next(struct seq_file *m, void *p, loff_t *pos)
|
2012-10-19 22:20:25 +08:00
|
|
|
{
|
2017-02-23 07:41:24 +08:00
|
|
|
return seq_list_next(p, &slab_root_caches, pos);
|
2012-10-19 22:20:25 +08:00
|
|
|
}
|
|
|
|
|
2013-07-08 08:08:28 +08:00
|
|
|
void slab_stop(struct seq_file *m, void *p)
|
2012-10-19 22:20:25 +08:00
|
|
|
{
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
}
|
|
|
|
|
2012-12-19 06:23:01 +08:00
|
|
|
static void
|
|
|
|
memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
|
|
|
|
{
|
|
|
|
struct kmem_cache *c;
|
|
|
|
struct slabinfo sinfo;
|
|
|
|
|
|
|
|
if (!is_root_cache(s))
|
|
|
|
return;
|
|
|
|
|
2015-02-13 06:59:23 +08:00
|
|
|
for_each_memcg_cache(c, s) {
|
2012-12-19 06:23:01 +08:00
|
|
|
memset(&sinfo, 0, sizeof(sinfo));
|
|
|
|
get_slabinfo(c, &sinfo);
|
|
|
|
|
|
|
|
info->active_slabs += sinfo.active_slabs;
|
|
|
|
info->num_slabs += sinfo.num_slabs;
|
|
|
|
info->shared_avail += sinfo.shared_avail;
|
|
|
|
info->active_objs += sinfo.active_objs;
|
|
|
|
info->num_objs += sinfo.num_objs;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-12-11 07:44:19 +08:00
|
|
|
static void cache_show(struct kmem_cache *s, struct seq_file *m)
|
2012-10-19 22:20:25 +08:00
|
|
|
{
|
2012-10-19 22:20:27 +08:00
|
|
|
struct slabinfo sinfo;
|
|
|
|
|
|
|
|
memset(&sinfo, 0, sizeof(sinfo));
|
|
|
|
get_slabinfo(s, &sinfo);
|
|
|
|
|
2012-12-19 06:23:01 +08:00
|
|
|
memcg_accumulate_slabinfo(s, &sinfo);
|
|
|
|
|
2012-10-19 22:20:27 +08:00
|
|
|
seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
|
2012-12-19 06:23:01 +08:00
|
|
|
cache_name(s), sinfo.active_objs, sinfo.num_objs, s->size,
|
2012-10-19 22:20:27 +08:00
|
|
|
sinfo.objects_per_slab, (1 << sinfo.cache_order));
|
|
|
|
|
|
|
|
seq_printf(m, " : tunables %4u %4u %4u",
|
|
|
|
sinfo.limit, sinfo.batchcount, sinfo.shared);
|
|
|
|
seq_printf(m, " : slabdata %6lu %6lu %6lu",
|
|
|
|
sinfo.active_slabs, sinfo.num_slabs, sinfo.shared_avail);
|
|
|
|
slabinfo_show_stats(m, s);
|
|
|
|
seq_putc(m, '\n');
|
2012-10-19 22:20:25 +08:00
|
|
|
}
|
|
|
|
|
2014-12-11 07:42:16 +08:00
|
|
|
static int slab_show(struct seq_file *m, void *p)
|
2012-12-19 06:23:01 +08:00
|
|
|
{
|
2017-02-23 07:41:24 +08:00
|
|
|
struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node);
|
2012-12-19 06:23:01 +08:00
|
|
|
|
2017-02-23 07:41:24 +08:00
|
|
|
if (p == slab_root_caches.next)
|
2014-12-11 07:42:16 +08:00
|
|
|
print_slabinfo_header(m);
|
2017-02-23 07:41:24 +08:00
|
|
|
cache_show(s, m);
|
2014-12-11 07:44:19 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-11-16 09:32:07 +08:00
|
|
|
void dump_unreclaimable_slab(void)
|
|
|
|
{
|
|
|
|
struct kmem_cache *s, *s2;
|
|
|
|
struct slabinfo sinfo;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Here acquiring slab_mutex is risky since we don't prefer to get
|
|
|
|
* sleep in oom path. But, without mutex hold, it may introduce a
|
|
|
|
* risk of crash.
|
|
|
|
* Use mutex_trylock to protect the list traverse, dump nothing
|
|
|
|
* without acquiring the mutex.
|
|
|
|
*/
|
|
|
|
if (!mutex_trylock(&slab_mutex)) {
|
|
|
|
pr_warn("excessive unreclaimable slab but cannot dump stats\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
pr_info("Unreclaimable slab info:\n");
|
|
|
|
pr_info("Name Used Total\n");
|
|
|
|
|
|
|
|
list_for_each_entry_safe(s, s2, &slab_caches, list) {
|
|
|
|
if (!is_root_cache(s) || (s->flags & SLAB_RECLAIM_ACCOUNT))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
get_slabinfo(s, &sinfo);
|
|
|
|
|
|
|
|
if (sinfo.num_objs > 0)
|
|
|
|
pr_info("%-17s %10luKB %10luKB\n", cache_name(s),
|
|
|
|
(sinfo.active_objs * s->size) / 1024,
|
|
|
|
(sinfo.num_objs * s->size) / 1024);
|
|
|
|
}
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
}
|
|
|
|
|
2017-11-16 09:32:03 +08:00
|
|
|
#if defined(CONFIG_MEMCG)
|
2017-02-23 07:41:21 +08:00
|
|
|
void *memcg_slab_start(struct seq_file *m, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
|
|
|
|
|
|
|
|
mutex_lock(&slab_mutex);
|
|
|
|
return seq_list_start(&memcg->kmem_caches, *pos);
|
|
|
|
}
|
|
|
|
|
|
|
|
void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
|
|
|
|
|
|
|
|
return seq_list_next(p, &memcg->kmem_caches, pos);
|
|
|
|
}
|
|
|
|
|
|
|
|
void memcg_slab_stop(struct seq_file *m, void *p)
|
|
|
|
{
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
}
|
|
|
|
|
2014-12-11 07:44:19 +08:00
|
|
|
int memcg_slab_show(struct seq_file *m, void *p)
|
|
|
|
{
|
2017-02-23 07:41:21 +08:00
|
|
|
struct kmem_cache *s = list_entry(p, struct kmem_cache,
|
|
|
|
memcg_params.kmem_caches_node);
|
2014-12-11 07:44:19 +08:00
|
|
|
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
|
|
|
|
|
2017-02-23 07:41:21 +08:00
|
|
|
if (p == memcg->kmem_caches.next)
|
2014-12-11 07:44:19 +08:00
|
|
|
print_slabinfo_header(m);
|
2017-02-23 07:41:21 +08:00
|
|
|
cache_show(s, m);
|
2014-12-11 07:44:19 +08:00
|
|
|
return 0;
|
2012-12-19 06:23:01 +08:00
|
|
|
}
|
2014-12-11 07:44:19 +08:00
|
|
|
#endif
|
2012-12-19 06:23:01 +08:00
|
|
|
|
2012-10-19 22:20:25 +08:00
|
|
|
/*
|
|
|
|
* slabinfo_op - iterator that generates /proc/slabinfo
|
|
|
|
*
|
|
|
|
* Output layout:
|
|
|
|
* cache-name
|
|
|
|
* num-active-objs
|
|
|
|
* total-objs
|
|
|
|
* object size
|
|
|
|
* num-active-slabs
|
|
|
|
* total-slabs
|
|
|
|
* num-pages-per-slab
|
|
|
|
* + further values on SMP and with statistics enabled
|
|
|
|
*/
|
|
|
|
static const struct seq_operations slabinfo_op = {
|
2014-12-11 07:42:16 +08:00
|
|
|
.start = slab_start,
|
2013-07-08 08:08:28 +08:00
|
|
|
.next = slab_next,
|
|
|
|
.stop = slab_stop,
|
2014-12-11 07:42:16 +08:00
|
|
|
.show = slab_show,
|
2012-10-19 22:20:25 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
static int slabinfo_open(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
return seq_open(file, &slabinfo_op);
|
|
|
|
}
|
|
|
|
|
|
|
|
static const struct file_operations proc_slabinfo_operations = {
|
|
|
|
.open = slabinfo_open,
|
|
|
|
.read = seq_read,
|
|
|
|
.write = slabinfo_write,
|
|
|
|
.llseek = seq_lseek,
|
|
|
|
.release = seq_release,
|
|
|
|
};
|
|
|
|
|
|
|
|
static int __init slab_proc_init(void)
|
|
|
|
{
|
2013-07-04 08:33:24 +08:00
|
|
|
proc_create("slabinfo", SLABINFO_RIGHTS, NULL,
|
|
|
|
&proc_slabinfo_operations);
|
2012-10-19 22:20:25 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
module_init(slab_proc_init);
|
2017-11-16 09:32:03 +08:00
|
|
|
#endif /* CONFIG_SLAB || CONFIG_SLUB_DEBUG */
|
2014-08-07 07:04:44 +08:00
|
|
|
|
|
|
|
static __always_inline void *__do_krealloc(const void *p, size_t new_size,
|
|
|
|
gfp_t flags)
|
|
|
|
{
|
|
|
|
void *ret;
|
|
|
|
size_t ks = 0;
|
|
|
|
|
|
|
|
if (p)
|
|
|
|
ks = ksize(p);
|
|
|
|
|
2015-02-14 06:39:42 +08:00
|
|
|
if (ks >= new_size) {
|
2016-03-26 05:22:02 +08:00
|
|
|
kasan_krealloc((void *)p, new_size, flags);
|
2014-08-07 07:04:44 +08:00
|
|
|
return (void *)p;
|
2015-02-14 06:39:42 +08:00
|
|
|
}
|
2014-08-07 07:04:44 +08:00
|
|
|
|
|
|
|
ret = kmalloc_track_caller(new_size, flags);
|
|
|
|
if (ret && p)
|
|
|
|
memcpy(ret, p, ks);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* __krealloc - like krealloc() but don't free @p.
|
|
|
|
* @p: object to reallocate memory for.
|
|
|
|
* @new_size: how many bytes of memory are required.
|
|
|
|
* @flags: the type of memory to allocate.
|
|
|
|
*
|
|
|
|
* This function is like krealloc() except it never frees the originally
|
|
|
|
* allocated buffer. Use this if you don't want to free the buffer immediately
|
|
|
|
* like, for example, with RCU.
|
|
|
|
*/
|
|
|
|
void *__krealloc(const void *p, size_t new_size, gfp_t flags)
|
|
|
|
{
|
|
|
|
if (unlikely(!new_size))
|
|
|
|
return ZERO_SIZE_PTR;
|
|
|
|
|
|
|
|
return __do_krealloc(p, new_size, flags);
|
|
|
|
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(__krealloc);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* krealloc - reallocate memory. The contents will remain unchanged.
|
|
|
|
* @p: object to reallocate memory for.
|
|
|
|
* @new_size: how many bytes of memory are required.
|
|
|
|
* @flags: the type of memory to allocate.
|
|
|
|
*
|
|
|
|
* The contents of the object pointed to are preserved up to the
|
|
|
|
* lesser of the new and old sizes. If @p is %NULL, krealloc()
|
|
|
|
* behaves exactly like kmalloc(). If @new_size is 0 and @p is not a
|
|
|
|
* %NULL pointer, the object pointed to is freed.
|
|
|
|
*/
|
|
|
|
void *krealloc(const void *p, size_t new_size, gfp_t flags)
|
|
|
|
{
|
|
|
|
void *ret;
|
|
|
|
|
|
|
|
if (unlikely(!new_size)) {
|
|
|
|
kfree(p);
|
|
|
|
return ZERO_SIZE_PTR;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = __do_krealloc(p, new_size, flags);
|
|
|
|
if (ret && p != ret)
|
|
|
|
kfree(p);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(krealloc);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kzfree - like kfree but zero memory
|
|
|
|
* @p: object to free memory of
|
|
|
|
*
|
|
|
|
* The memory of the object @p points to is zeroed before freed.
|
|
|
|
* If @p is %NULL, kzfree() does nothing.
|
|
|
|
*
|
|
|
|
* Note: this function zeroes the whole allocated buffer which can be a good
|
|
|
|
* deal bigger than the requested buffer size passed to kmalloc(). So be
|
|
|
|
* careful when using this function in performance sensitive code.
|
|
|
|
*/
|
|
|
|
void kzfree(const void *p)
|
|
|
|
{
|
|
|
|
size_t ks;
|
|
|
|
void *mem = (void *)p;
|
|
|
|
|
|
|
|
if (unlikely(ZERO_OR_NULL_PTR(mem)))
|
|
|
|
return;
|
|
|
|
ks = ksize(mem);
|
|
|
|
memset(mem, 0, ks);
|
|
|
|
kfree(mem);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kzfree);
|
|
|
|
|
|
|
|
/* Tracepoints definitions. */
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL(kmalloc);
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL(kmalloc_node);
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc_node);
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL(kfree);
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
|
2018-04-06 07:23:57 +08:00
|
|
|
|
|
|
|
int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
|
|
|
|
{
|
|
|
|
if (__should_failslab(s, gfpflags))
|
|
|
|
return -ENOMEM;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
ALLOW_ERROR_INJECTION(should_failslab, ERRNO);
|