License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner, Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files, created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when neither scanner could find any license traces, the file was
considered to have no license information in it, and the top-level
COPYING file license was applied.
For non */uapi/* files that summary was:
SPDX license identifier                               # files
----------------------------------------------------|--------
GPL-2.0                                                 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier                               # files
----------------------------------------------------|--------
GPL-2.0 WITH Linux-syscall-note                           930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL-family license was found in the file, or if it had no licensing
in it (per the prior point). Results summary:
SPDX license identifier                               # files
----------------------------------------------------|--------
GPL-2.0 WITH Linux-syscall-note                           270
GPL-2.0+ WITH Linux-syscall-note                          169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)        21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)        17
LGPL-2.1+ WITH Linux-syscall-note                          15
GPL-1.0+ WITH Linux-syscall-note                           14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)        5
LGPL-2.0+ WITH Linux-syscall-note                           4
LGPL-2.1 WITH Linux-syscall-note                            3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                  3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet by Kate, Philippe and Thomas to determine the SPDX license
identifiers to apply to the source files, with confirmation in some cases
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally Greg ran the script using the .csv files to
generate the patches.
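For reference, this is what the resulting tags look like (my own illustration of the "different comment types" point above, not text from the series): the identifier goes on the first line of the file, using the comment style that matches the file type.

/* SPDX-License-Identifier: GPL-2.0 */
// SPDX-License-Identifier: GPL-2.0
# SPDX-License-Identifier: GPL-2.0

The first form is used for C headers (as in the header shown further below), the second for .c source files, and the third for Makefiles, Kconfig files and scripts.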
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_HUGETLB_H
#define _LINUX_HUGETLB_H

#include <linux/mm_types.h>
#include <linux/mmdebug.h>
#include <linux/fs.h>
#include <linux/hugetlb_inline.h>
#include <linux/cgroup.h>

mm, hugetlb: unify region structure handling
Currently, to track reserved and allocated regions, we use two different
ways depending on the mapping: for MAP_SHARED, we use the address_space's
private_list, while for MAP_PRIVATE, we use a resv_map.
Now, we are preparing to change the coarse-grained lock which protects the
region structure to a fine-grained lock, and this difference hinders that.
So, before changing it, unify region structure handling, consistently
using a resv_map regardless of the kind of mapping.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

#include <linux/list.h>
#include <linux/kref.h>
#include <asm/pgtable.h>

struct ctl_table;
struct user_struct;
struct mmu_gather;

#ifndef is_hugepd
typedef struct { unsigned long pd; } hugepd_t;
#define is_hugepd(hugepd) (0)
#define __hugepd(x) ((hugepd_t) { (x) })
#endif

#ifdef CONFIG_HUGETLB_PAGE

#include <linux/mempolicy.h>
#include <linux/shm.h>
#include <asm/tlbflush.h>

hugepages: fix use after free bug in "quota" handling
hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
general quota handling code, and they don't much resemble its behaviour.
Rather than being about maintaining limits on on-disk block usage by
particular users, they are instead about maintaining limits on in-memory
page usage (including anonymous MAP_PRIVATE copied-on-write pages)
associated with a particular hugetlbfs filesystem instance.
Worse, they work by having callbacks to the hugetlbfs filesystem code from
the low-level page handling code, in particular from free_huge_page().
This is a layering violation in itself, but more importantly, if the
kernel does a get_user_pages() on hugepages (which can happen from KVM
amongst others), then the free_huge_page() can be delayed until after the
associated inode has already been freed. If an unmount occurs at the
wrong time, even the hugetlbfs superblock where the "quota" limits are
stored may have been freed.
Andrew Barry proposed a patch to fix this by having hugepages, instead of
storing a pointer to their address_space and reaching the superblock from
there, store pointers directly to the superblock, bumping the reference
count as appropriate to avoid it being freed.
Andrew Morton rejected that version, however, on the grounds that it made
the existing layering violation worse.
This is a reworked version of Andrew's patch, which removes the extra, and
some of the existing, layering violation. It works by introducing the
concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
finite logical pool of hugepages to allocate from. hugetlbfs now creates
a subpool for each filesystem instance with a page limit set, and a
pointer to the subpool gets added to each allocated hugepage, instead of
the address_space pointer used now. The subpool has its own lifetime and
is only freed once all pages in it _and_ all other references to it (i.e.
superblocks) are gone.
subpools are optional - a NULL subpool pointer is taken by the code to
mean that no subpool limits are in effect.
Previous discussion of this bug found in: "Fix refcounting in hugetlbfs
quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
http://marc.info/?l=linux-mm&m=126928970510627&w=1
v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
alloc_huge_page() - since it already takes the vma, it is not necessary.
Signed-off-by: Andrew Barry <abarry@cray.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
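To make the subpool idea concrete, here is a minimal userspace sketch of the accounting it describes (my own toy model with made-up demo_* names, not the kernel code): a per-filesystem-instance counter, protected by a lock, that refuses any charge which would push usage past the instance's page cap.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct demo_subpool {
        pthread_mutex_t lock;
        long max_pages;         /* cap for this instance, or -1 for no cap */
        long used_pages;        /* pages currently charged to this instance */
};

/* Try to charge 'delta' pages against the subpool's cap. */
static bool demo_subpool_get_pages(struct demo_subpool *sp, long delta)
{
        bool ok = true;

        pthread_mutex_lock(&sp->lock);
        if (sp->max_pages != -1 && sp->used_pages + delta > sp->max_pages)
                ok = false;                     /* would exceed the cap */
        else
                sp->used_pages += delta;
        pthread_mutex_unlock(&sp->lock);
        return ok;
}

/* Return 'delta' pages to the subpool. */
static void demo_subpool_put_pages(struct demo_subpool *sp, long delta)
{
        pthread_mutex_lock(&sp->lock);
        sp->used_pages -= delta;
        pthread_mutex_unlock(&sp->lock);
}

int main(void)
{
        struct demo_subpool sp = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
                .max_pages = 4,
                .used_pages = 0,
        };

        printf("charge 3: %s\n", demo_subpool_get_pages(&sp, 3) ? "ok" : "over cap");
        printf("charge 3: %s\n", demo_subpool_get_pages(&sp, 3) ? "ok" : "over cap");
        demo_subpool_put_pages(&sp, 3);
        return 0;
}

A capless pool (max_pages == -1) simply always succeeds, which mirrors the "no subpool limits are in effect" case described above.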

struct hugepage_subpool {
        spinlock_t lock;
        long count;
hugetlbfs: add minimum size tracking fields to subpool structure
hugetlbfs allocates huge pages from the global pool as needed. Even if
the global pool contains a sufficient number pages for the filesystem size
at mount time, those global pages could be grabbed for some other use. As
a result, filesystem huge page allocations may fail due to lack of pages.
Applications such as a database want to use huge pages for performance
reasons. hugetlbfs filesystem semantics with ownership and modes work
well to manage access to a pool of huge pages. However, the application
would like some reasonable assurance that allocations will not fail due to
a lack of huge pages. At application startup time, the application would
like to configure itself to use a specific number of huge pages. Before
starting, the application can check to make sure that enough huge pages
exist in the system global pools. However, there are no guarantees that
those pages will be available when needed by the application. What the
application wants is exclusive use of a subset of huge pages.
Add a new hugetlbfs mount option 'min_size=<value>' to indicate that the
specified number of pages will be available for use by the filesystem. At
mount time, this number of huge pages will be reserved for exclusive use
of the filesystem. If there is not a sufficient number of free pages, the
mount will fail. As pages are allocated to and freed from the
filesystem, the number of reserved pages is adjusted so that the specified
minimum is maintained.
This patch (of 4):
Add a field to the subpool structure to indicate the minimum number of
huge pages to always be used by this subpool. This minimum count includes
allocated pages as well as reserved pages. If the minimum number of pages
for the subpool have not been allocated, pages are reserved up to this
minimum. An additional field (rsv_hpages) is used to track the number of
pages reserved to meet this minimum size. The hstate pointer in the
subpool is convenient to have when reserving and unreserving the pages.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
        long max_hpages;        /* Maximum huge pages or -1 if no maximum. */
        long used_hpages;       /* Used count against maximum, includes */
                                /* both alloced and reserved pages. */
        struct hstate *hstate;
        long min_hpages;        /* Minimum huge pages or -1 if no minimum. */
        long rsv_hpages;        /* Pages reserved against global pool to */
                                /* satisfy minimum size. */
};

struct resv_map {
        struct kref refs;
        spinlock_t lock;
        struct list_head regions;
        long adds_in_progress;
        struct list_head region_cache;
        long region_cache_count;
#ifdef CONFIG_CGROUP_HUGETLB
        /*
         * On private mappings, the counter to uncharge reservations is stored
         * here. If these fields are 0, then either the mapping is shared, or
         * cgroup accounting is disabled for this resv_map.
         */
        struct page_counter *reservation_counter;
        unsigned long pages_per_hpage;
        struct cgroup_subsys_state *css;
#endif
};

hugetlb_cgroup: add accounting for shared mappings
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.
After a call to region_chg, we charge the appropriate hugetlb_cgroup, and
if successful, we pass on the hugetlb_cgroup info to a follow up
region_add call. When a file_region entry is added to the resv_map via
region_add, we put the pointer to that cgroup in
file_region->reservation_counter. If charging doesn't succeed, we report
the error to the caller, so that the kernel fails the reservation.
On region_del, which is when the hugetlb memory is unreserved, we also
uncharge the file_region->reservation_counter.
[akpm@linux-foundation.org: forward declare struct file_region]
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
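The charge-then-commit flow described above can be sketched in a few lines of userspace C (my own illustration with made-up demo_* names, not the actual hugetlb_cgroup/region_chg/region_add code): charge the counter first, record which counter to uncharge in the region entry, and uncharge it again when the region is deleted.

#include <stdio.h>

struct demo_counter {
        long used;
        long limit;
};

struct demo_region {
        long from, to;
        struct demo_counter *reservation_counter;       /* who to uncharge later */
};

static int demo_charge(struct demo_counter *c, long pages)
{
        if (c->used + pages > c->limit)
                return -1;              /* charge fails -> reservation fails */
        c->used += pages;
        return 0;
}

static void demo_uncharge(struct demo_counter *c, long pages)
{
        c->used -= pages;
}

/* Reserve [from, to): charge first, only then record the region. */
static int demo_region_add(struct demo_region *r, struct demo_counter *c,
                           long from, long to)
{
        if (demo_charge(c, to - from))
                return -1;              /* caller reports the error */
        r->from = from;
        r->to = to;
        r->reservation_counter = c;
        return 0;
}

/* Unreserve: uncharge whoever was recorded at add time. */
static void demo_region_del(struct demo_region *r)
{
        if (r->reservation_counter)
                demo_uncharge(r->reservation_counter, r->to - r->from);
}

int main(void)
{
        struct demo_counter cg = { .used = 0, .limit = 8 };
        struct demo_region r = { 0 };

        if (demo_region_add(&r, &cg, 0, 4) == 0)
                printf("charged %ld pages, used=%ld\n", r.to - r.from, cg.used);
        demo_region_del(&r);
        printf("after del, used=%ld\n", cg.used);
        return 0;
}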

/*
 * Region tracking -- allows tracking of reservations and instantiated pages
 * across the pages in a mapping.
 *
 * The region data structures are embedded into a resv_map and protected
 * by a resv_map's lock. The set of regions within the resv_map represent
 * reservations for huge pages, or huge pages that have already been
 * instantiated within the map. The from and to elements are huge page
 * indices into the associated mapping. from indicates the starting index
 * of the region. to represents the first index past the end of the region.
 *
 * For example, a file region structure with from == 0 and to == 4 represents
 * four huge pages in a mapping. It is important to note that the to element
 * represents the first element past the end of the region. This is used in
 * arithmetic as 4(to) - 0(from) = 4 huge pages in the region.
 *
 * Interval notation of the form [from, to) will be used to indicate that
 * the endpoint from is inclusive and to is exclusive.
 */
struct file_region {
        struct list_head link;
        long from;
        long to;
#ifdef CONFIG_CGROUP_HUGETLB
        /*
         * On shared mappings, each reserved region appears as a struct
         * file_region in resv_map. These fields hold the info needed to
         * uncharge each reservation.
         */
        struct page_counter *reservation_counter;
        struct cgroup_subsys_state *css;
#endif
};
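A small illustration of the [from, to) convention documented above (my own example, not kernel code): each region covers to - from huge pages, so walking a list of non-overlapping regions and summing the differences gives the number of huge pages tracked.

#include <stdio.h>

struct demo_region {
        long from;
        long to;
};

static long demo_region_count(const struct demo_region *regions, int nr)
{
        long pages = 0;
        int i;

        for (i = 0; i < nr; i++)
                pages += regions[i].to - regions[i].from;       /* half-open interval */
        return pages;
}

int main(void)
{
        /* [0, 4) and [6, 9): 4 + 3 = 7 huge pages tracked in total. */
        struct demo_region r[] = { { 0, 4 }, { 6, 9 } };

        printf("%ld\n", demo_region_count(r, 2));
        return 0;
}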

extern struct resv_map *resv_map_alloc(void);
void resv_map_release(struct kref *ref);

extern spinlock_t hugetlb_lock;
extern int hugetlb_max_hstate __read_mostly;
#define for_each_hstate(h) \
        for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)

struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
                                              long min_hpages);
void hugepage_put_subpool(struct hugepage_subpool *spool);

void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
int hugetlb_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *);
int hugetlb_overcommit_handler(struct ctl_table *, int, void *, size_t *,
                               loff_t *);
int hugetlb_treat_movable_handler(struct ctl_table *, int, void *, size_t *,
                                  loff_t *);
int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void *, size_t *,
                                     loff_t *);

hugetlb: derive huge pages nodes allowed from task mempolicy
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:
* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
is produced. This will cause the hugetlb subsystem to use
node_online_map as the "nodes_allowed". This preserves the
behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
a nodemask with the single preferred node will be produced.
"local" policy will NOT track any internode migrations of the
task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
will be used.
* Other than to inform the construction of the nodes_allowed node
mask, the actual mempolicy mode is ignored. That is, all modes
behave like interleave over the resulting nodes_allowed mask
with no "fallback".
See the updated documentation [next patch] for more information
about the implications of this patch.
Examples:
Starting with:
Node 0 HugePages_Total: 0
Node 1 HugePages_Total: 0
Node 2 HugePages_Total: 0
Node 3 HugePages_Total: 0
Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:
sysctl vm.nr_hugepages[_mempolicy]=32
yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 8
Node 3 HugePages_Total: 8
Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.
Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes. So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:
numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
This yields:
Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.
Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:
numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
yields:
Node 0 HugePages_Total: 4
Node 1 HugePages_Total: 4
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8
The 8 huge pages freed were balanced over nodes 0 and 1.
[rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
                         struct page **, struct vm_area_struct **,
                         unsigned long *, unsigned long *, long, unsigned int,
                         int *);

hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
After patch 2 in this series, a process that successfully calls mmap() for
a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
process calls fork(). At that point, the next write fault from the parent
could fail due to COW if the child still has a reference.
We only reserve pages for the parent but a copy must be made to avoid
leaking data from the parent to the child after fork(). Reserves could be
taken for both parent and child at fork time to guarantee faults but if
the mapping is large it is highly likely we will not have sufficient pages
for the reservation, and it is common to fork only to exec() immediately
after. A failure here would be very undesirable.
Note that the current behaviour of mainline with MAP_PRIVATE pages is
pretty bad. The following situation is allowed to occur today.
1. Process calls mmap(MAP_PRIVATE)
2. Process calls mlock() to fault all pages and makes sure it succeeds
3. Process forks()
4. Process writes to MAP_PRIVATE mapping while child still exists
5. If the COW fails at this point, the process gets SIGKILLed even though it
had taken care to ensure the pages existed
This patch improves the situation by guaranteeing the reliability of the
process that successfully calls mmap(). When the parent performs COW, it
will try to satisfy the allocation without using reserves. If that fails
the parent will steal the page leaving any children without a page.
Faults from the child after that point will result in failure. If the
child COW happens first, an attempt will be made to allocate the page
without reserves and the child will get SIGKILLed on failure.
To summarise the new behaviour:
1. If the original mapper performs COW on a private mapping with multiple
references, it will attempt to allocate a hugepage from the pool or
the buddy allocator without using the existing reserves. On fail, VMAs
mapping the same area are traversed and the page being COW'd is unmapped
where found. It will then steal the original page as the last mapper in
the normal way.
2. The VMAs the pages were unmapped from are flagged to note that pages
with data no longer exist. Future no-page faults on those VMAs will
terminate the process as otherwise it would appear that data was corrupted.
A warning is printed to the console that this situation occurred.
3. If the child performs COW first, it will attempt to satisfy the COW
from the pool if there are enough pages or via the buddy allocator if
overcommit is allowed and the buddy allocator can satisfy the request. If
it fails, the child will be killed.
If the pool is large enough, existing applications will not notice that
the reserves were a factor. Existing applications depending on no reserves
being set are unlikely to exist, as for much of the history of hugetlbfs,
pages were prefaulted at mmap(), allocating the pages at that point or
failing the mmap().
[npiggin@suse.de: fix CONFIG_HUGETLB=n build]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
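The scenario above can be exercised from userspace roughly as follows (my own sketch; it uses an anonymous MAP_HUGETLB mapping rather than an explicit hugetlbfs file, and it assumes huge pages are configured, e.g. vm.nr_hugepages > 0, otherwise the mmap() fails): populate a private hugetlb mapping, fork, and let the parent's next write go down the hugetlb copy-on-write path while the child still holds a reference.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define LEN (2UL << 20)                 /* one 2 MB huge page */

int main(void)
{
        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap(MAP_HUGETLB)");
                return 1;
        }

        memset(p, 0xaa, LEN);           /* fault the huge page in before forking */

        pid_t pid = fork();
        if (pid == 0) {                 /* child: keep a reference briefly, then exit */
                sleep(1);
                _exit(0);
        }

        p[0] = 0x55;                    /* parent write -> hugetlb copy-on-write */
        waitpid(pid, NULL, 0);
        printf("parent still sees %#x\n", p[0]);
        munmap(p, LEN);
        return 0;
}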

void unmap_hugepage_range(struct vm_area_struct *,
                          unsigned long, unsigned long, struct page *);

mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily with children some of whom fault and
others which destroy the mapping then it is possible for page tables to
get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log. The final teardown will trigger a
BUG_ON in mm/filemap.c.
This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37. It was probably introduced
in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
look like this;
[ ..........] Lots of bad pmd messages followed by this
[ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[ 127.186778] ------------[ cut here ]------------
[ 127.186781] kernel BUG at mm/filemap.c:134!
[ 127.186782] invalid opcode: 0000 [#1] SMP
[ 127.186783] CPU 7
[ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[ 127.186801]
[ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
[ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[ 127.186821] Stack:
[ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[ 127.186827] Call Trace:
[ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0
[ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50
[ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130
[ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0
[ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230
[ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30
[ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80
[ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360
[ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[ 127.186870] RSP <ffff8804144b5c08>
[ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---
The bug is a race and not always easy to reproduce. To reproduce it I was
doing the following on a single socket I7-based machine with 16G of RAM.
$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done
On my particular machine, it usually triggers within 10 minutes but
enabling debug options can change the timing such that it never hits.
Once the bug is triggered, the machine is in trouble and needs to be
rebooted. The machine will respond but processes accessing proc like "ps
aux" will hang due to the BUG_ON. shutdown will also hang and needs a
hard reset or a sysrq-b.
The basic problem is a race between page table sharing and teardown. For
the most part page table sharing depends on i_mmap_mutex. In some cases,
it is also taking the mm->page_table_lock for the PTE updates but with
shared page tables, it is the i_mmap_mutex that is more important.
Unfortunately it appears to be also insufficient. Consider the following
situation
Process A                               Process B
---------                               ---------
hugetlb_fault                           shmdt
                                        LockWrite(mmap_sem)
                                          do_munmap
                                            unmap_region
                                              unmap_vmas
                                                unmap_single_vma
                                                  unmap_hugepage_range
                                                    Lock(i_mmap_mutex)
                                                    Lock(mm->page_table_lock)
                                                    huge_pmd_unshare/unmap tables <--- (1)
                                                    Unlock(mm->page_table_lock)
                                                    Unlock(i_mmap_mutex)
huge_pte_alloc                          ...
  Lock(i_mmap_mutex)                    ...
  vma_prio_walk, find svma, spte        ...
  Lock(mm->page_table_lock)             ...
  share spte                            ...
  Unlock(mm->page_table_lock)           ...
  Unlock(i_mmap_mutex)                  ...
hugetlb_no_page <--- (2)
                                        free_pgtables
                                          unlink_file_vma
                                          hugetlb_free_pgd_range
                                        remove_vma_list
In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down. The i_mmap_mutex on its own
does not prevent Process A walking Process B's page tables. At (1) above,
the page tables are not shared yet so it unmaps the PMDs. Process A sets
up page table sharing and at (2) faults a new entry. Process B then trips
up on it in free_pgtables.
This patch fixes the problem by adding a new function
__unmap_hugepage_range_final that is only called when the VMA is about to
be destroyed. This function clears VM_MAYSHARE during
unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA
ineligible for sharing and avoids the race. Superficially this looks like
it would then be vulnerable to truncate and madvise issues but hugetlbfs
has its own truncate handlers so does not use unmap_mapping_range() and
does not support madvise(DONTNEED).
This should be treated as a -stable candidate if it is merged.
Test program is as follows. The test case was mostly written by Michal
Hocko with a few minor changes to reproduce this bug.
==== CUT HERE ====
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/time.h>
#include <sys/wait.h>

static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;

unsigned int get_random(unsigned int max)
{
        struct timeval tv;

        gettimeofday(&tv, NULL);
        srandom(tv.tv_usec);
        return random() % max;
}

static void play(void *addr, size_t size)
{
        unsigned char *start = addr,
                      *end = start + size,
                      *a;
        start += get_random(size/2);

        /* we could iterate on huge pages but let's give it more time. */
        for (a = start; a < end; a += 4096)
                *a = 0;
}

int main(int argc, char **argv)
{
        key_t key = IPC_PRIVATE;
        size_t sizeA = nr_huge_page_A * huge_page_size;
        size_t sizeB = nr_huge_page_B * huge_page_size;
        int shmidA, shmidB;
        void *addrA = NULL, *addrB = NULL;
        int nr_children = 300, n = 0;

        if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
                perror("shmget:");
                return 1;
        }

        if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
                perror("shmat");
                return 1;
        }

        if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
                perror("shmget:");
                return 1;
        }

        if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
                perror("shmat");
                return 1;
        }

fork_child:
        switch (fork()) {
        case 0:
                switch (n%3) {
                case 0:
                        play(addrA, sizeA);
                        break;
                case 1:
                        play(addrB, sizeB);
                        break;
                case 2:
                        break;
                }
                break;
        case -1:
                perror("fork:");
                break;
        default:
                if (++n < nr_children)
                        goto fork_child;
                play(addrA, sizeA);
                break;
        }
        shmdt(addrA);
        shmdt(addrB);
        do {
                wait(NULL);
        } while (--n > 0);
        shmctl(shmidA, IPC_RMID, NULL);
        shmctl(shmidB, IPC_RMID, NULL);
        return 0;
}
[akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

void __unmap_hugepage_range_final(struct mmu_gather *tlb,
                                  struct vm_area_struct *vma,
                                  unsigned long start, unsigned long end,
                                  struct page *ref_page);
void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
                            unsigned long start, unsigned long end,
                            struct page *ref_page);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(int, char *);
void hugetlb_show_meminfo(void);
unsigned long hugetlb_total_pages(void);
vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                         unsigned long address, unsigned int flags);
int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
                             struct vm_area_struct *dst_vma,
                             unsigned long dst_addr,
                             unsigned long src_addr,
                             struct page **pagep);
int hugetlb_reserve_pages(struct inode *inode, long from, long to,
                          struct vm_area_struct *vma,
                          vm_flags_t vm_flags);
long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
                             long freed);

mm: migrate: make core migration code aware of hugepage
Currently hugepage migration is available only for soft offlining, but
it's also useful for some other users of page migration (clearly because
users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
So this patchset tries to extend such users to support hugepage migration.
The target of this patchset is to enable hugepage migration for NUMA
related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
memory hotplug.
This patchset does not add hugepage migration for memory compaction,
because users of memory compaction mainly expect to construct thp by
arranging raw pages, and there's little or no need to compact hugepages.
CMA, another user of page migration, can have benefit from hugepage
migration, but is not enabled to support it for now (just because of lack
of testing and expertise in CMA.)
Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
x86_64, or hugepages in architectures like ia64) is not enabled for now
(again, because of lack of testing.)
As for how these are achieved, I extended the API (migrate_pages()) to
handle hugepage (with patch 1 and 2) and adjusted code of each caller to
check and collect movable hugepages (with patch 3-7). Remaining 2 patches
are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is
about making sure that we only migrate pmd-based hugepages. And patch 9
is about choosing appropriate zone for hugepage allocation.
My test is mainly functional one, simply kicking hugepage migration via
each entry point and confirm that migration is done correctly. Test code
is available here:
git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git
And I always run libhugetlbfs test when changing hugetlbfs's code. With
this patchset, no regression was found in the test.
This patch (of 9):
Before enabling each user of page migration to support hugepage,
this patch enables the list of pages for migration to link not only
LRU pages, but also hugepages. As a result, putback_movable_pages()
and migrate_pages() can handle both of LRU pages and hugepages.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
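As an illustration of what this enables from userspace (my own hedged sketch, not code from the series; it assumes a NUMA machine, configured huge pages, and linking with -lnuma for the mbind() wrapper): bind a hugetlb mapping to node 0 and ask the kernel to migrate the already-faulted huge page there.

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20;                 /* one 2 MB huge page */
        unsigned long nodemask = 1UL << 0;      /* node 0 only */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap(MAP_HUGETLB)");
                return 1;
        }
        memset(p, 0, len);                      /* fault the huge page in */

        /* Ask for migration of the existing page to node 0. */
        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
                  MPOL_MF_MOVE) == -1)
                perror("mbind(MPOL_MF_MOVE)");

        munmap(p, len);
        return 0;
}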

bool isolate_huge_page(struct page *page, struct list_head *list);
void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
void hugetlb_fix_reserve_counts(struct inode *inode);
extern struct mutex *hugetlb_fault_mutex_table;
u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);

pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);

hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races. These issues are:
1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
reserve counts and state.
A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2]. However, those patches were reverted starting with [3]
due to locking issues.
To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing. However, during fault
processing we need to lock the page we will be adding. Lock ordering
requires we take page lock before i_mmap_rwsem. Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.
To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages. This is not too invasive as hugetlbfs
processing is done separate from core mm in many places. However, I don't
really like this idea. Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.
The only other way I can think of to address these issues is by catching
all the races. After catching a race, cleanup, backout, retry ... etc,
as needed. This can get really ugly, especially for huge page
reservations. At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races. Any other
suggestions would be welcome.
[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
This patch (of 2):
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with
the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults. This is not the order
specified in the rest of mm code. Handling of hugetlbfs pages is mostly
isolated today. Therefore, we use this alternative locking order for
PageHuge() pages.
  mapping->i_mmap_rwsem
    hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
      page->flags PG_locked (lock_page)
To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.
In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma. A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);

extern int sysctl_hugetlb_shm_group;
extern struct list_head huge_boot_pages;

/* arch callbacks */

pte_t *huge_pte_alloc(struct mm_struct *mm,
                      unsigned long addr, unsigned long sz);
pte_t *huge_pte_offset(struct mm_struct *mm,
                       unsigned long addr, unsigned long sz);
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
                                          unsigned long *start, unsigned long *end);
struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
                              int write);
struct page *follow_huge_pd(struct vm_area_struct *vma,
                            unsigned long address, hugepd_t hpd,
                            int flags, int pdshift);

mm/hugetlb: take page table lock in follow_huge_pmd()
We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing. This
race crashes the kernel, so this patch fixes it by moving the FOLL_GET code
for hugepages into follow_huge_pmd() and taking the page table lock there.
This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.
This patch changes the behavior of follow_huge_pmd() for tail pages so
that tail pages can be pinned/returned. So the caller must be changed to
properly handle the returned tail pages.
We could add similar locking to follow_huge_(addr|pud) for consistency,
but it is not necessary because these functions do not currently support
the FOLL_GET flag, so let's leave that for future development.
Here is the reproducer:
$ cat movepages.c
#include <stdio.h>
#include <stdlib.h>
#include <numaif.h>
#include <numa.h>	/* numa_move_pages(); link with -lnuma */
#include <err.h>	/* err() */
#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
#define PS 0x1000
int main(int argc, char *argv[]) {
int i;
int nr_hp = strtol(argv[1], NULL, 0);
int nr_p = nr_hp * HPS / PS;
int ret;
void **addrs;
int *status;
int *nodes;
pid_t pid;
pid = strtol(argv[2], NULL, 0);
addrs = malloc(sizeof(char *) * nr_p + 1);
status = malloc(sizeof(char *) * nr_p + 1);
nodes = malloc(sizeof(char *) * nr_p + 1);
while (1) {
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 1;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 0;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
}
return 0;
}
$ cat hugepage.c
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>	/* strtol() */
#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
int main(int argc, char *argv[]) {
int nr_hp = strtol(argv[1], NULL, 0);
char *p;
while (1) {
p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (p != (void *)ADDR_INPUT) {
perror("mmap");
break;
}
memset(p, 0, nr_hp * HPS);
munmap(p, nr_hp * HPS);
}
}
$ sysctl vm.nr_hugepages=40
$ ./hugepage 10 &
$ ./movepages 10 $(pgrep -f hugepage)
Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 07:25:22 +08:00
|
|
|
pmd_t *pmd, int flags);
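For reference, a minimal sketch of the locked lookup the commit message above describes; this is illustrative only, not the exact patch.
/*
 * Take the PMD's page table lock, and only grab the reference while the
 * entry is known to still be present, so the page cannot be freed under us.
 */
static struct page *follow_huge_pmd_sketch(struct mm_struct *mm,
					   unsigned long address,
					   pmd_t *pmd, int flags)
{
	spinlock_t *ptl = pmd_lockptr(mm, pmd);
	struct page *page = NULL;

	spin_lock(ptl);
	if (pmd_present(*pmd)) {
		page = pmd_page(*pmd) +
			((address & ~PMD_MASK) >> PAGE_SHIFT);
		if (flags & FOLL_GET)
			get_page(page);
	}
	spin_unlock(ptl);
	return page;
}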
|
2008-07-24 12:27:50 +08:00
|
|
|
struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
|
2015-02-12 07:25:22 +08:00
|
|
|
pud_t *pud, int flags);
|
2017-07-07 06:38:50 +08:00
|
|
|
struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
|
|
|
|
pgd_t *pgd, int flags);
|
|
|
|
|
2005-06-22 08:14:44 +08:00
|
|
|
int pmd_huge(pmd_t pmd);
|
2017-03-09 22:24:07 +08:00
|
|
|
int pud_huge(pud_t pud);
|
2012-11-19 10:14:23 +08:00
|
|
|
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
|
2006-03-22 16:08:50 +08:00
|
|
|
unsigned long address, unsigned long end, pgprot_t newprot);
|
2005-06-22 08:14:44 +08:00
|
|
|
|
2017-07-07 06:38:47 +08:00
|
|
|
bool is_hugetlb_entry_migration(pte_t pte);
|
2018-02-01 08:20:48 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#else /* !CONFIG_HUGETLB_PAGE */
|
|
|
|
|
2008-07-24 12:27:23 +08:00
|
|
|
static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline unsigned long hugetlb_total_pages(void)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-04-02 12:11:05 +08:00
|
|
|
static inline struct address_space *hugetlb_page_mapping_lock_write(
|
|
|
|
struct page *hpage)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2018-10-06 06:51:29 +08:00
|
|
|
static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
|
|
|
|
pte_t *ptep)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void adjust_range_if_pmd_sharing_possible(
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
unsigned long *start, unsigned long *end)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2019-12-01 09:56:40 +08:00
|
|
|
static inline long follow_hugetlb_page(struct mm_struct *mm,
|
|
|
|
struct vm_area_struct *vma, struct page **pages,
|
|
|
|
struct vm_area_struct **vmas, unsigned long *position,
|
|
|
|
unsigned long *nr_pages, long i, unsigned int flags,
|
|
|
|
int *nonblocking)
|
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct page *follow_huge_addr(struct mm_struct *mm,
|
|
|
|
unsigned long address, int write)
|
|
|
|
{
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int copy_hugetlb_page_range(struct mm_struct *dst,
|
|
|
|
struct mm_struct *src, struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-10-16 03:50:22 +08:00
|
|
|
static inline void hugetlb_report_meminfo(struct seq_file *m)
|
|
|
|
{
|
|
|
|
}
|
2019-12-01 09:56:40 +08:00
|
|
|
|
|
|
|
static inline int hugetlb_report_node_meminfo(int nid, char *buf)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-04-30 06:07:48 +08:00
|
|
|
static inline void hugetlb_show_meminfo(void)
|
|
|
|
{
|
|
|
|
}
|
2019-12-01 09:56:40 +08:00
|
|
|
|
|
|
|
static inline struct page *follow_huge_pd(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, hugepd_t hpd, int flags,
|
|
|
|
int pdshift)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct page *follow_huge_pmd(struct mm_struct *mm,
|
|
|
|
unsigned long address, pmd_t *pmd, int flags)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct page *follow_huge_pud(struct mm_struct *mm,
|
|
|
|
unsigned long address, pud_t *pud, int flags)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct page *follow_huge_pgd(struct mm_struct *mm,
|
|
|
|
unsigned long address, pgd_t *pgd, int flags)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int prepare_hugepage_range(struct file *file,
|
|
|
|
unsigned long addr, unsigned long len)
|
|
|
|
{
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int pmd_huge(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int pud_huge(pud_t pud)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int is_hugepage_only_range(struct mm_struct *mm,
|
|
|
|
unsigned long addr, unsigned long len)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void hugetlb_free_pgd_range(struct mmu_gather *tlb,
|
|
|
|
unsigned long addr, unsigned long end,
|
|
|
|
unsigned long floor, unsigned long ceiling)
|
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
|
|
|
|
pte_t *dst_pte,
|
|
|
|
struct vm_area_struct *dst_vma,
|
|
|
|
unsigned long dst_addr,
|
|
|
|
unsigned long src_addr,
|
|
|
|
struct page **pagep)
|
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
|
|
|
|
unsigned long sz)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
2012-08-01 07:42:03 +08:00
|
|
|
|
2013-12-13 09:12:19 +08:00
|
|
|
static inline bool isolate_huge_page(struct page *page, struct list_head *list)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2019-12-01 09:56:40 +08:00
|
|
|
static inline void putback_active_hugepage(struct page *page)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void move_hugetlb_state(struct page *oldpage,
|
|
|
|
struct page *newpage, int reason)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long hugetlb_change_protection(
|
|
|
|
struct vm_area_struct *vma, unsigned long address,
|
|
|
|
unsigned long end, pgprot_t newprot)
|
2012-11-19 10:14:23 +08:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2006-03-22 16:08:50 +08:00
|
|
|
|
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily with children some of whom fault and
others which destroy the mapping then it is possible for page tables to
get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log. The final teardown will trigger a
BUG_ON in mm/filemap.c.
This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37. It was probably introduced
in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
look like this:
[ ..........] Lots of bad pmd messages followed by this
[ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[ 127.186778] ------------[ cut here ]------------
[ 127.186781] kernel BUG at mm/filemap.c:134!
[ 127.186782] invalid opcode: 0000 [#1] SMP
[ 127.186783] CPU 7
[ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[ 127.186801]
[ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
[ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[ 127.186821] Stack:
[ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[ 127.186827] Call Trace:
[ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0
[ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50
[ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130
[ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0
[ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230
[ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30
[ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80
[ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360
[ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[ 127.186870] RSP <ffff8804144b5c08>
[ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---
The bug is a race and not always easy to reproduce. To reproduce it I was
doing the following on a single socket I7-based machine with 16G of RAM.
$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done
On my particular machine, it usually triggers within 10 minutes but
enabling debug options can change the timing such that it never hits.
Once the bug is triggered, the machine is in trouble and needs to be
rebooted. The machine will respond but processes accessing proc like "ps
aux" will hang due to the BUG_ON. shutdown will also hang and needs a
hard reset or a sysrq-b.
The basic problem is a race between page table sharing and teardown. For
the most part page table sharing depends on i_mmap_mutex. In some cases,
it is also taking the mm->page_table_lock for the PTE updates but with
shared page tables, it is the i_mmap_mutex that is more important.
Unfortunately, it also appears to be insufficient. Consider the following
situation:
Process A                                    Process B
---------                                    ---------
hugetlb_fault                                shmdt
                                             LockWrite(mmap_sem)
                                               do_munmap
                                                 unmap_region
                                                   unmap_vmas
                                                     unmap_single_vma
                                                       unmap_hugepage_range
                                                         Lock(i_mmap_mutex)
                                                         Lock(mm->page_table_lock)
                                                         huge_pmd_unshare/unmap tables <--- (1)
                                                         Unlock(mm->page_table_lock)
                                                         Unlock(i_mmap_mutex)
huge_pte_alloc                               ...
  Lock(i_mmap_mutex)                         ...
  vma_prio_walk, find svma, spte             ...
  Lock(mm->page_table_lock)                  ...
  share spte                                 ...
  Unlock(mm->page_table_lock)                ...
  Unlock(i_mmap_mutex)                       ...
  hugetlb_no_page                            <--- (2)
                                             free_pgtables
                                               unlink_file_vma
                                               hugetlb_free_pgd_range
                                             remove_vma_list
In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down. The i_mmap_mutex on its own
does not prevent Process A walking Process B's page tables. At (1) above,
the page tables are not shared yet so it unmaps the PMDs. Process A sets
up page table sharing and at (2) faults a new entry. Process B then trips
up on it in free_pgtables.
This patch fixes the problem by adding a new function
__unmap_hugepage_range_final that is only called when the VMA is about to
be destroyed. This function clears VM_MAYSHARE during
unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA
ineligible for sharing and avoids the race. Superficially this looks like
it would then be vulnerable to truncate and madvise issues but hugetlbfs
has its own truncate handlers so does not use unmap_mapping_range() and
does not support madvise(DONTNEED).
This should be treated as a -stable candidate if it is merged.
Test program is as follows. The test case was mostly written by Michal
Hocko with a few minor changes to reproduce this bug.
==== CUT HERE ====
static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;
unsigned int get_random(unsigned int max)
{
struct timeval tv;
gettimeofday(&tv, NULL);
srandom(tv.tv_usec);
return random() % max;
}
static void play(void *addr, size_t size)
{
unsigned char *start = addr,
*end = start + size,
*a;
start += get_random(size/2);
/* we could iterate on huge pages but let's give it more time. */
for (a = start; a < end; a += 4096)
*a = 0;
}
int main(int argc, char **argv)
{
key_t key = IPC_PRIVATE;
size_t sizeA = nr_huge_page_A * huge_page_size;
size_t sizeB = nr_huge_page_B * huge_page_size;
int shmidA, shmidB;
void *addrA = NULL, *addrB = NULL;
int nr_children = 300, n = 0;
if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
perror("shmget:");
return 1;
}
if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
perror("shmat");
return 1;
}
if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
perror("shmget:");
return 1;
}
if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
perror("shmat");
return 1;
}
fork_child:
switch(fork()) {
case 0:
switch (n%3) {
case 0:
play(addrA, sizeA);
break;
case 1:
play(addrB, sizeB);
break;
case 2:
break;
}
break;
case -1:
perror("fork:");
break;
default:
if (++n < nr_children)
goto fork_child;
play(addrA, sizeA);
break;
}
shmdt(addrA);
shmdt(addrB);
do {
wait(NULL);
} while (--n > 0);
shmctl(shmidA, IPC_RMID, NULL);
shmctl(shmidB, IPC_RMID, NULL);
return 0;
}
[akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 07:46:20 +08:00
|
|
|
static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
|
|
|
|
struct vm_area_struct *vma, unsigned long start,
|
|
|
|
unsigned long end, struct page *ref_page)
|
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
}
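The stub above is only the !CONFIG_HUGETLB_PAGE side; a rough sketch of the real helper described in the commit message above, assuming the caller already holds i_mmap_mutex, could look like this (not the exact patch):
/*
 * Final-teardown variant: unmap as usual, then clear VM_MAYSHARE under the
 * caller-held i_mmap_mutex so this VMA can no longer be picked as a
 * page-table-sharing candidate.
 */
static void __unmap_hugepage_range_final_sketch(struct mmu_gather *tlb,
		struct vm_area_struct *vma, unsigned long start,
		unsigned long end, struct page *ref_page)
{
	__unmap_hugepage_range(tlb, vma, start, end, ref_page);

	/* Future faults must not share page tables with this VMA. */
	vma->vm_flags &= ~VM_MAYSHARE;
}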
|
|
|
|
|
2012-08-01 07:42:03 +08:00
|
|
|
static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
|
|
|
|
struct vm_area_struct *vma, unsigned long start,
|
|
|
|
unsigned long end, struct page *ref_page)
|
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
}
|
2019-12-01 09:56:40 +08:00
|
|
|
|
2019-03-29 11:43:51 +08:00
|
|
|
static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
|
2019-12-01 09:56:40 +08:00
|
|
|
struct vm_area_struct *vma, unsigned long address,
|
|
|
|
unsigned int flags)
|
2019-03-29 11:43:51 +08:00
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
2012-08-01 07:42:03 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif /* !CONFIG_HUGETLB_PAGE */
|
2014-11-06 00:27:40 +08:00
|
|
|
/*
|
|
|
|
* hugepages at page global directory. If arch support
|
|
|
|
* hugepages at pgd level, they need to define this.
|
|
|
|
*/
|
|
|
|
#ifndef pgd_huge
|
|
|
|
#define pgd_huge(x) 0
|
|
|
|
#endif
|
2017-03-09 22:24:07 +08:00
|
|
|
#ifndef p4d_huge
|
|
|
|
#define p4d_huge(x) 0
|
|
|
|
#endif
|
2014-11-06 00:27:40 +08:00
|
|
|
|
|
|
|
#ifndef pgd_write
|
|
|
|
static inline int pgd_write(pgd_t pgd)
|
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2009-09-22 08:03:47 +08:00
|
|
|
#define HUGETLB_ANON_FILE "anon_hugepage"
|
|
|
|
|
2009-09-22 08:03:43 +08:00
|
|
|
enum {
|
|
|
|
/*
|
|
|
|
* The file will be used as a shm file so shmfs accounting rules
|
|
|
|
* apply
|
|
|
|
*/
|
|
|
|
HUGETLB_SHMFS_INODE = 1,
|
2009-09-22 08:03:47 +08:00
|
|
|
/*
|
|
|
|
* The file is being created on the internal vfs mount and shmfs
|
|
|
|
* accounting rules do not apply
|
|
|
|
*/
|
|
|
|
HUGETLB_ANONHUGE_INODE = 2,
|
2009-09-22 08:03:43 +08:00
|
|
|
};
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#ifdef CONFIG_HUGETLBFS
|
|
|
|
struct hugetlbfs_sb_info {
|
|
|
|
long max_inodes; /* inodes allowed */
|
|
|
|
long free_inodes; /* inodes free */
|
|
|
|
spinlock_t stat_lock;
|
2008-07-24 12:27:43 +08:00
|
|
|
struct hstate *hstate;
|
hugepages: fix use after free bug in "quota" handling
hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
general quota handling code, and they don't much resemble its behaviour.
Rather than being about maintaining limits on on-disk block usage by
particular users, they are instead about maintaining limits on in-memory
page usage (including anonymous MAP_PRIVATE copied-on-write pages)
associated with a particular hugetlbfs filesystem instance.
Worse, they work by having callbacks to the hugetlbfs filesystem code from
the low-level page handling code, in particular from free_huge_page().
This is a layering violation in itself, but more importantly, if the
kernel does a get_user_pages() on hugepages (which can happen from KVM
amongst others), then the free_huge_page() can be delayed until after the
associated inode has already been freed. If an unmount occurs at the
wrong time, even the hugetlbfs superblock where the "quota" limits are
stored may have been freed.
Andrew Barry proposed a patch to fix this by having hugepages, instead of
storing a pointer to their address_space and reaching the superblock from
there, store pointers directly to the superblock,
bumping the reference count as appropriate to avoid it being freed.
Andrew Morton rejected that version, however, on the grounds that it made
the existing layering violation worse.
This is a reworked version of Andrew's patch, which removes the extra, and
some of the existing, layering violation. It works by introducing the
concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
finite logical pool of hugepages to allocate from. hugetlbfs now creates
a subpool for each filesystem instance with a page limit set, and a
pointer to the subpool gets added to each allocated hugepage, instead of
the address_space pointer used now. The subpool has its own lifetime and
is only freed once all pages in it _and_ all other references to it (i.e.
superblocks) are gone.
subpools are optional - a NULL subpool pointer is taken by the code to
mean that no subpool limits are in effect.
Previous discussion of this bug found in: "Fix refcounting in hugetlbfs
quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
http://marc.info/?l=linux-mm&m=126928970510627&w=1
v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
alloc_huge_page() - since it already takes the vma, it is not necessary.
Signed-off-by: Andrew Barry <abarry@cray.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 07:34:12 +08:00
|
|
|
struct hugepage_subpool *spool;
|
2017-07-05 23:24:18 +08:00
|
|
|
kuid_t uid;
|
|
|
|
kgid_t gid;
|
|
|
|
umode_t mode;
|
2005-04-17 06:20:36 +08:00
|
|
|
};
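The spool member above points at the per-mount subpool introduced by the commit; a sketch of its shape is shown here, with field names that are assumptions for illustration since the real definition lives in the hugetlb core, outside this excerpt.
/*
 * Illustrative sketch of the subpool object 'spool' points to; a NULL
 * pointer means no per-mount limit applies.
 */
struct hugepage_subpool_sketch {
	spinlock_t lock;
	long count;		/* refs: superblock + pages still in flight */
	long max_hpages;	/* per-mount page limit (the old "quota") */
	long used_hpages;	/* pages currently charged to this mount */
};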
|
|
|
|
|
|
|
|
static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
|
|
|
|
{
|
|
|
|
return sb->s_fs_info;
|
|
|
|
}
|
|
|
|
|
2018-02-01 08:19:22 +08:00
|
|
|
struct hugetlbfs_inode_info {
|
|
|
|
struct shared_policy policy;
|
|
|
|
struct inode vfs_inode;
|
2018-02-01 08:19:25 +08:00
|
|
|
unsigned int seals;
|
2018-02-01 08:19:22 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
|
|
|
|
{
|
|
|
|
return container_of(inode, struct hugetlbfs_inode_info, vfs_inode);
|
|
|
|
}
|
|
|
|
|
2006-03-28 17:56:42 +08:00
|
|
|
extern const struct file_operations hugetlbfs_file_operations;
|
2009-09-28 02:29:37 +08:00
|
|
|
extern const struct vm_operations_struct hugetlb_vm_ops;
|
2013-05-08 07:18:13 +08:00
|
|
|
struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
|
2012-12-12 08:01:34 +08:00
|
|
|
struct user_struct **user, int creat_flags,
|
|
|
|
int page_size_log);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2016-01-15 07:18:51 +08:00
|
|
|
static inline bool is_file_hugepages(struct file *file)
|
2005-04-17 06:20:36 +08:00
|
|
|
{
|
2007-03-02 07:46:08 +08:00
|
|
|
if (file->f_op == &hugetlbfs_file_operations)
|
2016-01-15 07:18:51 +08:00
|
|
|
return true;
|
2007-03-02 07:46:08 +08:00
|
|
|
|
2016-01-15 07:18:51 +08:00
|
|
|
return is_file_shm_hugepages(file);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2020-04-02 12:11:54 +08:00
|
|
|
static inline struct hstate *hstate_inode(struct inode *i)
|
|
|
|
{
|
|
|
|
return HUGETLBFS_SB(i->i_sb)->hstate;
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
#else /* !CONFIG_HUGETLBFS */
|
|
|
|
|
2016-01-15 07:18:51 +08:00
|
|
|
#define is_file_hugepages(file) false
|
2012-03-22 07:34:14 +08:00
|
|
|
static inline struct file *
|
2013-05-08 07:18:13 +08:00
|
|
|
hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
|
|
|
|
struct user_struct **user, int creat_flags,
|
2012-12-12 08:01:34 +08:00
|
|
|
int page_size_log)
|
2009-09-25 05:47:45 +08:00
|
|
|
{
|
|
|
|
return ERR_PTR(-ENOSYS);
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2020-04-02 12:11:54 +08:00
|
|
|
static inline struct hstate *hstate_inode(struct inode *i)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif /* !CONFIG_HUGETLBFS */
|
|
|
|
|
2007-05-07 05:49:00 +08:00
|
|
|
#ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
|
|
|
|
unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
|
|
|
|
unsigned long len, unsigned long pgoff,
|
|
|
|
unsigned long flags);
|
|
|
|
#endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
|
|
|
|
|
2008-07-24 12:27:41 +08:00
|
|
|
#ifdef CONFIG_HUGETLB_PAGE
|
|
|
|
|
2008-07-24 12:27:44 +08:00
|
|
|
#define HSTATE_NAME_LEN 32
|
2008-07-24 12:27:41 +08:00
|
|
|
/* Defines one hugetlb page size */
|
|
|
|
struct hstate {
|
2009-09-22 08:01:22 +08:00
|
|
|
int next_nid_to_alloc;
|
|
|
|
int next_nid_to_free;
|
2008-07-24 12:27:41 +08:00
|
|
|
unsigned int order;
|
|
|
|
unsigned long mask;
|
|
|
|
unsigned long max_huge_pages;
|
|
|
|
unsigned long nr_huge_pages;
|
|
|
|
unsigned long free_huge_pages;
|
|
|
|
unsigned long resv_huge_pages;
|
|
|
|
unsigned long surplus_huge_pages;
|
|
|
|
unsigned long nr_overcommit_huge_pages;
|
2012-08-01 07:42:07 +08:00
|
|
|
struct list_head hugepage_activelist;
|
2008-07-24 12:27:41 +08:00
|
|
|
struct list_head hugepage_freelists[MAX_NUMNODES];
|
|
|
|
unsigned int nr_huge_pages_node[MAX_NUMNODES];
|
|
|
|
unsigned int free_huge_pages_node[MAX_NUMNODES];
|
|
|
|
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
|
2012-08-01 07:42:24 +08:00
|
|
|
#ifdef CONFIG_CGROUP_HUGETLB
|
|
|
|
/* cgroup control files */
|
hugetlb_cgroup: add hugetlb_cgroup reservation counter
These counters will track hugetlb reservations rather than hugetlb memory
faulted in. This patch only adds the counter, following patches add the
charging and uncharging of the counter.
This is patch 1 of an 9 patch series.
Problem:
Currently tasks attempting to reserve more hugetlb memory than is
available get a failure at mmap/shmget time. This is thanks to Hugetlbfs
Reservations [1]. However, if a task attempts to reserve more hugetlb
memory than its hugetlb_cgroup limit allows, the kernel will allow the
mmap/shmget call, but will SIGBUS the task when it attempts to fault in
the excess memory.
We have users hitting their hugetlb_cgroup limits and thus we've been
looking at this failure mode. We'd like to improve this behavior such
that users violating the hugetlb_cgroup limits get an error on mmap/shmget
time, rather than getting SIGBUS'd when they try to fault the excess
memory in. This gives the user an opportunity to fallback more gracefully
to non-hugetlbfs memory for example.
The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time. Thus,
enforcing the hugetlb_cgroup limit only happens at fault time, and the
offending task gets SIGBUS'd.
Proposed Solution:
A new page counter named
'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
slightly different semantics than
'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':
- While usage_in_bytes tracks all *faulted* hugetlb memory,
rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
memory faulted in without a prior reservation.
- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than rsvd.limit_in_bytes, the kernel will fail this
reservation.
This proposal is implemented in this patch series, with tests to verify
functionality and show the usage.
Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to the
existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
duplication with hugetlb_cgroup. Keeping hugetlb related page counters
under hugetlb_cgroup seemed cleaner as well.
2. Instead of adding a new counter, we considered adding a sysctl that
modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
accounting at reservation time rather than fault time. Adding a new
page_counter seems better as userspace could, if it wants, choose to
enforce different cgroups differently: one via limit_in_bytes, and
another via rsvd.limit_in_bytes. This could be very useful if you're
transitioning how hugetlb memory is partitioned on your system one
cgroup at a time, for example. Also, someone may find usage for both
limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
gives them the option to do so.
Testing:
- Added tests passing.
- Used libhugetlbfs for regression testing.
[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 12:11:11 +08:00
|
|
|
struct cftype cgroup_files_dfl[7];
|
|
|
|
struct cftype cgroup_files_legacy[9];
|
2012-08-01 07:42:24 +08:00
|
|
|
#endif
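/*
 * Worked example from the commit message above (2 MB hstate, cgroup v1
 * names, given as an illustration): the new reservation counter appears as
 * hugetlb.2MB.rsvd.{limit,usage,max_usage}_in_bytes alongside the existing
 * fault-based hugetlb.2MB.{limit,usage,max_usage}_in_bytes files.
 */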
|
2008-07-24 12:27:44 +08:00
|
|
|
char name[HSTATE_NAME_LEN];
|
2008-07-24 12:27:41 +08:00
|
|
|
};
|
|
|
|
|
2008-07-24 12:27:52 +08:00
|
|
|
struct huge_bootmem_page {
|
|
|
|
struct list_head list;
|
|
|
|
struct hstate *hstate;
|
|
|
|
};
|
|
|
|
|
2015-09-09 06:01:54 +08:00
|
|
|
struct page *alloc_huge_page(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr, int avoid_reserve);
|
2010-09-08 09:19:33 +08:00
|
|
|
struct page *alloc_huge_page_node(struct hstate *h, int nid);
|
2017-07-11 06:49:11 +08:00
|
|
|
struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
|
|
|
|
nodemask_t *nmask);
|
2018-02-01 08:21:03 +08:00
|
|
|
struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
|
|
|
|
unsigned long address);
|
2019-03-06 07:47:44 +08:00
|
|
|
struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
|
|
|
|
int nid, nodemask_t *nmask);
|
2015-09-09 06:01:50 +08:00
|
|
|
int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
|
|
|
|
pgoff_t idx);
|
2010-09-08 09:19:33 +08:00
|
|
|
|
2008-07-24 12:27:52 +08:00
|
|
|
/* arch callback */
|
2017-07-28 13:01:25 +08:00
|
|
|
int __init __alloc_bootmem_huge_page(struct hstate *h);
|
2008-07-24 12:27:52 +08:00
|
|
|
int __init alloc_bootmem_huge_page(struct hstate *h);
|
|
|
|
|
2008-07-24 12:27:42 +08:00
|
|
|
void __init hugetlb_add_hstate(unsigned order);
|
2020-06-04 07:00:34 +08:00
|
|
|
bool __init arch_hugetlb_valid_size(unsigned long size);
|
2008-07-24 12:27:42 +08:00
|
|
|
struct hstate *size_to_hstate(unsigned long size);
|
|
|
|
|
|
|
|
#ifndef HUGE_MAX_HSTATE
|
|
|
|
#define HUGE_MAX_HSTATE 1
|
|
|
|
#endif
|
|
|
|
|
|
|
|
extern struct hstate hstates[HUGE_MAX_HSTATE];
|
|
|
|
extern unsigned int default_hstate_idx;
|
|
|
|
|
|
|
|
#define default_hstate (hstates[default_hstate_idx])
|
2008-07-24 12:27:41 +08:00
|
|
|
|
|
|
|
static inline struct hstate *hstate_file(struct file *f)
|
|
|
|
{
|
2013-01-24 06:07:38 +08:00
|
|
|
return hstate_inode(file_inode(f));
|
2008-07-24 12:27:41 +08:00
|
|
|
}
|
|
|
|
|
2013-05-08 07:18:13 +08:00
|
|
|
static inline struct hstate *hstate_sizelog(int page_size_log)
|
|
|
|
{
|
|
|
|
if (!page_size_log)
|
|
|
|
return &default_hstate;
|
2014-12-11 07:44:13 +08:00
|
|
|
|
|
|
|
return size_to_hstate(1UL << page_size_log);
|
2013-05-08 07:18:13 +08:00
|
|
|
}
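/*
 * Worked example (illustrative, assuming the x86-64 2 MB size):
 * mmap(MAP_HUGETLB | MAP_HUGE_2MB) encodes page_size_log == 21, so the
 * helper above returns size_to_hstate(1UL << 21), i.e. the 2 MB hstate;
 * page_size_log == 0 falls back to the default hstate.
 */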
|
|
|
|
|
2008-07-24 12:27:43 +08:00
|
|
|
static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
|
2008-07-24 12:27:41 +08:00
|
|
|
{
|
2008-07-24 12:27:43 +08:00
|
|
|
return hstate_file(vma->vm_file);
|
2008-07-24 12:27:41 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long huge_page_size(struct hstate *h)
|
|
|
|
{
|
|
|
|
return (unsigned long)PAGE_SIZE << h->order;
|
|
|
|
}
|
|
|
|
|
2009-01-07 06:38:53 +08:00
|
|
|
extern unsigned long vma_kernel_pagesize(struct vm_area_struct *vma);
|
|
|
|
|
2009-01-07 06:38:54 +08:00
|
|
|
extern unsigned long vma_mmu_pagesize(struct vm_area_struct *vma);
|
|
|
|
|
2008-07-24 12:27:41 +08:00
|
|
|
static inline unsigned long huge_page_mask(struct hstate *h)
|
|
|
|
{
|
|
|
|
return h->mask;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int huge_page_order(struct hstate *h)
|
|
|
|
{
|
|
|
|
return h->order;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned huge_page_shift(struct hstate *h)
|
|
|
|
{
|
|
|
|
return h->order + PAGE_SHIFT;
|
|
|
|
}
|
|
|
|
|
2014-06-05 07:07:08 +08:00
|
|
|
static inline bool hstate_is_gigantic(struct hstate *h)
|
|
|
|
{
|
|
|
|
return huge_page_order(h) >= MAX_ORDER;
|
|
|
|
}
|
|
|
|
|
2008-07-24 12:27:41 +08:00
|
|
|
static inline unsigned int pages_per_huge_page(struct hstate *h)
|
|
|
|
{
|
|
|
|
return 1 << h->order;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int blocks_per_huge_page(struct hstate *h)
|
|
|
|
{
|
|
|
|
return huge_page_size(h) / 512;
|
|
|
|
}
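/*
 * Worked example (illustrative, assuming 4 KB base pages and a 2 MB
 * hstate, i.e. h->order == 9): huge_page_size() is 4096 << 9 = 2 MB,
 * huge_page_mask() is ~(2 MB - 1), pages_per_huge_page() is 512 and
 * blocks_per_huge_page() is 2 MB / 512-byte blocks = 4096.
 */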
|
|
|
|
|
|
|
|
#include <asm/hugetlb.h>
|
|
|
|
|
2020-06-04 07:01:01 +08:00
|
|
|
#ifndef is_hugepage_only_range
|
|
|
|
static inline int is_hugepage_only_range(struct mm_struct *mm,
|
|
|
|
unsigned long addr, unsigned long len)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#define is_hugepage_only_range is_hugepage_only_range
|
|
|
|
#endif
|
|
|
|
|
2020-06-04 07:01:05 +08:00
|
|
|
#ifndef arch_clear_hugepage_flags
|
|
|
|
static inline void arch_clear_hugepage_flags(struct page *page) { }
|
|
|
|
#define arch_clear_hugepage_flags arch_clear_hugepage_flags
|
|
|
|
#endif
|
|
|
|
|
2012-04-02 02:01:34 +08:00
|
|
|
#ifndef arch_make_huge_pte
|
|
|
|
static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
|
|
|
|
struct page *page, int writable)
|
|
|
|
{
|
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2008-07-24 12:27:42 +08:00
|
|
|
static inline struct hstate *page_hstate(struct page *page)
|
|
|
|
{
|
2014-01-24 07:52:54 +08:00
|
|
|
VM_BUG_ON_PAGE(!PageHuge(page), page);
|
2019-09-24 06:34:25 +08:00
|
|
|
return size_to_hstate(page_size(page));
|
2008-07-24 12:27:42 +08:00
|
|
|
}
|
|
|
|
|
2010-10-07 03:45:00 +08:00
|
|
|
static inline unsigned hstate_index_to_shift(unsigned index)
|
|
|
|
{
|
|
|
|
return hstates[index].order + PAGE_SHIFT;
|
|
|
|
}
|
|
|
|
|
2012-08-01 07:42:00 +08:00
|
|
|
static inline int hstate_index(struct hstate *h)
|
|
|
|
{
|
|
|
|
return h - hstates;
|
|
|
|
}
|
|
|
|
|
2013-06-25 21:19:31 +08:00
|
|
|
pgoff_t __basepage_index(struct page *page);
|
|
|
|
|
|
|
|
/* Return page->index in PAGE_SIZE units */
|
|
|
|
static inline pgoff_t basepage_index(struct page *page)
|
|
|
|
{
|
|
|
|
if (!PageCompound(page))
|
|
|
|
return page->index;
|
|
|
|
|
|
|
|
return __basepage_index(page);
|
|
|
|
}
|
|
|
|
|
2017-07-11 06:47:41 +08:00
|
|
|
extern int dissolve_free_huge_page(struct page *page);
|
2016-10-08 08:01:10 +08:00
|
|
|
extern int dissolve_free_huge_pages(unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn);
|
2019-03-06 07:43:51 +08:00
|
|
|
|
2014-06-05 07:05:35 +08:00
|
|
|
#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
|
2019-03-06 07:43:51 +08:00
|
|
|
#ifndef arch_hugetlb_migration_supported
|
|
|
|
static inline bool arch_hugetlb_migration_supported(struct hstate *h)
|
|
|
|
{
|
2017-07-07 06:38:38 +08:00
|
|
|
if ((huge_page_shift(h) == PMD_SHIFT) ||
|
2019-03-06 07:43:48 +08:00
|
|
|
(huge_page_shift(h) == PUD_SHIFT) ||
|
|
|
|
(huge_page_shift(h) == PGDIR_SHIFT))
|
2017-07-07 06:38:38 +08:00
|
|
|
return true;
|
|
|
|
else
|
|
|
|
return false;
|
2019-03-06 07:43:51 +08:00
|
|
|
}
|
|
|
|
#endif
|
2014-06-05 07:05:35 +08:00
|
|
|
#else
|
2019-03-06 07:43:51 +08:00
|
|
|
static inline bool arch_hugetlb_migration_supported(struct hstate *h)
|
|
|
|
{
|
2016-05-21 07:58:01 +08:00
|
|
|
return false;
|
2019-03-06 07:43:51 +08:00
|
|
|
}
|
2014-06-05 07:05:35 +08:00
|
|
|
#endif
|
2019-03-06 07:43:51 +08:00
|
|
|
|
|
|
|
static inline bool hugepage_migration_supported(struct hstate *h)
|
|
|
|
{
|
|
|
|
return arch_hugetlb_migration_supported(h);
|
2013-09-12 05:22:11 +08:00
|
|
|
}
|
2013-09-12 05:22:09 +08:00
|
|
|
|
2019-03-06 07:43:44 +08:00
|
|
|
/*
|
|
|
|
* Movability check is different as compared to migration check.
|
|
|
|
* It determines whether or not a huge page should be placed on
|
|
|
|
* movable zone or not. Movability of any huge page should be
|
|
|
|
* required only if huge page size is supported for migration.
|
|
|
|
* There won't be any reason for the huge page to be movable if
|
|
|
|
* it is not migratable to start with. Also the size of the huge
|
|
|
|
* page should be large enough to be placed under a movable zone
|
|
|
|
* and still feasible enough to be migratable. Just the presence
|
|
|
|
* in movable zone does not make the migration feasible.
|
|
|
|
*
|
|
|
|
* So even though large huge page sizes like the gigantic ones
|
|
|
|
* are migratable, they should not be movable because it's not
|
|
|
|
* feasible to migrate them from movable zone.
|
|
|
|
*/
|
|
|
|
static inline bool hugepage_movable_supported(struct hstate *h)
|
|
|
|
{
|
|
|
|
if (!hugepage_migration_supported(h))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (hstate_is_gigantic(h))
|
|
|
|
return false;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2013-11-15 06:31:02 +08:00
|
|
|
static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
|
|
|
|
struct mm_struct *mm, pte_t *pte)
|
|
|
|
{
|
|
|
|
if (huge_page_size(h) == PMD_SIZE)
|
|
|
|
return pmd_lockptr(mm, (pmd_t *) pte);
|
|
|
|
VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
|
|
|
|
return &mm->page_table_lock;
|
|
|
|
}
|
|
|
|
|
2015-07-18 07:23:37 +08:00
|
|
|
#ifndef hugepages_supported
|
|
|
|
/*
|
|
|
|
* Some platforms decide whether they support huge pages at boot
|
|
|
|
* time. Some of them, such as powerpc, set HPAGE_SHIFT to 0
|
|
|
|
* when there is no such support.
|
|
|
|
*/
|
|
|
|
#define hugepages_supported() (HPAGE_SHIFT != 0)
|
|
|
|
#endif
|
hugetlb: ensure hugepage access is denied if hugepages are not supported
Currently, I am seeing the following when I `mount -t hugetlbfs /none
/dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
related to the fact that hugetlbfs is not correctly setting
itself up in this state:
Unable to handle kernel paging request for data at address 0x00000031
Faulting instruction address: 0xc000000000245710
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
....
In KVM guests on Power, in a guest not backed by hugepages, we see the
following:
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 64 kB
HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
are not supported at boot-time, but this is only checked in
hugetlb_init(). Extract the check to a helper function, and use it in a
few relevant places.
This does make hugetlbfs not supported (not registered at all) in this
environment. I believe this is fine, as there are no valid hugepages
and that won't change at runtime.
[akpm@linux-foundation.org: use pr_info(), per Mel]
[akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-07 03:50:00 +08:00
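/*
 * Sketch of a typical call site for the hugepages_supported() helper
 * defined above (an assumption; the real checks live in hugetlb_init()
 * and the hugetlbfs init path, outside this header):
 *
 *	if (!hugepages_supported())
 *		return -ENOTSUPP;
 */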
|
|
|
|
2015-11-06 10:47:14 +08:00
|
|
|
void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm);
|
|
|
|
|
|
|
|
static inline void hugetlb_count_add(long l, struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
atomic_long_add(l, &mm->hugetlb_usage);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
atomic_long_sub(l, &mm->hugetlb_usage);
|
|
|
|
}
|
2017-07-07 06:39:50 +08:00
|
|
|
|
|
|
|
#ifndef set_huge_swap_pte_at
|
|
|
|
static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
|
|
|
|
pte_t *ptep, pte_t pte, unsigned long sz)
|
|
|
|
{
|
|
|
|
set_huge_pte_at(mm, addr, ptep, pte);
|
|
|
|
}
|
|
|
|
#endif
|
2019-03-06 07:46:37 +08:00
|
|
|
|
|
|
|
#ifndef huge_ptep_modify_prot_start
|
|
|
|
#define huge_ptep_modify_prot_start huge_ptep_modify_prot_start
|
|
|
|
static inline pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr, pte_t *ptep)
|
|
|
|
{
|
|
|
|
return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef huge_ptep_modify_prot_commit
|
|
|
|
#define huge_ptep_modify_prot_commit huge_ptep_modify_prot_commit
|
|
|
|
static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr, pte_t *ptep,
|
|
|
|
pte_t old_pte, pte_t pte)
|
|
|
|
{
|
|
|
|
set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2013-05-08 07:18:13 +08:00
|
|
|
#else /* CONFIG_HUGETLB_PAGE */
|
2008-07-24 12:27:41 +08:00
|
|
|
struct hstate {};
|
2019-07-12 11:54:40 +08:00
|
|
|
|
|
|
|
static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr,
|
|
|
|
int avoid_reserve)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct page *alloc_huge_page_node(struct hstate *h, int nid)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct page *
|
|
|
|
alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, nodemask_t *nmask)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct page *alloc_huge_page_vma(struct hstate *h,
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
unsigned long address)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int __alloc_bootmem_huge_page(struct hstate *h)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct hstate *hstate_file(struct file *f)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct hstate *hstate_sizelog(int page_size_log)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline struct hstate *page_hstate(struct page *page)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long huge_page_size(struct hstate *h)
|
|
|
|
{
|
|
|
|
return PAGE_SIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long huge_page_mask(struct hstate *h)
|
|
|
|
{
|
|
|
|
return PAGE_MASK;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
return PAGE_SIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
return PAGE_SIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int huge_page_order(struct hstate *h)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned int huge_page_shift(struct hstate *h)
|
|
|
|
{
|
|
|
|
return PAGE_SHIFT;
|
|
|
|
}
|
|
|
|
|
2017-07-07 06:38:38 +08:00
|
|
|
static inline bool hstate_is_gigantic(struct hstate *h)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2008-07-27 06:22:27 +08:00
|
|
|
static inline unsigned int pages_per_huge_page(struct hstate *h)
|
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
2017-07-11 06:47:41 +08:00
|
|
|
|
|
|
|
static inline unsigned hstate_index_to_shift(unsigned index)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int hstate_index(struct hstate *h)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2013-06-25 21:19:31 +08:00
|
|
|
|
|
|
|
static inline pgoff_t basepage_index(struct page *page)
|
|
|
|
{
|
|
|
|
return page->index;
|
|
|
|
}
|
2017-07-11 06:47:41 +08:00
|
|
|
|
|
|
|
static inline int dissolve_free_huge_page(struct page *page)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int dissolve_free_huge_pages(unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool hugepage_migration_supported(struct hstate *h)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2013-11-15 06:31:02 +08:00
|
|
|
|
2019-03-06 07:43:44 +08:00
|
|
|
static inline bool hugepage_movable_supported(struct hstate *h)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2013-11-15 06:31:02 +08:00
|
|
|
static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
|
|
|
|
struct mm_struct *mm, pte_t *pte)
|
|
|
|
{
|
|
|
|
return &mm->page_table_lock;
|
|
|
|
}
|
2015-11-06 10:47:14 +08:00
|
|
|
|
|
|
|
static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void hugetlb_count_sub(long l, struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
}
|
2017-07-07 06:39:50 +08:00
|
|
|
|
|
|
|
static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
|
|
|
|
pte_t *ptep, pte_t pte, unsigned long sz)
|
|
|
|
{
|
|
|
|
}
|
2013-05-08 07:18:13 +08:00
|
|
|
#endif /* CONFIG_HUGETLB_PAGE */
|
2008-07-24 12:27:41 +08:00
|
|
|
|
2013-11-15 06:31:02 +08:00
|
|
|
static inline spinlock_t *huge_pte_lock(struct hstate *h,
|
|
|
|
struct mm_struct *mm, pte_t *pte)
|
|
|
|
{
|
|
|
|
spinlock_t *ptl;
|
|
|
|
|
|
|
|
ptl = huge_pte_lockptr(h, mm, pte);
|
|
|
|
spin_lock(ptl);
|
|
|
|
return ptl;
|
|
|
|
}
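/*
 * Typical (illustrative) call pattern for huge_pte_lock() above; the
 * variable names are placeholders:
 *
 *	spinlock_t *ptl = huge_pte_lock(h, mm, ptep);
 *	pte_t entry = huge_ptep_get(ptep);
 *	... examine or update the huge PTE under ptl ...
 *	spin_unlock(ptl);
 */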
|
|
|
|
|
mm: hugetlb: optionally allocate gigantic hugepages using cma
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
at runtime") has added the run-time allocation of gigantic pages.
However, it actually works only in the early stages of system boot,
when the majority of memory is free. After some time the memory gets
fragmented by non-movable pages, so the chances to find a contiguous 1GB
block are getting close to zero. Even dropping caches manually doesn't
help a lot.
At large scale rebooting servers in order to allocate gigantic hugepages
is quite expensive and complex. At the same time keeping some constant
percentage of memory in reserved hugepages even if the workload isn't
using it is a big waste: not all workloads can benefit from using 1 GB
pages.
The following solution can solve the problem:
1) At boot time a dedicated cma area* is reserved. The size is passed
as a kernel argument.
2) Run-time allocations of gigantic hugepages are performed using the
cma allocator and the dedicated cma area
In this case gigantic hugepages can be allocated successfully with a
high probability, however the memory isn't completely wasted if nobody
is using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
etc.
* On a multi-node machine a per-node cma area is allocated on each node.
Subsequent gigantic hugetlb allocations use the first available
NUMA node if the mask isn't specified by the user.
Usage:
1) configure the kernel to allocate a cma area for hugetlb allocations:
pass hugetlb_cma=10G as a kernel argument
2) allocate hugetlb pages as usual, e.g.
echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
If the option isn't enabled or the allocation of the cma area failed,
the current behavior of the system is preserved.
x86 and arm-64 are covered by this patch, other architectures can be
trivially added later.
The patch contains clean-ups and fixes proposed and implemented by Aslan
Bakirov and Randy Dunlap. It also contains ideas and suggestions
proposed by Rik van Riel, Michal Hocko and Mike Kravetz. Thanks!
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Andreas Schaufler <andreas.schaufler@gmx.de>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Aslan Bakirov <aslan@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-11 05:32:45 +08:00
|
|
|
#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
|
|
|
|
extern void __init hugetlb_cma_reserve(int order);
|
|
|
|
extern void __init hugetlb_cma_check(void);
|
|
|
|
#else
|
|
|
|
static inline __init void hugetlb_cma_reserve(int order)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline __init void hugetlb_cma_check(void)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
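/*
 * Illustrative arch-side usage (an assumption; the calls live in arch
 * setup code, not in this header): architectures that support gigantic
 * pages reserve the CMA area early with the gigantic page order, e.g.
 *
 *	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
 *
 * and call hugetlb_cma_check() once hugetlb init is done so an unused
 * hugetlb_cma= reservation gets reported.
 */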
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif /* _LINUX_HUGETLB_H */
|