License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was
not immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                               # files
-----------------------------------------------------|-------
GPL-2.0                                                 11139
and resulted in the first patch in this series.
If that file was on a */uapi/* path, it was tagged "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results were:
SPDX license identifier                               # files
-----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                           930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                               # files
-----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                           270
GPL-2.0+ WITH Linux-syscall-note                          169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)        21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)        17
LGPL-2.1+ WITH Linux-syscall-note                          15
GPL-1.0+ WITH Linux-syscall-note                           14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)        5
LGPL-2.0+ WITH Linux-syscall-note                           4
LGPL-2.1 WITH Linux-syscall-note                            3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                  3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
/* SPDX-License-Identifier: GPL-2.0 */
2005-04-17 06:20:36 +08:00
#ifndef _LINUX_PAGEMAP_H
#define _LINUX_PAGEMAP_H

/*
 * Copyright 1995 Linus Torvalds
 */
#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/list.h>
#include <linux/highmem.h>
#include <linux/compiler.h>
2016-12-25 03:46:01 +08:00
#include <linux/uaccess.h>
2005-04-17 06:20:36 +08:00
#include <linux/gfp.h>
2007-05-08 15:23:25 +08:00
#include <linux/bitops.h>
2008-07-26 10:45:30 +08:00
#include <linux/hardirq.h> /* for in_interrupt() */
2010-05-28 08:29:15 +08:00
#include <linux/hugetlb_inline.h>
2005-04-17 06:20:36 +08:00

2021-12-08 03:15:07 +08:00
struct folio_batch;
2017-11-16 09:37:33 +08:00

2021-05-05 09:32:45 +08:00
static inline bool mapping_empty(struct address_space *mapping)
{
	return xa_empty(&mapping->i_pages);
}

vfs: keep inodes with page cache off the inode shrinker LRU
Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache. This caused
problems on highmem hosts: struct inode could fill up lowmem zones
before the cache was getting reclaimed in the highmem zones.
To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem. However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a large
git tree. It further doesn't respect cgroup memory protection settings
and can cause priority inversions between containers.
Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults. We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior. We also use it to quantify and report workload health through
psi. The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.
The consequence of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.
To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
attached pages"")).
The issue is mostly that shrinker pools attract pressure based on their
size, and when objects get skipped the shrinkers remember this as
deferred reclaim work. This accumulates excessive pressure on the
remaining inodes, and we can quickly eat into heavily used ones, or
dirty ones that require IO to reclaim, when there potentially is plenty
of cold, clean cache around still.
Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would. An otherwise
clean and unused inode then gets queued when the last cache entry
disappears. This solves the problem without reintroducing the reclaim
issues, and generally is a bit more scalable than having to wade through
potentially hundreds of thousands of busy inodes.
Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably. Additions may race with
deletions, but we'll check again in the shrinker. If additions race
with the shrinker itself, we're protected by the i_lock: if find_inode()
or iput() win, the shrinker will bail on the elevated i_count or
I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
will set I_FREEING and inhibit further igets(), which will cause the
other side to create a new instance of the inode instead.
Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09 10:31:24 +08:00
/*
 * mapping_shrinkable - test if page cache state allows inode reclaim
 * @mapping: the page cache mapping
 *
 * This checks the mapping's cache state for the purpose of inode
 * reclaim and LRU management.
 *
 * The caller is expected to hold the i_lock, but is not required to
 * hold the i_pages lock, which usually protects cache state. That's
 * because the i_lock and the list_lru lock that protect the inode and
 * its LRU state don't nest inside the irq-safe i_pages lock.
 *
 * Cache deletions are performed under the i_lock, which ensures that
 * when an inode goes empty, it will reliably get queued on the LRU.
 *
 * Cache additions do not acquire the i_lock and may race with this
 * check, in which case we'll report the inode as shrinkable when it
 * has cache pages. This is okay: the shrinker also checks the
 * refcount and the referenced bit, which will be elevated or set in
 * the process of adding new cache pages to an inode.
 */
static inline bool mapping_shrinkable(struct address_space *mapping)
{
	void *head;

	/*
	 * On highmem systems, there could be lowmem pressure from the
	 * inodes before there is highmem pressure from the page
	 * cache. Make inodes shrinkable regardless of cache state.
	 */
	if (IS_ENABLED(CONFIG_HIGHMEM))
		return true;

	/* Cache completely empty? Shrink away. */
	head = rcu_access_pointer(mapping->i_pages.xa_head);
	if (!head)
		return true;

	/*
	 * The xarray stores single offset-0 entries directly in the
	 * head pointer, which allows non-resident page cache entries
	 * to escape the shadow shrinker's list of xarray nodes. The
	 * inode shrinker needs to pick them up under memory pressure.
	 */
	if (!xa_is_node(head) && xa_is_value(head))
		return true;

	return false;
}
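
The final test in mapping_shrinkable() depends on how the xarray tags the
pointer stored in xa_head. The userspace sketch below models only that
decision. The model_* names are hypothetical, and the tagging rules (value
entries have bit 0 set; node pointers are internal entries whose low two
bits are 10 and which lie above 4096) are a simplified assumption based on
include/linux/xarray.h, not the kernel implementation itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: value (shadow) entries are tagged by setting bit 0. */
static bool model_is_value(const void *entry)
{
	return ((uintptr_t)entry & 1) != 0;
}

/* Assumption: node pointers are internal entries (low bits 10) above 4096. */
static bool model_is_node(const void *entry)
{
	return ((uintptr_t)entry & 3) == 2 && (uintptr_t)entry > 4096;
}

/* Hypothetical helper: encode an integer as a value entry. */
static void *model_mk_value(unsigned long v)
{
	return (void *)(((uintptr_t)v << 1) | 1);
}

/* Mirrors the non-highmem part of mapping_shrinkable(). */
static bool model_mapping_shrinkable(const void *head)
{
	if (!head)
		return true;	/* cache completely empty */
	if (!model_is_node(head) && model_is_value(head))
		return true;	/* lone shadow entry stored in the head */
	return false;		/* resident page or a tree of entries */
}
```

In this model an empty head and a lone shadow entry both leave the inode
shrinkable, while an ordinary aligned pointer (a resident page at offset 0)
or a node pointer does not.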

2005-04-17 06:20:36 +08:00
/*
2016-10-12 04:56:04 +08:00
 * Bits in mapping->flags.
2005-04-17 06:20:36 +08:00
 */
2009-04-03 07:56:45 +08:00
enum mapping_flags {
2016-10-12 04:56:04 +08:00
	AS_EIO		= 0,	/* IO error on async write */
	AS_ENOSPC	= 1,	/* ENOSPC on async write */
	AS_MM_ALL_LOCKS	= 2,	/* under mm_take_all_locks() */
	AS_UNEVICTABLE	= 3,	/* e.g., ramdisk, SHM_LOCK */
	AS_EXITING	= 4,	/* final truncate in progress */
mm: don't use radix tree writeback tags for pages in swap cache
File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
etc.) to accelerate finding the pages with a specific tag in the radix
tree during inode writeback. But for anonymous pages in the swap cache,
there is no inode writeback. So there is no need to find the pages with
some writeback tags in the radix tree. It is not necessary to touch
radix tree writeback tags for pages in the swap cache.
Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
introduced for address spaces which don't need to update the writeback
tags. The flag is set for swap caches. It may be used for DAX file
systems, etc.
With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
The test is done on a Xeon E5 v3 system. The swap device used is a RAM
simulated PMEM (persistent memory) device. The improvement comes from
the reduced contention on the swap cache radix tree lock. To test
sequential swapping out, the test case uses 8 processes, which
sequentially allocate and write to the anonymous pages until RAM and
part of the swap device is used up.
Details of the comparison are as follows:
            base                 base+patch
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   2506952 ±  2%     +28.1%    3212076 ±  7%  vm-scalability.throughput
   1207402 ±  7%     +22.3%    1476578 ±  6%  vmstat.swap.so
     10.86 ± 12%     -23.4%       8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
     10.82 ± 13%     -33.1%       7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
     10.36 ± 11%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
     10.52 ± 12%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page
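
The %change column can be reproduced from the base and base+patch values.
A minimal sketch, assuming the column is the plain relative difference;
percent_change() is a hypothetical helper name, not part of the test tool:

```c
#include <assert.h>

/* Relative change from base to patched, expressed in percent. */
static double percent_change(double base, double patched)
{
	return (patched - base) / base * 100.0;
}
```

Applied to the vmstat.swap.so row (1207402 and 1476578), this gives
roughly 22.3, which matches the swap out bandwidth improvement quoted in
the text.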
Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08 07:59:30 +08:00
	/* writeback related tags are not used */
2016-10-12 04:56:04 +08:00
	AS_NO_WRITEBACK_TAGS = 5,
2021-08-29 18:28:19 +08:00
	AS_LARGE_FOLIO_SUPPORT = 6,
2009-04-03 07:56:45 +08:00
};
2005-04-17 06:20:36 +08:00

2017-07-06 19:02:26 +08:00
/**
 * mapping_set_error - record a writeback error in the address_space
2020-04-02 12:07:55 +08:00
 * @mapping: the mapping in which an error should be set
 * @error: the error to set in the mapping
2017-07-06 19:02:26 +08:00
 *
 * When writeback fails in some way, we must record that error so that
 * userspace can be informed when fsync and the like are called.  We endeavor
 * to report errors on any file that was open at the time of the error.  Some
 * internal callers also need to know when writeback errors have occurred.
 *
 * When a writeback error occurs, most filesystems will want to call
 * mapping_set_error to record the error in the mapping so that it can be
 * reported when the application calls fsync(2).
 */
2007-05-08 15:23:25 +08:00
static inline void mapping_set_error(struct address_space *mapping, int error)
{
2017-07-06 19:02:26 +08:00
	if (likely(!error))
		return;

	/* Record in wb_err for checkers using errseq_t based tracking */
vfs: track per-sb writeback errors and report them to syncfs
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.
Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.
The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.
This patch (of 2):
Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.
Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect. If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error. syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.
It would be better if syncfs reported an error if there were any
writeback failures. Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.
This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.
To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor. This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.
An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.
I think that API is just too weird though. This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
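
From an application's point of view, the behaviour described above means
a single syncfs(2) call can be checked for writeback errors across the
whole filesystem. The sketch below shows one way a caller might do that on
Linux with glibc; sync_filesystem_of() is a hypothetical helper name, and
whether an error is actually reported depends on running a kernel that
contains this change.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush the filesystem that contains `path` and report the result.
 * Returns 0 on success or a negative errno value.
 */
static int sync_filesystem_of(const char *path)
{
	int fd = open(path, O_RDONLY);
	int ret = 0;

	if (fd < 0)
		return -errno;
	if (syncfs(fd) < 0)
		ret = -errno;	/* for example -EIO or -ENOSPC */
	close(fd);
	return ret;
}
```

If this returns an error, the caller can then fsync() its individual file
descriptors to find out which file was affected, as the commit message
suggests.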
2020-06-02 12:45:36 +08:00
	__filemap_set_wb_err(mapping, error);

	/* Record it in superblock */
2020-10-11 14:16:37 +08:00
	if (mapping->host)
		errseq_set(&mapping->host->i_sb->s_wb_err, error);
2017-07-06 19:02:26 +08:00

	/* Record it in flags for now, for legacy callers */
	if (error == -ENOSPC)
		set_bit(AS_ENOSPC, &mapping->flags);
	else
		set_bit(AS_EIO, &mapping->flags);
2007-05-08 15:23:25 +08:00
}
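
The legacy branch at the end of mapping_set_error() folds every error into
one of two flag bits, so any error other than -ENOSPC is later seen as an
I/O error. The userspace sketch below models just that part with C11
atomics. All model_* and MODEL_* names are hypothetical, and
atomic_fetch_or() merely stands in for the kernel's set_bit(); the
test-and-clear reader is an assumption about how a legacy caller might
consume the bits.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

enum model_flags { MODEL_AS_EIO = 0, MODEL_AS_ENOSPC = 1 };

struct model_mapping {
	atomic_ulong flags;
};

/* Atomically set bit nr, standing in for the kernel's set_bit(). */
static void model_set_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_or(addr, 1UL << nr);
}

/* Legacy part of mapping_set_error(): fold the error into a flag bit. */
static void model_set_error(struct model_mapping *m, int error)
{
	if (!error)
		return;
	if (error == -ENOSPC)
		model_set_bit(MODEL_AS_ENOSPC, &m->flags);
	else
		model_set_bit(MODEL_AS_EIO, &m->flags);
}

/* Hypothetical reader: test a flag and clear it in one atomic step. */
static int model_test_and_clear(struct model_mapping *m, int nr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_and(&m->flags, ~mask) & mask) != 0;
}
```

Because the bits carry no sequence information, a reader that clears them
hides the error from every later reader, which is the limitation the
errseq_t based tracking is meant to address.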

2008-10-19 11:26:42 +08:00
static inline void mapping_set_unevictable(struct address_space *mapping)
{
	set_bit(AS_UNEVICTABLE, &mapping->flags);
}

2008-10-19 11:26:43 +08:00
static inline void mapping_clear_unevictable(struct address_space *mapping)
{
	clear_bit(AS_UNEVICTABLE, &mapping->flags);
}

mm: swap: make page_evictable() inline
When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping
pagevecs") to our 4.9 kernel, our test bench showed a regression of around
10% in a couple of vm-scalability test cases (lru-file-readonce,
lru-file-readtwice and lru-file-mmap-read). I didn't see that large a
regression on my VM (32c-64g-2nodes). It might be caused by the test
configuration, which is 32c-256g with NUMA disabled and the tests run in
the root memcg, so the tests actually stress only one inactive and one
active lru. That is not very common in modern production environments.
That commit made two major changes:
1. Call page_evictable()
2. Use smp_mb to force the PG_lru set visible
These appear to contribute most of the overhead. page_evictable() is an
out-of-line function with a prologue and epilogue, and it used to be
called from the page reclaim path only. However, lru add is a very hot
path, so it is better to make it inline. It calls page_mapping(), which
is not inlined either, but the disassembly shows it doesn't do push and
pop operations, and inlining it is not very straightforward.
Other than this, smp_mb() appears unnecessary on x86 since SetPageLRU is
atomic, which already enforces a memory barrier; it is replaced with
smp_mb__after_atomic() in the following patch.
With the two fixes applied, the tests recover around 5% on that test
bench and return to normal on my VM. Since the test bench configuration
is not that common, and I also saw around 6% improvement on the latest
upstream, this seems good enough IMHO.
Below is the test data (lru-file-readtwice throughput) against v5.6-rc4:
  mainline        w/ inline fix
  150MB           154MB
With this patch the throughput improves by 2.67%. The data with
smp_mb__after_atomic() is shown in the following patch.
Shakeel Butt ran the test below:
On a real machine, with 'dd' limited to a single node and reading a 100
GiB sparse file (less than a single node). Just a single instance was run
to avoid lru lock contention. The cmdline used is "dd if=file-100GiB
of=/dev/null bs=4k". The cmd was run 10 times with drop_caches in between
and the time it took was measured.
Without patch: 56.64143 +- 0.672 sec
With patches: 56.10 +- 0.21 sec
[akpm@linux-foundation.org: move page_evictable() to internal.h]
Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 12:06:20 +08:00
static inline bool mapping_unevictable(struct address_space *mapping)
2008-10-19 11:26:42 +08:00
{
2020-04-02 12:06:20 +08:00
	return mapping && test_bit(AS_UNEVICTABLE, &mapping->flags);
2008-10-19 11:26:42 +08:00
}

2014-04-04 05:47:49 +08:00
static inline void mapping_set_exiting(struct address_space *mapping)
{
	set_bit(AS_EXITING, &mapping->flags);
}

static inline int mapping_exiting(struct address_space *mapping)
{
	return test_bit(AS_EXITING, &mapping->flags);
}

2016-10-08 07:59:30 +08:00
static inline void mapping_set_no_writeback_tags(struct address_space *mapping)
{
	set_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
}

static inline int mapping_use_writeback_tags(struct address_space *mapping)
{
	return !test_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
}

2005-10-07 14:46:04 +08:00
static inline gfp_t mapping_gfp_mask(struct address_space *mapping)
2005-04-17 06:20:36 +08:00
{
2016-10-12 04:56:04 +08:00
	return mapping->gfp_mask;
2005-04-17 06:20:36 +08:00
}

2015-11-07 08:28:49 +08:00
/* Restricts the given gfp_mask to what the mapping allows. */
static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
		gfp_t gfp_mask)
{
	return mapping_gfp_mask(mapping) & gfp_mask;
}
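
mapping_gfp_constraint() is a plain bitwise intersection: a flag survives
only if both the mapping and the caller allow it. The sketch below shows
the effect in userspace. The model_gfp_t type and the MODEL_GFP_* values
are hypothetical and chosen only for illustration; they are not the
kernel's real GFP flag values.

```c
#include <assert.h>

typedef unsigned int model_gfp_t;

/* Hypothetical flag values, chosen only for this illustration. */
#define MODEL_GFP_IO	0x1u
#define MODEL_GFP_FS	0x2u
#define MODEL_GFP_WAIT	0x4u

/* Keep only the requested flags that the mapping's own mask permits. */
static model_gfp_t model_gfp_constraint(model_gfp_t mapping_mask,
					 model_gfp_t requested)
{
	return mapping_mask & requested;
}
```

For example, a mapping whose mask omits MODEL_GFP_FS strips that flag from
any request, so an allocation made on its behalf cannot recurse into the
filesystem in this model.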
|
|
|
|
|
2005-04-17 06:20:36 +08:00
/*
 * This is non-atomic. Only to be used before the mapping is activated.
 * Probably needs a barrier...
 */
2005-10-21 15:22:44 +08:00
static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
2005-04-17 06:20:36 +08:00
{
2016-10-12 04:56:04 +08:00
	m->gfp_mask = mask;
2005-04-17 06:20:36 +08:00
}

2021-08-29 18:07:03 +08:00
/**
 * mapping_set_large_folios() - Indicate the file supports large folios.
 * @mapping: The file.
 *
 * The filesystem should call this function in its inode constructor to
 * indicate that the VFS can use large folios to cache the contents of
 * the file.
 *
 * Context: This should not be called while the inode is active as it
 * is non-atomic.
 */
static inline void mapping_set_large_folios(struct address_space *mapping)
{
2021-08-29 18:28:19 +08:00
	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
2021-08-29 18:07:03 +08:00
}

2021-08-29 18:28:19 +08:00
static inline bool mapping_large_folio_support(struct address_space *mapping)
2020-10-16 11:06:00 +08:00
{
2021-08-29 18:28:19 +08:00
	return test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
2020-10-16 11:06:00 +08:00
}

2020-10-16 11:06:03 +08:00
static inline int filemap_nr_thps(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
	return atomic_read(&mapping->nr_thps);
#else
	return 0;
#endif
}

static inline void filemap_nr_thps_inc(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
2021-08-29 18:28:19 +08:00
	if (!mapping_large_folio_support(mapping))
2020-10-16 11:06:03 +08:00
		atomic_inc(&mapping->nr_thps);
#else
	WARN_ON_ONCE(1);
#endif
}

static inline void filemap_nr_thps_dec(struct address_space *mapping)
{
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
2021-08-29 18:28:19 +08:00
	if (!mapping_large_folio_support(mapping))
2020-10-16 11:06:03 +08:00
		atomic_dec(&mapping->nr_thps);
#else
	WARN_ON_ONCE(1);
#endif
}
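
The gating in filemap_nr_thps_inc() and filemap_nr_thps_dec() can be sketched in userspace with C11 atomics. This is only a model: `struct demo_thp_mapping` and the `demo_` functions are hypothetical stand-ins, and the sketch covers just the CONFIG_READ_ONLY_THP_FOR_FS branch shown above, where the counter is left alone once a mapping reports large folio support.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two address_space fields the sketch needs. */
struct demo_thp_mapping {
	bool large_folio_support;	/* models AS_LARGE_FOLIO_SUPPORT */
	atomic_int nr_thps;		/* models mapping->nr_thps */
};

/* Count a huge page only for mappings without large folio support. */
static inline void demo_nr_thps_inc(struct demo_thp_mapping *m)
{
	if (!m->large_folio_support)
		atomic_fetch_add(&m->nr_thps, 1);
}

/* Undo one demo_nr_thps_inc() under the same condition. */
static inline void demo_nr_thps_dec(struct demo_thp_mapping *m)
{
	if (!m->large_folio_support)
		atomic_fetch_sub(&m->nr_thps, 1);
}

/* Read the current count, as filemap_nr_thps() does. */
static inline int demo_nr_thps(struct demo_thp_mapping *m)
{
	return atomic_load(&m->nr_thps);
}
```
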
2017-11-16 09:37:55 +08:00
void release_pages(struct page **pages, int nr);
2005-04-17 06:20:36 +08:00

mm/util: Add folio_mapping() and folio_file_mapping()
These are the folio equivalent of page_mapping() and page_file_mapping().
Add an out-of-line page_mapping() wrapper around folio_mapping()
in order to prevent the page_folio() call from bloating every caller
of page_mapping(). Adjust page_file_mapping() and page_mapping_file()
to use folios internally. Rename __page_file_mapping() to
swapcache_mapping() and change it to take a folio.
This ends up saving 122 bytes of text overall. folio_mapping() is
45 bytes shorter than page_mapping() was, but the new page_mapping()
wrapper is 30 bytes. The major reduction is a few bytes less in dozens
of nfs functions (which call page_file_mapping()). Most of these appear
to be a slight change in gcc's register allocation decisions, which allow:
48 8b 56 08 mov 0x8(%rsi),%rdx
48 8d 42 ff lea -0x1(%rdx),%rax
83 e2 01 and $0x1,%edx
48 0f 44 c6 cmove %rsi,%rax
to become:
48 8b 46 08 mov 0x8(%rsi),%rax
48 8d 78 ff lea -0x1(%rax),%rdi
a8 01 test $0x1,%al
48 0f 44 fe cmove %rsi,%rdi
for a reduction of a single byte. Once the NFS client is converted to
use folios, this entire sequence will disappear.
Also add folio_mapping() documentation.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: David Howells <dhowells@redhat.com>
2020-12-10 23:55:05 +08:00
struct address_space *page_mapping(struct page *);
struct address_space *folio_mapping(struct folio *);
struct address_space *swapcache_mapping(struct folio *);

/**
 * folio_file_mapping - Find the mapping this folio belongs to.
 * @folio: The folio.
 *
 * For folios which are in the page cache, return the mapping that this
 * page belongs to. Folios in the swap cache return the mapping of the
 * swap file or swap device where the data is stored. This is different
 * from the mapping returned by folio_mapping(). The only reason to
 * use it is if, like NFS, you return 0 from ->activate_swapfile.
 *
 * Do not call this for folios which aren't in the page cache or swap cache.
 */
static inline struct address_space *folio_file_mapping(struct folio *folio)
{
	if (unlikely(folio_test_swapcache(folio)))
		return swapcache_mapping(folio);

	return folio->mapping;
}

static inline struct address_space *page_file_mapping(struct page *page)
{
	return folio_file_mapping(page_folio(page));
}

2021-04-30 13:55:35 +08:00
/*
 * For file cache pages, return the address_space, otherwise return NULL
 */
static inline struct address_space *page_mapping_file(struct page *page)
{
2020-12-10 23:55:05 +08:00
	struct folio *folio = page_folio(page);

	if (unlikely(folio_test_swapcache(folio)))
2021-04-30 13:55:35 +08:00
		return NULL;
2020-12-10 23:55:05 +08:00
	return folio_mapping(folio);
2021-04-30 13:55:35 +08:00
}

2021-08-13 05:09:57 +08:00
/**
 * folio_inode - Get the host inode for this folio.
 * @folio: The folio.
 *
 * For folios which are in the page cache, return the inode that this folio
 * belongs to.
 *
 * Do not call this for folios which aren't in the page cache.
 */
static inline struct inode *folio_inode(struct folio *folio)
{
	return folio->mapping->host;
}

2021-05-11 04:33:22 +08:00
static inline bool page_cache_add_speculative(struct page *page, int count)
2008-07-26 10:45:30 +08:00
{
2021-05-11 04:33:22 +08:00
	return folio_ref_try_add_rcu((struct folio *)page, count);
2019-03-06 07:48:49 +08:00
}
2008-07-30 13:23:13 +08:00

2021-05-11 04:33:22 +08:00
static inline bool page_cache_get_speculative(struct page *page)
2019-03-06 07:48:49 +08:00
{
2021-05-11 04:33:22 +08:00
	return page_cache_add_speculative(page, 1);
2008-07-30 13:23:13 +08:00
}

2020-06-02 12:47:38 +08:00
/**
2021-01-11 23:04:40 +08:00
 * folio_attach_private - Attach private data to a folio.
 * @folio: Folio to attach data to.
 * @data: Data to attach to folio.
2020-06-02 12:47:38 +08:00
 *
2021-01-11 23:04:40 +08:00
 * Attaching private data to a folio increments the page's reference count.
 * The data must be detached before the folio will be freed.
2020-06-02 12:47:38 +08:00
 */
2021-01-11 23:04:40 +08:00
static inline void folio_attach_private(struct folio *folio, void *data)
2020-06-02 12:47:38 +08:00
{
2021-01-11 23:04:40 +08:00
	folio_get(folio);
	folio->private = data;
	folio_set_private(folio);
2020-06-02 12:47:38 +08:00
}

2021-08-13 04:54:58 +08:00
/**
 * folio_change_private - Change private data on a folio.
 * @folio: Folio to change the data on.
 * @data: Data to set on the folio.
 *
 * Change the private data attached to a folio and return the old
 * data. The page must previously have had data attached and the data
 * must be detached before the folio will be freed.
 *
 * Return: Data that was previously attached to the folio.
 */
static inline void *folio_change_private(struct folio *folio, void *data)
{
	void *old = folio_get_private(folio);

	folio->private = data;
	return old;
}

2020-06-02 12:47:38 +08:00
/**
2021-01-11 23:04:40 +08:00
 * folio_detach_private - Detach private data from a folio.
 * @folio: Folio to detach data from.
2020-06-02 12:47:38 +08:00
 *
2021-01-11 23:04:40 +08:00
 * Removes the data that was previously attached to the folio and decrements
2020-06-02 12:47:38 +08:00
 * the refcount on the page.
 *
2021-01-11 23:04:40 +08:00
 * Return: Data that was attached to the folio.
2020-06-02 12:47:38 +08:00
 */
2021-01-11 23:04:40 +08:00
static inline void *folio_detach_private(struct folio *folio)
2020-06-02 12:47:38 +08:00
{
2021-01-11 23:04:40 +08:00
	void *data = folio_get_private(folio);
2020-06-02 12:47:38 +08:00

2021-01-11 23:04:40 +08:00
	if (!folio_test_private(folio))
2020-06-02 12:47:38 +08:00
		return NULL;
2021-01-11 23:04:40 +08:00
	folio_clear_private(folio);
	folio->private = NULL;
	folio_put(folio);
2020-06-02 12:47:38 +08:00

	return data;
}
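
The attach/detach pairing documented above can be modelled in plain C to show how the reference count moves. This is a sketch under stated assumptions, not the kernel implementation: `struct demo_folio` is a hypothetical stand-in that carries only a counter, a flag and the private pointer, and the `demo_` helpers replace folio_get(), folio_put() and the private-flag accessors with direct field updates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct folio; only the fields the sketch needs. */
struct demo_folio {
	int refcount;
	bool has_private;
	void *private;
};

/* Models folio_attach_private(): take a reference, then publish the data. */
static void demo_attach_private(struct demo_folio *f, void *data)
{
	f->refcount++;		/* the attached data pins the folio */
	f->private = data;
	f->has_private = true;
}

/* Models folio_detach_private(): return the data and drop that reference. */
static void *demo_detach_private(struct demo_folio *f)
{
	void *data = f->private;

	if (!f->has_private)
		return NULL;	/* nothing attached, refcount untouched */
	f->has_private = false;
	f->private = NULL;
	f->refcount--;		/* drop the reference taken at attach time */
	return data;
}
```

Detaching from a folio that never had data attached returns NULL and leaves the count alone, which is why the early return sits before the decrement.
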
2021-01-11 23:04:40 +08:00
static inline void attach_page_private(struct page *page, void *data)
{
	folio_attach_private(page_folio(page), data);
}

static inline void *detach_page_private(struct page *page)
{
	return folio_detach_private(page_folio(page));
}

2006-03-24 19:16:04 +08:00
#ifdef CONFIG_NUMA
2020-12-16 12:11:07 +08:00
struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
2006-03-24 19:16:04 +08:00
#else
2020-12-16 12:11:07 +08:00
static inline struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order)
2006-10-29 01:38:23 +08:00
{
2020-12-16 12:11:07 +08:00
	return folio_alloc(gfp, order);
2006-10-29 01:38:23 +08:00
}
#endif

2020-12-16 12:11:07 +08:00
static inline struct page *__page_cache_alloc(gfp_t gfp)
{
	return &filemap_alloc_folio(gfp, 0)->page;
}

2005-04-17 06:20:36 +08:00
static inline struct page *page_cache_alloc(struct address_space *x)
{
2006-10-29 01:38:23 +08:00
	return __page_cache_alloc(mapping_gfp_mask(x));
2005-04-17 06:20:36 +08:00
}

2016-07-27 06:24:53 +08:00
static inline gfp_t readahead_gfp_mask(struct address_space *x)
2011-05-25 08:12:25 +08:00
{
2017-11-16 09:38:03 +08:00
	return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
2011-05-25 08:12:25 +08:00
}

2005-04-17 06:20:36 +08:00
typedef int filler_t(void *, struct page *);

2017-11-22 03:07:06 +08:00
pgoff_t page_cache_next_miss(struct address_space *mapping,
2014-04-04 05:47:44 +08:00
			     pgoff_t index, unsigned long max_scan);
2017-11-22 03:07:06 +08:00
pgoff_t page_cache_prev_miss(struct address_space *mapping,
2014-04-04 05:47:44 +08:00
			     pgoff_t index, unsigned long max_scan);

2014-06-05 07:10:31 +08:00
#define FGP_ACCESSED		0x00000001
#define FGP_LOCK		0x00000002
#define FGP_CREAT		0x00000004
#define FGP_WRITE		0x00000008
#define FGP_NOFS		0x00000010
#define FGP_NOWAIT		0x00000020
filemap: kill page_cache_read usage in filemap_fault
Patch series "drop the mmap_sem when doing IO in the fault path", v6.
Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are all
gone now, but this one is sort of weird since it's not necessarily a
priority inversion that happens within the kernel, but rather because of
something userspace does.
We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has latency
spikes trying to get the mmap_sem for tasks that are in lower priority
cgroups.
This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on to keep from starving the writer. This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal. But
because a higher priority group depends on this completing it is now stuck
behind lower priority work.
In order to avoid this particular priority inversion we want to use the
existing retry mechanism to stop from holding the mmap_sem at all if we
are going to do IO. This already exists in the read case sort of, but
needed to be extended for more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.
The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and
simply cache the page and verify the page is still setup properly the next
pass through ->page_mkwrite().
I've tested these patches with xfstests and there are no regressions.
This patch (of 3):
If we do not have a page at filemap_fault time we'll do this weird forced
page_cache_read thing to populate the page, and then drop it again and
loop around and find it. This makes for 2 ways we can read a page in
filemap_fault, and it's not really needed. Instead add a FGP_FOR_MMAP
flag so that pagecache_get_page() will return an unlocked page that's in
pagecache. Then use the normal page locking and readpage logic already in
filemap_fault. This simplifies the no page in page cache case
significantly.
[akpm@linux-foundation.org: fix comment text]
[josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-14 02:44:14 +08:00
#define FGP_FOR_MMAP		0x00000040
2020-10-14 07:51:41 +08:00
#define FGP_HEAD		0x00000080
2021-02-26 09:15:36 +08:00
#define FGP_ENTRY		0x00000100
2020-12-25 01:55:56 +08:00
#define FGP_STABLE		0x00000200
2014-06-05 07:10:31 +08:00

2021-03-09 00:45:35 +08:00
struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
		int fgp_flags, gfp_t gfp);
struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
		int fgp_flags, gfp_t gfp);

/**
 * filemap_get_folio - Find and get a folio.
 * @mapping: The address_space to search.
 * @index: The page index.
 *
 * Looks up the page cache entry at @mapping & @index. If a folio is
 * present, it is returned with an increased refcount.
 *
 * Otherwise, %NULL is returned.
 */
static inline struct folio *filemap_get_folio(struct address_space *mapping,
					pgoff_t index)
{
	return __filemap_get_folio(mapping, index, 0, 0);
}
2014-06-05 07:10:31 +08:00

/**
 * find_get_page - find and get a page reference
 * @mapping: the address_space to search
 * @offset: the page index
 *
 * Looks up the page cache slot at @mapping & @offset. If there is a
 * page cache page, it is returned with an increased refcount.
 *
 * Otherwise, %NULL is returned.
 */
static inline struct page *find_get_page(struct address_space *mapping,
					pgoff_t offset)
{
2014-12-30 03:30:35 +08:00
	return pagecache_get_page(mapping, offset, 0, 0);
2014-06-05 07:10:31 +08:00
}

static inline struct page *find_get_page_flags(struct address_space *mapping,
					pgoff_t offset, int fgp_flags)
{
2014-12-30 03:30:35 +08:00
	return pagecache_get_page(mapping, offset, fgp_flags, 0);
2014-06-05 07:10:31 +08:00
}

/**
 * find_lock_page - locate, pin and lock a pagecache page
 * @mapping: the address_space to search
2020-10-27 17:51:17 +08:00
 * @index: the page index
2014-06-05 07:10:31 +08:00
 *
2020-10-27 17:51:17 +08:00
 * Looks up the page cache entry at @mapping & @index. If there is a
2014-06-05 07:10:31 +08:00
 * page cache page, it is returned locked and with an increased
 * refcount.
 *
2020-10-14 07:51:41 +08:00
 * Context: May sleep.
 * Return: A struct page or %NULL if there is no page in the cache for this
 * index.
2014-06-05 07:10:31 +08:00
 */
static inline struct page *find_lock_page(struct address_space *mapping,
2020-10-14 07:51:41 +08:00
					pgoff_t index)
{
	return pagecache_get_page(mapping, index, FGP_LOCK, 0);
}

2014-06-05 07:10:31 +08:00
/**
 * find_or_create_page - locate or add a pagecache page
 * @mapping: the page's address_space
 * @index: the page's index into the mapping
 * @gfp_mask: page allocation mode
 *
 * Looks up the page cache slot at @mapping & @index. If there is a
 * page cache page, it is returned locked and with an increased
 * refcount.
 *
 * If the page is not present, a new page is allocated using @gfp_mask
 * and added to the page cache and the VM's LRU list. The page is
 * returned locked and with an increased refcount.
 *
 * On memory exhaustion, %NULL is returned.
 *
 * find_or_create_page() may sleep, even if @gfp_mask specifies an
 * atomic allocation!
 */
static inline struct page *find_or_create_page(struct address_space *mapping,
2020-04-02 12:07:55 +08:00
					pgoff_t index, gfp_t gfp_mask)
2014-06-05 07:10:31 +08:00
{
2020-04-02 12:07:55 +08:00
	return pagecache_get_page(mapping, index,
2014-06-05 07:10:31 +08:00
					FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
2014-12-30 03:30:35 +08:00
					gfp_mask);
2014-06-05 07:10:31 +08:00
}

/**
 * grab_cache_page_nowait - returns locked page at given index in given cache
 * @mapping: target address_space
 * @index: the page index
 *
 * Same as grab_cache_page(), but do not wait if the page is unavailable.
 * This is intended for speculative data generators, where the data can
 * be regenerated if the page couldn't be grabbed. This routine should
 * be safe to call while holding the lock for another page.
 *
 * Clear __GFP_FS when allocating the page to avoid recursion into the fs
 * and deadlock against the caller's locked page.
 */
static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
				pgoff_t index)
{
	return pagecache_get_page(mapping, index,
			FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
2014-12-30 03:30:35 +08:00
			mapping_gfp_mask(mapping));
2014-06-05 07:10:31 +08:00
}
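
The wrappers above differ mainly in which FGP_* bits they OR together before calling pagecache_get_page(). The sketch below reproduces just that bit arithmetic in userspace. The FGP_* values are copied from the definitions in this header; the `demo_` functions are hypothetical helpers added for illustration and do not exist in the kernel.

```c
/* Values copied from the FGP_* definitions in this header. */
#define FGP_ACCESSED	0x00000001
#define FGP_LOCK	0x00000002
#define FGP_CREAT	0x00000004
#define FGP_NOFS	0x00000010
#define FGP_NOWAIT	0x00000020

/* Flag set that find_or_create_page() passes down. */
static inline int demo_find_or_create_flags(void)
{
	return FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
}

/* Flag set that grab_cache_page_nowait() passes down. */
static inline int demo_grab_nowait_flags(void)
{
	return FGP_LOCK | FGP_CREAT | FGP_NOFS | FGP_NOWAIT;
}

/* Hypothetical helper: true when every bit in @wanted is set in @fgp_flags. */
static inline int demo_fgp_has(int fgp_flags, int wanted)
{
	return (fgp_flags & wanted) == wanted;
}
```

Because each flag occupies its own bit, the two sets can be compared bit by bit: both lock and create, but only the nowait variant carries FGP_NOFS and FGP_NOWAIT, and only find_or_create_page() carries FGP_ACCESSED.
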
2021-01-16 12:39:21 +08:00
#define swapcache_index(folio)	__page_file_index(&(folio)->page)

/**
 * folio_index - File index of a folio.
 * @folio: The folio.
 *
 * For a folio which is either in the page cache or the swap cache,
 * return its index within the address_space it belongs to. If you know
 * the page is definitely in the page cache, you can look at the folio's
 * index directly.
 *
 * Return: The index (offset in units of pages) of a folio in its file.
 */
static inline pgoff_t folio_index(struct folio *folio)
{
	if (unlikely(folio_test_swapcache(folio)))
		return swapcache_index(folio);
	return folio->index;
}

2021-03-22 04:24:31 +08:00
/**
 * folio_next_index - Get the index of the next folio.
 * @folio: The current folio.
 *
 * Return: The index of the folio which follows this folio in the file.
 */
static inline pgoff_t folio_next_index(struct folio *folio)
{
	return folio->index + folio_nr_pages(folio);
}

2021-01-16 12:39:21 +08:00
/**
 * folio_file_page - The page for a particular index.
 * @folio: The folio which contains this index.
 * @index: The index we want to look up.
 *
 * Sometimes after looking up a folio in the page cache, we need to
 * obtain the specific page for an index (eg a page fault).
 *
 * Return: The page containing the file data for this index.
 */
static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
{
	/* HugeTLBfs indexes the page cache in units of hpage_size */
	if (folio_test_hugetlb(folio))
		return &folio->page;
	return folio_page(folio, index & (folio_nr_pages(folio) - 1));
}
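
The non-hugetlb branch of folio_file_page() picks the page with `index & (folio_nr_pages(folio) - 1)`. A small userspace sketch of that arithmetic follows. It assumes the page count is a power of two and that the folio starts at an index aligned to that count, which is what makes the mask equivalent to a modulo; `demo_page_offset_in_folio` is a hypothetical name.

```c
/*
 * Offset of @index inside a folio of @nr_pages pages.
 * Assumes @nr_pages is a power of two and the folio's first index is a
 * multiple of @nr_pages, so masking the low bits yields the offset.
 */
static inline unsigned long demo_page_offset_in_folio(unsigned long index,
						      unsigned long nr_pages)
{
	return index & (nr_pages - 1);
}
```

For a 16-page folio covering indices 32 to 47, index 35 maps to page 3 of the folio; for a single-page folio the mask is zero, so every index maps to page 0.
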
/**
 * folio_contains - Does this folio contain this index?
 * @folio: The folio.
 * @index: The page index within the file.
 *
 * Context: The caller should have the page locked in order to prevent
 * (eg) shmem from moving the page between the page cache and swap cache
 * and changing its index in the middle of the operation.
 * Return: true or false.
 */
static inline bool folio_contains(struct folio *folio, pgoff_t index)
{
	/* HugeTLBfs indexes the page cache in units of hpage_size */
	if (folio_test_hugetlb(folio))
		return folio->index == index;
	return index - folio_index(folio) < folio_nr_pages(folio);
}
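
The last line of folio_contains() checks both bounds with one comparison by relying on unsigned wraparound. The sketch below isolates that idiom in userspace, using `unsigned long` in place of pgoff_t; `demo_contains` is a hypothetical name.

```c
#include <stdbool.h>

/*
 * True when first <= index < first + nr, using a single unsigned compare.
 * If index < first, the subtraction wraps to a very large value, so the
 * comparison against nr fails without a separate lower-bound test.
 */
static inline bool demo_contains(unsigned long first, unsigned long nr,
				 unsigned long index)
{
	return index - first < nr;
}
```
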
2020-04-02 12:04:57 +08:00
/*
 * Given the page we found in the page cache, return the page corresponding
 * to this index in the file
 */
static inline struct page *find_subpage(struct page *head, pgoff_t index)
2019-09-24 06:34:52 +08:00
{
2020-04-02 12:04:57 +08:00
	/* HugeTLBfs wants the head page regardless */
	if (PageHuge(head))
		return head;
2019-09-24 06:34:52 +08:00

2020-08-15 08:30:37 +08:00
	return head + (index & (thp_nr_pages(head) - 1));
2019-09-24 06:34:52 +08:00
}

2017-09-07 07:21:21 +08:00
unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
			pgoff_t end, unsigned int nr_pages,
			struct page **pages);
static inline unsigned find_get_pages(struct address_space *mapping,
			pgoff_t *start, unsigned int nr_pages,
			struct page **pages)
{
	return find_get_pages_range(mapping, start, (pgoff_t)-1, nr_pages,
				    pages);
}
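
find_get_pages() is the ranged lookup with `(pgoff_t)-1` passed as the end, i.e. the largest representable index, so nothing is cut off. The sketch below shows the same "unbounded equals maximum value" pattern over a plain array. Everything in it is hypothetical: `demo_pgoff_t` stands in for pgoff_t, and treating the end of the range as inclusive is an assumption of the sketch.

```c
#include <stddef.h>

typedef unsigned long demo_pgoff_t;	/* stand-in for pgoff_t */

/*
 * Hypothetical range lookup over an array of present indices:
 * counts entries in [start, end], with @end inclusive (an assumption).
 */
static unsigned int demo_count_range(const demo_pgoff_t *present, size_t n,
				     demo_pgoff_t start, demo_pgoff_t end)
{
	unsigned int found = 0;

	for (size_t i = 0; i < n; i++)
		if (present[i] >= start && present[i] <= end)
			found++;
	return found;
}

/* Unbounded variant: (demo_pgoff_t)-1 is the largest index, so no entry is excluded. */
static unsigned int demo_count_from(const demo_pgoff_t *present, size_t n,
				    demo_pgoff_t start)
{
	return demo_count_range(present, n, start, (demo_pgoff_t)-1);
}
```
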
2006-04-27 14:46:01 +08:00
unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
			       unsigned int nr_pages, struct page **pages);
2017-11-16 09:34:33 +08:00
unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
2018-05-17 06:12:54 +08:00
			pgoff_t end, xa_mark_t tag, unsigned int nr_pages,
2017-11-16 09:34:33 +08:00
			struct page **pages);
static inline unsigned find_get_pages_tag(struct address_space *mapping,
2018-05-17 06:12:54 +08:00
			pgoff_t *index, xa_mark_t tag, unsigned int nr_pages,
2017-11-16 09:34:33 +08:00
			struct page **pages)
{
	return find_get_pages_range_tag(mapping, index, (pgoff_t)-1, tag,
					nr_pages, pages);
}
2005-04-17 06:20:36 +08:00

fs: symlink write_begin allocation context fix
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in its write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-05 04:00:53 +08:00
struct page *grab_cache_page_write_begin(struct address_space *mapping,
			pgoff_t index, unsigned flags);
2007-10-16 16:25:01 +08:00

2005-04-17 06:20:36 +08:00
/*
 * Returns locked page at given index in given cache, creating it if needed.
 */
2007-10-16 16:24:37 +08:00
static inline struct page *grab_cache_page(struct address_space *mapping,
								pgoff_t index)
2005-04-17 06:20:36 +08:00
{
	return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
}

2020-12-17 00:45:30 +08:00
struct folio *read_cache_folio(struct address_space *, pgoff_t index,
		filler_t *filler, void *data);
struct page *read_cache_page(struct address_space *, pgoff_t index,
		filler_t *filler, void *data);
2010-01-28 01:20:03 +08:00
extern struct page * read_cache_page_gfp(struct address_space *mapping,
				pgoff_t index, gfp_t gfp_mask);
2005-04-17 06:20:36 +08:00
extern int read_cache_pages(struct address_space *mapping,
		struct list_head *pages, filler_t *filler, void *data);

2006-06-23 17:05:08 +08:00
static inline struct page *read_mapping_page(struct address_space *mapping,
2011-07-26 08:12:23 +08:00
				pgoff_t index, void *data)
2006-06-23 17:05:08 +08:00
{
2019-07-12 11:55:20 +08:00
	return read_cache_page(mapping, index, NULL, data);
2006-06-23 17:05:08 +08:00
}

2020-12-17 00:45:30 +08:00
static inline struct folio *read_mapping_folio(struct address_space *mapping,
				pgoff_t index, void *data)
{
	return read_cache_folio(mapping, index, NULL, data);
}

2014-07-24 05:00:01 +08:00
/*
mm, futex: fix shared futex pgoff on shmem huge page
If more than one futex is placed on a shmem huge page, it can happen
that waking the second wakes the first instead, and leaves the second
waiting: the key's shared.pgoff is wrong.
When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when
generating futex_key") was written, the only shared huge pages came from hugetlbfs,
and the code added to deal with its exceptional page->index was put into
hugetlb source. Then that was missed when 4.8 added shmem huge pages.
page_to_pgoff() is what others use for this nowadays: except that, as
currently written, it gives the right answer on hugetlbfs head, but
nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific
hugetlb_basepage_index() on PageHuge tails as well as on head.
Yes, it's unconventional to declare hugetlb_basepage_index() there in
pagemap.h, rather than in hugetlb.h; but I do not expect anything but
page_to_pgoff() ever to need it.
[akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope]
Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Reported-by: Neel Natu <neelnatu@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Zhang Yi <wetpzy@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-25 09:39:52 +08:00
|
|
|
* Get index of the page within radix-tree (but not for hugetlb pages).
|
2016-12-01 07:54:19 +08:00
|
|
|
 * (TODO: remove once hugetlb pages have ->index in PAGE_SIZE units)
|
2014-07-24 05:00:01 +08:00
|
|
|
*/
|
2016-12-01 07:54:19 +08:00
|
|
|
static inline pgoff_t page_to_index(struct page *page)
|
2014-07-24 05:00:01 +08:00
|
|
|
{
|
2021-09-08 10:55:55 +08:00
|
|
|
struct page *head;
|
2016-01-16 08:54:10 +08:00
|
|
|
|
|
|
|
if (likely(!PageTransTail(page)))
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 20:29:47 +08:00
|
|
|
return page->index;
|
2016-01-16 08:54:10 +08:00
|
|
|
|
2021-09-08 10:55:55 +08:00
|
|
|
head = compound_head(page);
|
2016-01-16 08:54:10 +08:00
|
|
|
	/*
	 * We don't initialize ->index for tail pages: calculate based on
	 * head page
	 */
|
2021-09-08 10:55:55 +08:00
|
|
|
return head->index + page - head;
|
2014-07-24 05:00:01 +08:00
|
|
|
}
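The tail-page arithmetic in page_to_index() can be tried out in userspace. The following is a minimal sketch only: `struct demo_page`, its `head` pointer and `demo_page_to_index()` are hypothetical stand-ins invented for illustration, not the kernel's `struct page` layout or `compound_head()`. It assumes, as the code above does, that the pages of a compound page are contiguous in an array, so that pointer subtraction yields the tail's position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel's layout. */
struct demo_page {
	unsigned long index;		/* meaningful on the head page only */
	const struct demo_page *head;	/* NULL for a head or order-0 page */
};

/* Same shape as page_to_index(): tails derive their index from the head. */
static unsigned long demo_page_to_index(const struct demo_page *page)
{
	if (!page->head)
		return page->index;
	/* The pointer difference is the tail's position in the compound page. */
	return page->head->index + (unsigned long)(page - page->head);
}
```

With a head page at index 512, the third tail in the array would report 515 without ever having its own `index` field initialized.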
|
|
|
|
|
2021-06-25 09:39:52 +08:00
|
|
|
extern pgoff_t hugetlb_basepage_index(struct page *page);
|
|
|
|
|
2016-12-01 07:54:19 +08:00
|
|
|
/*
|
2021-06-25 09:39:52 +08:00
|
|
|
 * Get the offset in PAGE_SIZE units (even for hugetlb pages).
 * (TODO: hugetlb pages should have ->index in PAGE_SIZE units)
|
2016-12-01 07:54:19 +08:00
|
|
|
*/
|
|
|
|
static inline pgoff_t page_to_pgoff(struct page *page)
|
|
|
|
{
|
2021-06-25 09:39:52 +08:00
|
|
|
	if (unlikely(PageHuge(page)))
		return hugetlb_basepage_index(page);
|
2016-12-01 07:54:19 +08:00
|
|
|
return page_to_index(page);
|
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* Return byte-offset into filesystem object for page.
|
|
|
|
*/
|
|
|
|
static inline loff_t page_offset(struct page *page)
|
|
|
|
{
|
2016-04-01 20:29:47 +08:00
|
|
|
return ((loff_t)page->index) << PAGE_SHIFT;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
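The cast to loff_t before the shift in page_offset() matters whenever the index type is narrower than the offset type, as pgoff_t (an unsigned long) is on 32-bit builds. The sketch below illustrates that point in userspace. `DEMO_PAGE_SHIFT` is an assumed value of 12 (4 KiB pages), and both helper names are hypothetical; a 32-bit index is used deliberately to make the overflow visible on any host.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Widen first, then shift: the result keeps the high bits. */
static int64_t demo_page_offset(uint32_t index)
{
	return (int64_t)index << DEMO_PAGE_SHIFT;
}

/* Shift first, then widen: the high bits are lost in 32-bit arithmetic. */
static int64_t demo_page_offset_narrow(uint32_t index)
{
	return (int64_t)(uint32_t)(index << DEMO_PAGE_SHIFT);
}
```

For page index 0x100000 the widened form yields 4 GiB, while the narrow form wraps around to 0.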
|
|
|
|
|
2012-08-01 07:44:47 +08:00
|
|
|
static inline loff_t page_file_offset(struct page *page)
|
|
|
|
{
|
2016-10-08 08:00:24 +08:00
|
|
|
return ((loff_t)page_index(page)) << PAGE_SHIFT;
|
2012-08-01 07:44:47 +08:00
|
|
|
}
|
|
|
|
|
2020-12-24 20:25:19 +08:00
|
|
|
/**
 * folio_pos - Returns the byte position of this folio in its file.
 * @folio: The folio.
 */
static inline loff_t folio_pos(struct folio *folio)
{
	return page_offset(&folio->page);
}
|
|
|
|
|
|
|
|
/**
 * folio_file_pos - Returns the byte position of this folio in its file.
 * @folio: The folio.
 *
 * This differs from folio_pos() for folios which belong to a swap file.
 * NFS is the only filesystem today which needs to use folio_file_pos().
 */
static inline loff_t folio_file_pos(struct folio *folio)
{
	return page_file_offset(&folio->page);
}
|
|
|
|
|
2010-05-28 08:29:16 +08:00
|
|
|
extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma,
|
|
|
|
unsigned long address);
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
|
|
|
|
unsigned long address)
|
|
|
|
{
|
2010-05-28 08:29:16 +08:00
|
|
|
	pgoff_t pgoff;
	if (unlikely(is_vm_hugetlb_page(vma)))
		return linear_hugepage_index(vma, address);
	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
|
2005-04-17 06:20:36 +08:00
|
|
|
pgoff += vma->vm_pgoff;
|
2016-04-01 20:29:47 +08:00
|
|
|
return pgoff;
|
2005-04-17 06:20:36 +08:00
|
|
|
}
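The non-hugetlb path of linear_page_index() is plain arithmetic: the page distance from the start of the mapping, plus the file offset (in pages) at which the mapping begins. A userspace sketch follows, under stated assumptions: `struct demo_vma` carries only the two fields used here, `DEMO_PAGE_SHIFT` is assumed to be 12, and the hugetlb branch is left out entirely.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical subset of struct vm_area_struct. */
struct demo_vma {
	unsigned long vm_start;	/* first virtual address of the mapping */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

static unsigned long demo_linear_page_index(const struct demo_vma *vma,
					    unsigned long address)
{
	unsigned long pgoff = (address - vma->vm_start) >> DEMO_PAGE_SHIFT;

	return pgoff + vma->vm_pgoff;
}
```

For a mapping that starts at 0x400000 and maps the file from page 16 onward, address 0x403123 lies three pages in, so it resolves to file page 19.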
|
|
|
|
|
2020-05-23 22:22:14 +08:00
|
|
|
struct wait_page_key {
|
2021-01-17 00:22:14 +08:00
|
|
|
struct folio *folio;
|
2020-05-23 22:22:14 +08:00
|
|
|
int bit_nr;
|
|
|
|
int page_match;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct wait_page_queue {
|
2021-01-17 00:22:14 +08:00
|
|
|
struct folio *folio;
|
2020-05-23 22:22:14 +08:00
|
|
|
int bit_nr;
|
|
|
|
wait_queue_entry_t wait;
|
|
|
|
};
|
|
|
|
|
2020-08-04 04:01:22 +08:00
|
|
|
static inline bool wake_page_match(struct wait_page_queue *wait_page,
|
2020-05-23 22:22:14 +08:00
|
|
|
struct wait_page_key *key)
|
|
|
|
{
|
2021-01-17 00:22:14 +08:00
|
|
|
if (wait_page->folio != key->folio)
|
2020-08-04 04:01:22 +08:00
|
|
|
return false;
|
2020-05-23 22:22:14 +08:00
|
|
|
key->page_match = 1;
|
|
|
|
|
|
|
|
if (wait_page->bit_nr != key->bit_nr)
|
2020-08-04 04:01:22 +08:00
|
|
|
return false;
|
2020-05-23 00:18:23 +08:00
|
|
|
|
2020-08-04 04:01:22 +08:00
|
|
|
return true;
|
2020-05-23 00:18:23 +08:00
|
|
|
}
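wake_page_match() records a folio match in the key even when the bit number differs, so the caller can tell "nobody waits on this folio" apart from "someone waits, but on another bit". The sketch below reproduces that two-step check in userspace; `struct demo_key` and `struct demo_waiter` are hypothetical reductions of the structures above, with the folio represented by an opaque pointer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reductions of wait_page_key and wait_page_queue. */
struct demo_key {
	const void *folio;
	int bit_nr;
	int page_match;
};

struct demo_waiter {
	const void *folio;
	int bit_nr;
};

static bool demo_wake_match(const struct demo_waiter *w, struct demo_key *key)
{
	if (w->folio != key->folio)
		return false;
	/* Same folio: remember that, even if the bit turns out to differ. */
	key->page_match = 1;
	if (w->bit_nr != key->bit_nr)
		return false;
	return true;
}
```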
|
|
|
|
|
2021-03-02 08:38:25 +08:00
|
|
|
void __folio_lock(struct folio *folio);
|
2020-12-08 13:07:31 +08:00
|
|
|
int __folio_lock_killable(struct folio *folio);
|
2021-03-19 09:39:45 +08:00
|
|
|
bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
|
2010-10-27 05:21:57 +08:00
|
|
|
unsigned int flags);
|
2020-12-08 04:44:35 +08:00
|
|
|
void unlock_page(struct page *page);
|
|
|
|
void folio_unlock(struct folio *folio);
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-03-02 08:38:25 +08:00
|
|
|
static inline bool folio_trylock(struct folio *folio)
{
	return likely(!test_and_set_bit_lock(PG_locked, folio_flags(folio, 0)));
}
|
|
|
|
|
2019-07-12 11:54:59 +08:00
|
|
|
/*
|
|
|
|
* Return true if the page was successfully locked
|
|
|
|
*/
|
2008-08-02 18:01:03 +08:00
|
|
|
static inline int trylock_page(struct page *page)
|
|
|
|
{
|
2021-03-02 08:38:25 +08:00
|
|
|
return folio_trylock(page_folio(page));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void folio_lock(struct folio *folio)
{
	might_sleep();
	if (!folio_trylock(folio))
		__folio_lock(folio);
|
2008-08-02 18:01:03 +08:00
|
|
|
}
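The lock helpers above share one pattern: try an atomic test-and-set of a lock bit first, and fall back to a sleeping slow path only when the bit was already set. The fast path can be sketched in userspace with C11 atomics. This is an illustration under assumptions, not the kernel implementation: `struct demo_folio`, `DEMO_LOCKED_BIT` and both functions are hypothetical, `atomic_fetch_or_explicit()` with acquire ordering stands in for `test_and_set_bit_lock()`, and the slow path is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_LOCKED_BIT 0	/* hypothetical position of the lock bit */

struct demo_folio {
	atomic_ulong flags;
};

/* Returns true if this caller set the bit, i.e. took the lock. */
static bool demo_trylock(struct demo_folio *f)
{
	unsigned long mask = 1UL << DEMO_LOCKED_BIT;
	unsigned long old = atomic_fetch_or_explicit(&f->flags, mask,
						     memory_order_acquire);

	return !(old & mask);
}

/* Clears the bit with release ordering; waking waiters is not modelled. */
static void demo_unlock(struct demo_folio *f)
{
	atomic_fetch_and_explicit(&f->flags, ~(1UL << DEMO_LOCKED_BIT),
				  memory_order_release);
}
```

A second `demo_trylock()` on a held lock fails without blocking, which is the point at which the real code would call into `__folio_lock()`.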
|
|
|
|
|
2006-09-26 14:31:24 +08:00
|
|
|
/*
|
|
|
|
* lock_page may only be called if we have the page's inode pinned.
|
|
|
|
*/
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline void lock_page(struct page *page)
|
|
|
|
{
|
2021-03-02 08:38:25 +08:00
|
|
|
struct folio *folio;
|
2005-04-17 06:20:36 +08:00
|
|
|
might_sleep();
|
2021-03-02 08:38:25 +08:00
|
|
|
|
|
|
|
folio = page_folio(page);
|
|
|
|
if (!folio_trylock(folio))
|
|
|
|
__folio_lock(folio);
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
2006-09-26 14:31:24 +08:00
|
|
|
|
2020-12-08 13:07:31 +08:00
|
|
|
static inline int folio_lock_killable(struct folio *folio)
{
	might_sleep();
	if (!folio_trylock(folio))
		return __folio_lock_killable(folio);
	return 0;
}
|
|
|
|
|
2007-12-07 00:18:49 +08:00
|
|
|
/*
 * lock_page_killable is like lock_page but can be interrupted by fatal
 * signals.  It returns 0 if it locked the page and -EINTR if it was
 * killed while waiting.
 */
static inline int lock_page_killable(struct page *page)
{
|
2020-12-08 13:07:31 +08:00
|
|
|
return folio_lock_killable(page_folio(page));
|
2007-12-07 00:18:49 +08:00
|
|
|
}
|
|
|
|
|
2010-10-27 05:21:57 +08:00
|
|
|
/*
|
|
|
|
* lock_page_or_retry - Lock the page, unless this would block and the
|
|
|
|
* caller indicated that it can handle a retry.
|
2014-08-07 07:07:24 +08:00
|
|
|
*
|
2020-06-09 12:33:54 +08:00
|
|
|
* Return value and mmap_lock implications depend on flags; see
|
2021-03-19 09:39:45 +08:00
|
|
|
* __folio_lock_or_retry().
|
2010-10-27 05:21:57 +08:00
|
|
|
*/
|
2021-03-19 09:39:45 +08:00
|
|
|
static inline bool lock_page_or_retry(struct page *page, struct mm_struct *mm,
|
2010-10-27 05:21:57 +08:00
|
|
|
unsigned int flags)
|
|
|
|
{
|
2021-03-19 09:39:45 +08:00
|
|
|
struct folio *folio;
|
2010-10-27 05:21:57 +08:00
|
|
|
might_sleep();
|
2021-03-19 09:39:45 +08:00
|
|
|
|
|
|
|
folio = page_folio(page);
|
|
|
|
return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
|
2010-10-27 05:21:57 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2021-03-05 01:02:54 +08:00
|
|
|
* This is exported only for folio_wait_locked/folio_wait_writeback, etc.,
|
2017-02-23 07:44:41 +08:00
|
|
|
* and should not be used directly.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2021-03-05 01:02:54 +08:00
|
|
|
void folio_wait_bit(struct folio *folio, int bit_nr);
|
|
|
|
int folio_wait_bit_killable(struct folio *folio, int bit_nr);
|
2014-09-24 09:28:32 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2021-03-04 23:21:02 +08:00
|
|
|
* Wait for a folio to be unlocked.
|
2005-04-17 06:20:36 +08:00
|
|
|
*
|
2021-03-04 23:21:02 +08:00
|
|
|
* This must be called with the caller "holding" the folio,
|
|
|
|
 * i.e. with increased "page->count" so that the folio won't
|
2005-04-17 06:20:36 +08:00
|
|
|
 * go away during the wait.
|
|
|
|
*/
|
2021-03-04 23:21:02 +08:00
|
|
|
static inline void folio_wait_locked(struct folio *folio)
|
|
|
|
{
|
|
|
|
if (folio_test_locked(folio))
|
2021-03-05 01:02:54 +08:00
|
|
|
folio_wait_bit(folio, PG_locked);
|
2021-03-04 23:21:02 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline int folio_wait_locked_killable(struct folio *folio)
|
|
|
|
{
|
|
|
|
if (!folio_test_locked(folio))
|
|
|
|
return 0;
|
2021-03-05 01:02:54 +08:00
|
|
|
return folio_wait_bit_killable(folio, PG_locked);
|
2021-03-04 23:21:02 +08:00
|
|
|
}
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
static inline void wait_on_page_locked(struct page *page)
|
|
|
|
{
|
2021-03-04 23:21:02 +08:00
|
|
|
folio_wait_locked(page_folio(page));
|
2005-04-17 06:20:36 +08:00
|
|
|
}
|
|
|
|
|
2016-12-25 11:00:30 +08:00
|
|
|
static inline int wait_on_page_locked_killable(struct page *page)
|
|
|
|
{
|
2021-03-04 23:21:02 +08:00
|
|
|
return folio_wait_locked_killable(page_folio(page));
|
2016-12-25 11:00:30 +08:00
|
|
|
}
|
|
|
|
|
2021-08-17 11:36:31 +08:00
|
|
|
int folio_put_wait_locked(struct folio *folio, int state);
|
2019-05-14 08:23:11 +08:00
|
|
|
void wait_on_page_writeback(struct page *page);
|
2021-03-05 00:09:17 +08:00
|
|
|
void folio_wait_writeback(struct folio *folio);
|
|
|
|
int folio_wait_writeback_killable(struct folio *folio);
|
2021-03-04 04:21:55 +08:00
|
|
|
void end_page_writeback(struct page *page);
|
|
|
|
void folio_end_writeback(struct folio *folio);
|
mm: only enforce stable page writes if the backing device requires it
Create a helper function to check if a backing device requires stable
page writes and, if so, performs the necessary wait. Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function. This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.
Before this patchset, all filesystems would block, regardless of whether
or not it was necessary. ext3 would wait, but still generate occasional
checksum errors. The network filesystems were left to do their own
thing, so they'd wait too.
After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it. ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait. Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.
Here's the result of using dbench to test latency on ext2:
3.8.0-rc3:
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 109347 0.028 59.817
ReadX 347180 0.004 3.391
Flush 15514 29.828 287.283
Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms
3.8.0-rc3 + patches:
WriteX 105556 0.029 4.273
ReadX 335004 0.005 4.112
Flush 14982 30.540 298.634
Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
As you can see, the maximum write latency drops considerably with this
patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-22 08:42:51 +08:00
|
|
|
void wait_for_stable_page(struct page *page);
|
2021-03-05 00:25:25 +08:00
|
|
|
void folio_wait_stable(struct folio *folio);
|
2021-05-04 23:01:10 +08:00
|
|
|
void __folio_mark_dirty(struct folio *folio, struct address_space *, int warn);
static inline void __set_page_dirty(struct page *page,
		struct address_space *mapping, int warn)
{
	__folio_mark_dirty(page_folio(page), mapping, warn);
}
|
2021-05-05 04:12:09 +08:00
|
|
|
void folio_account_cleaned(struct folio *folio, struct address_space *mapping,
|
|
|
|
struct bdi_writeback *wb);
|
2021-03-09 05:43:04 +08:00
|
|
|
void __folio_cancel_dirty(struct folio *folio);
static inline void folio_cancel_dirty(struct folio *folio)
{
	/* Avoid atomic ops, locking, etc. when not actually needed. */
	if (folio_test_dirty(folio))
		__folio_cancel_dirty(folio);
}
static inline void cancel_dirty_page(struct page *page)
{
	folio_cancel_dirty(page_folio(page));
}
|
2021-03-01 05:21:20 +08:00
|
|
|
bool folio_clear_dirty_for_io(struct folio *folio);
|
|
|
|
bool clear_page_dirty_for_io(struct page *page);
|
2021-03-10 02:48:03 +08:00
|
|
|
int __must_check folio_write_one(struct folio *folio);
static inline int __must_check write_one_page(struct page *page)
{
	return folio_write_one(page_folio(page));
}
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2021-06-29 10:36:30 +08:00
|
|
|
int __set_page_dirty_nobuffers(struct page *page);
|
|
|
|
int __set_page_dirty_no_writeback(struct page *page);
|
|
|
|
|
2016-08-05 22:11:04 +08:00
|
|
|
void page_endio(struct page *page, bool is_write, int err);
|
2014-06-05 07:07:45 +08:00
|
|
|
|
2021-04-23 10:58:32 +08:00
|
|
|
void folio_end_private_2(struct folio *folio);
|
|
|
|
void folio_wait_private_2(struct folio *folio);
|
|
|
|
int folio_wait_private_2_killable(struct folio *folio);
|
2020-02-10 18:00:21 +08:00
|
|
|
|
2009-04-03 23:42:39 +08:00
|
|
|
/*
 * Add an arbitrary waiter to a folio's wait queue
 */
|
2021-01-17 00:22:14 +08:00
|
|
|
void folio_add_wait_queue(struct folio *folio, wait_queue_entry_t *waiter);
|
2009-04-03 23:42:39 +08:00
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
2021-08-02 19:44:20 +08:00
|
|
|
* Fault in userspace address range.
|
2005-04-17 06:20:36 +08:00
|
|
|
*/
|
2021-08-02 19:44:20 +08:00
|
|
|
size_t fault_in_writeable(char __user *uaddr, size_t size);
|
2021-07-05 23:26:28 +08:00
|
|
|
size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
|
2021-08-02 19:44:20 +08:00
|
|
|
size_t fault_in_readable(const char __user *uaddr, size_t size);
|
2012-03-26 01:47:41 +08:00
|
|
|
|
2008-08-02 18:01:03 +08:00
|
|
|
int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
|
2020-12-08 21:56:28 +08:00
|
|
|
pgoff_t index, gfp_t gfp);
|
2008-08-02 18:01:03 +08:00
|
|
|
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
|
2020-12-08 21:56:28 +08:00
|
|
|
pgoff_t index, gfp_t gfp);
|
|
|
|
int filemap_add_folio(struct address_space *mapping, struct folio *folio,
|
|
|
|
pgoff_t index, gfp_t gfp);
|
2021-05-09 21:33:42 +08:00
|
|
|
void filemap_remove_folio(struct folio *folio);
void delete_from_page_cache(struct page *page);
void __filemap_remove_folio(struct folio *folio, void *shadow);
static inline void __delete_from_page_cache(struct page *page, void *shadow)
{
	__filemap_remove_folio(page_folio(page), shadow);
}
|
2021-02-25 04:01:42 +08:00
|
|
|
void replace_page_cache_page(struct page *old, struct page *new);
|
2017-11-16 09:37:33 +08:00
|
|
|
void delete_from_page_cache_batch(struct address_space *mapping,
|
2021-12-08 03:15:07 +08:00
|
|
|
struct folio_batch *fbatch);
|
2021-07-29 03:14:48 +08:00
|
|
|
int try_to_release_page(struct page *page, gfp_t gfp);
|
|
|
|
bool filemap_release_folio(struct folio *folio, gfp_t gfp);
|
2021-02-26 09:15:48 +08:00
|
|
|
loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
|
|
|
|
int whence);
|
2008-08-02 18:01:03 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Like add_to_page_cache_locked, but used to add newly allocated pages:
|
2016-01-16 08:51:24 +08:00
|
|
|
* the page is new, so we can just run __SetPageLocked() against it.
|
2008-08-02 18:01:03 +08:00
|
|
|
*/
|
|
|
|
static inline int add_to_page_cache(struct page *page,
|
|
|
|
struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
2016-01-16 08:51:24 +08:00
|
|
|
__SetPageLocked(page);
|
2008-08-02 18:01:03 +08:00
|
|
|
error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
|
|
|
|
if (unlikely(error))
|
2016-01-16 08:51:24 +08:00
|
|
|
__ClearPageLocked(page);
|
2008-08-02 18:01:03 +08:00
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
2020-12-08 21:56:28 +08:00
|
|
|
/* Must be non-static for BPF error injection */
|
|
|
|
int __filemap_add_folio(struct address_space *mapping, struct folio *folio,
|
|
|
|
pgoff_t index, gfp_t gfp, void **shadowp);
|
|
|
|
|
2021-10-28 22:47:05 +08:00
|
|
|
bool filemap_range_has_writeback(struct address_space *mapping,
|
|
|
|
loff_t start_byte, loff_t end_byte);
|
|
|
|
|
|
|
|
/**
 * filemap_range_needs_writeback - check if range potentially needs writeback
 * @mapping:           address space within which to check
 * @start_byte:        offset in bytes where the range starts
 * @end_byte:          offset in bytes where the range ends (inclusive)
 *
 * Find at least one page in the range supplied, usually used to check if
 * direct writing in this range will trigger a writeback. Used by O_DIRECT
 * read/write with IOCB_NOWAIT, to see if the caller needs to do
 * filemap_write_and_wait_range() before proceeding.
 *
 * Return: %true if the caller should do filemap_write_and_wait_range() before
 * doing O_DIRECT to a page in this range, %false otherwise.
 */
static inline bool filemap_range_needs_writeback(struct address_space *mapping,
						 loff_t start_byte,
						 loff_t end_byte)
{
	if (!mapping->nrpages)
		return false;
	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
	    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
		return false;
	return filemap_range_has_writeback(mapping, start_byte, end_byte);
|
|
|
|
}
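filemap_range_needs_writeback() orders its checks from cheapest to most expensive: an empty mapping or a mapping with neither tag set returns early, and only then is the range actually walked. The userspace sketch below models that ordering. Everything in it is hypothetical: `struct demo_mapping` replaces the address_space with three plain fields, the byte range is dropped for brevity, and a call counter stands in for the cost of filemap_range_has_writeback().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the checks look at. */
struct demo_mapping {
	unsigned long nrpages;
	bool tagged_dirty;
	bool tagged_writeback;
	unsigned int slow_checks;	/* how often the expensive path ran */
	bool slow_result;		/* what the expensive path reports */
};

/* Stand-in for the expensive range walk. */
static bool demo_range_has_writeback(struct demo_mapping *m)
{
	m->slow_checks++;
	return m->slow_result;
}

static bool demo_range_needs_writeback(struct demo_mapping *m)
{
	if (!m->nrpages)
		return false;
	if (!m->tagged_dirty && !m->tagged_writeback)
		return false;
	return demo_range_has_writeback(m);
}
```

In this model the counter stays at zero for an empty or untagged mapping, which is the saving the early returns are there to provide.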
|
|
|
|
|
2020-06-02 12:46:21 +08:00
|
|
|
/**
 * struct readahead_control - Describes a readahead request.
 *
 * A readahead request is for consecutive pages.  Filesystems which
 * implement the ->readahead method should call readahead_page() or
 * readahead_page_batch() in a loop and attempt to start I/O against
 * each page in the request.
 *
 * Most of the fields in this struct are private and should be accessed
 * by the functions below.
 *
 * @file: The file, used primarily by network filesystems for authentication.
 *	  May be NULL if invoked internally by the filesystem.
 * @mapping: Readahead this filesystem object.
|
2021-04-08 04:18:55 +08:00
|
|
|
* @ra: File readahead state. May be NULL.
|
2020-06-02 12:46:21 +08:00
|
|
|
*/
|
|
|
|
struct readahead_control {
|
|
|
|
struct file *file;
|
|
|
|
struct address_space *mapping;
|
2021-04-08 04:18:55 +08:00
|
|
|
struct file_ra_state *ra;
|
2020-06-02 12:46:21 +08:00
|
|
|
/* private: use the readahead_* accessors instead */
|
|
|
|
pgoff_t _index;
|
|
|
|
unsigned int _nr_pages;
|
|
|
|
unsigned int _batch_count;
|
|
|
|
};

#define DEFINE_READAHEAD(ractl, f, r, m, i)				\
	struct readahead_control ractl = {				\
		.file = f,						\
		.mapping = m,						\
		.ra = r,						\
		._index = i,						\
	}

#define VM_READAHEAD_PAGES	(SZ_128K / PAGE_SIZE)

void page_cache_ra_unbounded(struct readahead_control *,
		unsigned long nr_to_read, unsigned long lookahead_count);
void page_cache_sync_ra(struct readahead_control *, unsigned long req_count);
void page_cache_async_ra(struct readahead_control *, struct folio *,
		unsigned long req_count);
void readahead_expand(struct readahead_control *ractl,
		      loff_t new_start, size_t new_len);

/**
 * page_cache_sync_readahead - generic file readahead
 * @mapping: address_space which holds the pagecache and I/O vectors
 * @ra: file_ra_state which holds the readahead state
 * @file: Used by the filesystem for authentication.
 * @index: Index of first page to be read.
 * @req_count: Total number of pages being read by the caller.
 *
 * page_cache_sync_readahead() should be called when a cache miss happened:
 * it will submit the read. The readahead logic may decide to piggyback more
 * pages onto the read request if access patterns suggest it will improve
 * performance.
 */
static inline
void page_cache_sync_readahead(struct address_space *mapping,
		struct file_ra_state *ra, struct file *file, pgoff_t index,
		unsigned long req_count)
{
	DEFINE_READAHEAD(ractl, file, ra, mapping, index);
	page_cache_sync_ra(&ractl, req_count);
}

/**
 * page_cache_async_readahead - file readahead for marked pages
 * @mapping: address_space which holds the pagecache and I/O vectors
 * @ra: file_ra_state which holds the readahead state
 * @file: Used by the filesystem for authentication.
 * @page: The page at @index which triggered the readahead call.
 * @index: Index of first page to be read.
 * @req_count: Total number of pages being read by the caller.
 *
 * page_cache_async_readahead() should be called when a page is used which
 * is marked as PageReadahead; this is a marker to suggest that the application
 * has used up enough of the readahead window that we should start pulling in
 * more pages.
 */
static inline
void page_cache_async_readahead(struct address_space *mapping,
		struct file_ra_state *ra, struct file *file,
		struct page *page, pgoff_t index, unsigned long req_count)
{
	DEFINE_READAHEAD(ractl, file, ra, mapping, index);
	page_cache_async_ra(&ractl, page_folio(page), req_count);
}

static inline struct folio *__readahead_folio(struct readahead_control *ractl)
{
	struct folio *folio;

	BUG_ON(ractl->_batch_count > ractl->_nr_pages);
	ractl->_nr_pages -= ractl->_batch_count;
	ractl->_index += ractl->_batch_count;

	if (!ractl->_nr_pages) {
		ractl->_batch_count = 0;
		return NULL;
	}

	folio = xa_load(&ractl->mapping->i_pages, ractl->_index);
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	ractl->_batch_count = folio_nr_pages(folio);

	return folio;
}
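
/*
 * Illustration only, not part of the header: a minimal userspace sketch of
 * the cursor bookkeeping that __readahead_folio() performs above. The names
 * struct ra_model and ra_model_next() are hypothetical, and the size of the
 * next folio is passed in by the caller instead of being looked up in the
 * page cache. Each call first consumes the previous batch, then records the
 * size of the next one.
 */

```c
#include <assert.h>

/* Hypothetical stand-in for the private fields of struct readahead_control. */
struct ra_model {
	unsigned long index;		/* models _index */
	unsigned int nr_pages;		/* models _nr_pages */
	unsigned int batch_count;	/* models _batch_count */
};

/*
 * Step past the previously returned folio, then account for the next one.
 * folio_pages is the assumed size, in pages, of the folio at the new index.
 * Returns that size, or 0 once the request is exhausted (the point at which
 * the kernel helper returns NULL).
 */
static unsigned int ra_model_next(struct ra_model *ra, unsigned int folio_pages)
{
	assert(ra->batch_count <= ra->nr_pages);	/* mirrors the BUG_ON() */
	ra->nr_pages -= ra->batch_count;
	ra->index += ra->batch_count;

	if (!ra->nr_pages) {
		ra->batch_count = 0;
		return 0;
	}

	ra->batch_count = folio_pages;
	return folio_pages;
}
```

/*
 * Because the previous batch is only subtracted on the following call, the
 * index and count still describe the folio most recently handed out.
 */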

/**
 * readahead_page - Get the next page to read.
 * @ractl: The current readahead request.
 *
 * Context: The page is locked and has an elevated refcount. The caller
 * should decrease the refcount once the page has been submitted for I/O
 * and unlock the page once all I/O to that page has completed.
 * Return: A pointer to the next page, or %NULL if we are done.
 */
static inline struct page *readahead_page(struct readahead_control *ractl)
{
	struct folio *folio = __readahead_folio(ractl);

	return &folio->page;
}

/**
 * readahead_folio - Get the next folio to read.
 * @ractl: The current readahead request.
 *
 * Context: The folio is locked. The caller should unlock the folio once
 * all I/O to that folio has completed.
 * Return: A pointer to the next folio, or %NULL if we are done.
 */
static inline struct folio *readahead_folio(struct readahead_control *ractl)
{
	struct folio *folio = __readahead_folio(ractl);

	if (folio)
		folio_put(folio);
	return folio;
}

static inline unsigned int __readahead_batch(struct readahead_control *rac,
		struct page **array, unsigned int array_sz)
{
	unsigned int i = 0;
	XA_STATE(xas, &rac->mapping->i_pages, 0);
	struct page *page;

	BUG_ON(rac->_batch_count > rac->_nr_pages);
	rac->_nr_pages -= rac->_batch_count;
	rac->_index += rac->_batch_count;
	rac->_batch_count = 0;

	xas_set(&xas, rac->_index);
	rcu_read_lock();
	xas_for_each(&xas, page, rac->_index + rac->_nr_pages - 1) {
		if (xas_retry(&xas, page))
			continue;
		VM_BUG_ON_PAGE(!PageLocked(page), page);
		VM_BUG_ON_PAGE(PageTail(page), page);
		array[i++] = page;
		rac->_batch_count += thp_nr_pages(page);
		if (i == array_sz)
			break;
	}
	rcu_read_unlock();

	return i;
}

/**
 * readahead_page_batch - Get a batch of pages to read.
 * @rac: The current readahead request.
 * @array: An array of pointers to struct page.
 *
 * Context: The pages are locked and have an elevated refcount. The caller
 * should decrease the refcount once the page has been submitted for I/O
 * and unlock the page once all I/O to that page has completed.
 * Return: The number of pages placed in the array. 0 indicates the request
 * is complete.
 */
#define readahead_page_batch(rac, array)				\
	__readahead_batch(rac, array, ARRAY_SIZE(array))

/**
 * readahead_pos - The byte offset into the file of this readahead request.
 * @rac: The readahead request.
 */
static inline loff_t readahead_pos(struct readahead_control *rac)
{
	return (loff_t)rac->_index * PAGE_SIZE;
}

/**
 * readahead_length - The number of bytes in this readahead request.
 * @rac: The readahead request.
 */
static inline size_t readahead_length(struct readahead_control *rac)
{
	return rac->_nr_pages * PAGE_SIZE;
}

/**
 * readahead_index - The index of the first page in this readahead request.
 * @rac: The readahead request.
 */
static inline pgoff_t readahead_index(struct readahead_control *rac)
{
	return rac->_index;
}

/**
 * readahead_count - The number of pages in this readahead request.
 * @rac: The readahead request.
 */
static inline unsigned int readahead_count(struct readahead_control *rac)
{
	return rac->_nr_pages;
}

/**
 * readahead_batch_length - The number of bytes in the current batch.
 * @rac: The readahead request.
 */
static inline size_t readahead_batch_length(struct readahead_control *rac)
{
	return rac->_batch_count * PAGE_SIZE;
}

static inline unsigned long dir_pages(struct inode *inode)
{
	return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
			       PAGE_SHIFT;
}
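
/*
 * Illustration only, not part of the header: a small userspace sketch of the
 * round-up that dir_pages() performs. DP_PAGE_SHIFT, DP_PAGE_SIZE and
 * dp_model_dir_pages() are hypothetical names, and a 4 KiB page size is
 * assumed; the kernel uses the architecture's PAGE_SIZE and PAGE_SHIFT.
 */

```c
#include <assert.h>

#define DP_PAGE_SHIFT	12			/* assumption: 4 KiB pages */
#define DP_PAGE_SIZE	(1UL << DP_PAGE_SHIFT)

/* Number of pages needed to hold i_size bytes, rounding up. */
static unsigned long dp_model_dir_pages(long long i_size)
{
	assert(i_size >= 0);
	return (unsigned long)(i_size + DP_PAGE_SIZE - 1) >> DP_PAGE_SHIFT;
}
```

/*
 * Adding PAGE_SIZE - 1 before shifting means a partially filled final page
 * still counts as a whole page, while an exact multiple is not over-counted.
 */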

/**
 * folio_mkwrite_check_truncate - check if folio was truncated
 * @folio: the folio to check
 * @inode: the inode to check the folio against
 *
 * Return: the number of bytes in the folio up to EOF,
 * or -EFAULT if the folio was truncated.
 */
static inline ssize_t folio_mkwrite_check_truncate(struct folio *folio,
					      struct inode *inode)
{
	loff_t size = i_size_read(inode);
	pgoff_t index = size >> PAGE_SHIFT;
	size_t offset = offset_in_folio(folio, size);

	if (!folio->mapping)
		return -EFAULT;

	/* folio is wholly inside EOF */
	if (folio_next_index(folio) - 1 < index)
		return folio_size(folio);
	/* folio is wholly past EOF */
	if (folio->index > index || !offset)
		return -EFAULT;
	/* folio is partially inside EOF */
	return offset;
}

/**
 * page_mkwrite_check_truncate - check if page was truncated
 * @page: the page to check
 * @inode: the inode to check the page against
 *
 * Returns the number of bytes in the page up to EOF,
 * or -EFAULT if the page was truncated.
 */
static inline int page_mkwrite_check_truncate(struct page *page,
					      struct inode *inode)
{
	loff_t size = i_size_read(inode);
	pgoff_t index = size >> PAGE_SHIFT;
	int offset = offset_in_page(size);

	if (page->mapping != inode->i_mapping)
		return -EFAULT;

	/* page is wholly inside EOF */
	if (page->index < index)
		return PAGE_SIZE;
	/* page is wholly past EOF */
	if (page->index > index || !offset)
		return -EFAULT;
	/* page is partially inside EOF */
	return offset;
}
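
/*
 * Illustration only, not part of the header: a userspace sketch of the three
 * cases that page_mkwrite_check_truncate() distinguishes. The names
 * TR_PAGE_SHIFT, TR_PAGE_SIZE and tr_model_check() are hypothetical, a 4 KiB
 * page size is assumed, and the mapping comparison is reduced to a flag
 * supplied by the caller.
 */

```c
#include <assert.h>
#include <errno.h>

#define TR_PAGE_SHIFT	12			/* assumption: 4 KiB pages */
#define TR_PAGE_SIZE	(1L << TR_PAGE_SHIFT)

/*
 * page_index:   index of the page being made writable
 * size:         current file size in bytes
 * still_mapped: nonzero if the page still belongs to the inode's mapping
 */
static int tr_model_check(unsigned long page_index, long long size,
			  int still_mapped)
{
	unsigned long index;
	int offset;

	assert(size >= 0);
	index = (unsigned long)(size >> TR_PAGE_SHIFT);
	offset = (int)(size & (TR_PAGE_SIZE - 1));

	if (!still_mapped)
		return -EFAULT;

	/* page is wholly inside EOF */
	if (page_index < index)
		return (int)TR_PAGE_SIZE;
	/* page is wholly past EOF */
	if (page_index > index || !offset)
		return -EFAULT;
	/* page is partially inside EOF */
	return offset;
}
```

/*
 * With a file size of 10000 bytes under these assumptions, page 0 is fully
 * valid, page 2 holds the final 1808 bytes, and page 3 lies past EOF.
 */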

/**
 * i_blocks_per_folio - How many blocks fit in this folio.
 * @inode: The inode which contains the blocks.
 * @folio: The folio.
 *
 * If the block size is larger than the size of this folio, return zero.
 *
 * Context: The caller should hold a refcount on the folio to prevent it
 * from being split.
 * Return: The number of filesystem blocks covered by this folio.
 */
static inline
unsigned int i_blocks_per_folio(struct inode *inode, struct folio *folio)
{
	return folio_size(folio) >> inode->i_blkbits;
}
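
/*
 * Illustration only, not part of the header: a userspace sketch of the shift
 * used by i_blocks_per_folio(). bpf_model_blocks() is a hypothetical name;
 * folio_bytes stands in for folio_size() and blkbits for inode->i_blkbits.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Number of (1 << blkbits)-byte blocks that fit in folio_bytes bytes. */
static unsigned int bpf_model_blocks(size_t folio_bytes, unsigned int blkbits)
{
	assert(blkbits < sizeof(size_t) * 8);	/* keep the shift well defined */
	return (unsigned int)(folio_bytes >> blkbits);
}
```

/*
 * A block larger than the folio shifts the size down to zero, which is the
 * "return zero" case described in the kernel-doc comment above.
 */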

static inline
unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
{
	return i_blocks_per_folio(inode, page_folio(page));
}
#endif /* _LINUX_PAGEMAP_H */