
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the performance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
   Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd,
   permitting userspace to write-protect unpopulated ptes in anonymous
   memory.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen gives userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also converted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_pages(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page reclaim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
Linus Torvalds 2023-04-27 19:42:02 -07:00
commit 7fa8a8ee94
306 changed files with 11567 additions and 7985 deletions


@ -51,3 +51,11 @@ Description: Control merging pages across different NUMA nodes.
When it is set to 0 only pages from the same node are merged,
otherwise pages from all nodes can be merged together (default).
What: /sys/kernel/mm/ksm/general_profit
Date: April 2023
KernelVersion: 6.4
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: Measure how effective KSM is.
general_profit: how effective is KSM. The formula for the
calculation is in Documentation/admin-guide/mm/ksm.rst.
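
A minimal userspace sketch of the new knobs (the PR_SET_MEMORY_MERGE prctl
comes from the per-process KSM series; its numeric value here is an
assumption if your uapi headers predate 6.4):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67	/* assumed value from the 6.4 uapi headers */
#endif

int main(void)
{
	char buf[64];
	FILE *f = fopen("/sys/kernel/mm/ksm/general_profit", "r");

	/* Opt the whole process into KSM rather than calling
	 * madvise(MADV_MERGEABLE) on individual VMAs. */
	if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
		perror("PR_SET_MEMORY_MERGE");

	/* The new sysfs knob reports how effective KSM is overall; the exact
	 * formula is documented in Documentation/admin-guide/mm/ksm.rst. */
	if (f && fgets(buf, sizeof(buf), f))
		printf("general_profit: %s", buf);
	return 0;
}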


@ -172,7 +172,7 @@ variables.
Offset of the free_list's member. This value is used to compute the number
of free pages.
Each zone has a free_area structure array called free_area[MAX_ORDER].
Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
The free_list represents a linked list of free page blocks.
(list_head, next|prev)
@ -189,8 +189,8 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
information. Makedumpfile gets the start address of the vmalloc region
from this.
(zone.free_area, MAX_ORDER)
---------------------------
(zone.free_area, MAX_ORDER + 1)
-------------------------------
Free areas descriptor. User-space tools use this value to iterate the
free_area ranges. MAX_ORDER is used by the zone buddy allocator.
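
A kernel-style sketch of what the inclusive definition means for code that
walks the buddy lists (zone_nr_free_pages() is a hypothetical helper; the
real iteration lives in the page allocator):

#include <linux/mmzone.h>

/* With the inclusive MAX_ORDER, free_area[] has MAX_ORDER + 1 entries and
 * order MAX_ORDER itself is a valid allocation order. */
static unsigned long zone_nr_free_pages(struct zone *zone)
{
	unsigned long nr_free = 0;
	unsigned int order;

	for (order = 0; order <= MAX_ORDER; order++)	/* previously "< MAX_ORDER" */
		nr_free += zone->free_area[order].nr_free;
	return nr_free;
}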


@ -4012,7 +4012,7 @@
[KNL] Minimal page reporting order
Format: <integer>
Adjust the minimal page reporting order. The page
reporting is disabled when it exceeds (MAX_ORDER-1).
reporting is disabled when it exceeds MAX_ORDER.
panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting


@ -157,6 +157,8 @@ stable_node_chains_prune_millisecs
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
general_profit
how effective is KSM. The calculation is explained below.
pages_shared
how many shared pages are being used
pages_sharing
@ -207,7 +209,8 @@ several times, which are unprofitable memory consumed.
ksm_rmap_items * sizeof(rmap_item).
where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``.
and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
From the perspective of application, a high ratio of ``ksm_rmap_items`` to
``ksm_merging_pages`` means a bad madvise-applied policy, so developers or


@ -219,6 +219,31 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
used.
Userfaultfd write-protect mode currently behave differently on none ptes
(when e.g. page is missing) over different types of memories.
For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
(e.g. when pages are missing and not populated). For file-backed memories
like shmem and hugetlbfs, none ptes will be write protected just like a
present pte. In other words, there will be a userfaultfd write fault
message generated when writing to a missing page on file typed memories,
as long as the page range was write-protected before. Such a message will
not be generated on anonymous memories by default.
If the application wants to be able to write protect none ptes on anonymous
memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On
newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED
and set the feature bit in advance to make sure none ptes will also be
write protected even upon anonymous memory.
When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either
``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when
resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE``
respectively, it may be desirable for the new page / mapping to be
write-protected (so future writes will also result in a WP fault). These ioctls
support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
respectively) to configure the mapping this way.
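
A minimal userspace sketch of the anonymous-memory case described above,
assuming a kernel that advertises UFFD_FEATURE_WP_UNPOPULATED (error
handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Write-protect an anonymous range, including its not-yet-populated ptes,
 * by opting in to UFFD_FEATURE_WP_UNPOPULATED up front. */
static int uffd_wp_anon_range(void *addr, size_t len)
{
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_WP_UNPOPULATED,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0 ||
	    ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) ||
	    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp)) {
		perror("uffd-wp setup");
		return -1;
	}
	return uffd;	/* poll this fd for UFFD_PAGEFAULT_FLAG_WP events */
}
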
QEMU/KVM
========


@ -575,20 +575,26 @@ The field width is passed by value, the bitmap is passed by reference.
Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease
printing cpumask and nodemask.
Flags bitfields such as page flags, gfp_flags
---------------------------------------------
Flags bitfields such as page flags, page_type, gfp_flags
--------------------------------------------------------
::
%pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff)
%pGt 0xffffff7f(buddy)
%pGg GFP_USER|GFP_DMA32|GFP_NOWARN
%pGv read|exec|mayread|maywrite|mayexec|denywrite
For printing flags bitfields as a collection of symbolic constants that
would construct the value. The type of flags is given by the third
character. Currently supported are [p]age flags, [v]ma_flags (both
expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag
names and print order depends on the particular type.
character. Currently supported are:
- p - [p]age flags, expects value of type (``unsigned long *``)
- t - page [t]ype, expects value of type (``unsigned int *``)
- v - [v]ma_flags, expects value of type (``unsigned long *``)
- g - [g]fp_flags, expects value of type (``gfp_t *``)
The flag names and print order depends on the particular type.
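
For example, from kernel code (a sketch in the style of dump_page(); page is
a struct page *):

	pr_warn("flags: %pGp\n", &page->flags);		/* decodes page->flags */
	pr_warn("page_type: %pGt\n", &page->page_type);	/* decodes the page_type word */
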
Note that this format should not be used directly in the
:c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags()


@ -645,7 +645,7 @@ ops mmap_lock PageLocked(page)
open: yes
close: yes
fault: yes can return with page locked
map_pages: yes
map_pages: read
page_mkwrite: yes can return with page locked
pfn_mkwrite: yes
access: yes
@ -661,7 +661,7 @@ locked. The VM will unlock the page.
->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from "start_pgoff"
till "end_pgoff". ->map_pages() is called with page table locked and must
till "end_pgoff". ->map_pages() is called with the RCU lock held and must
not block. If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
page table entry. Pointer to entry associated with the page is passed in
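
A sketch of how a filesystem typically satisfies this contract by wiring up
the generic helpers (foo_page_mkwrite is a placeholder):

#include <linux/mm.h>

static const struct vm_operations_struct foo_file_vm_ops = {
	.fault		= filemap_fault,	/* may sleep, may return with the page locked */
	.map_pages	= filemap_map_pages,	/* called under RCU, never blocks */
	.page_mkwrite	= foo_page_mkwrite,	/* placeholder */
};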


@ -996,6 +996,7 @@ Example output. You may not have all of these fields.
VmallocUsed: 40444 kB
VmallocChunk: 0 kB
Percpu: 29312 kB
EarlyMemtestBad: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 4149248 kB
ShmemHugePages: 0 kB
@ -1146,6 +1147,13 @@ VmallocChunk
Percpu
Memory allocated to the percpu allocator used to back percpu
allocations. This stat excludes the cost of metadata.
EarlyMemtestBad
The amount of RAM/memory in kB, that was identified as corrupted
by early memtest. If memtest was not run, this field will not
be displayed at all. Size is never rounded down to 0 kB.
That means if 0 kB is reported, you can safely assume
there was at least one pass of memtest and none of the passes
found a single faulty byte of RAM.
HardwareCorrupted
The amount of RAM/memory in KB, the kernel identifies as
corrupted.


@ -13,17 +13,29 @@ everything stored therein is lost.
tmpfs puts everything into the kernel internal caches and grows and
shrinks to accommodate the files it contains and is able to swap
unneeded pages out to swap space. It has maximum size limits which can
be adjusted on the fly via 'mount -o remount ...'
unneeded pages out to swap space, if swap was enabled for the tmpfs
mount. tmpfs also supports THP.
If you compare it to ramfs (which was the template to create tmpfs)
you gain swapping and limit checking. Another similar thing is the RAM
disk (/dev/ram*), which simulates a fixed size hard disk in physical
RAM, where you have to create an ordinary filesystem on top. Ramdisks
cannot swap and you do not have the possibility to resize them.
tmpfs extends ramfs with a few userspace configurable options listed and
explained further below, some of which can be reconfigured dynamically on the
fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
filesystem can be resized but it cannot be resized to a size below its current
usage. tmpfs also supports POSIX ACLs, and extended attributes for the
trusted.* and security.* namespaces. ramfs does not use swap and you cannot
modify any parameter for a ramfs filesystem. The size limit of a ramfs
filesystem is how much memory you have available, and so care must be taken if
used so to not run out of memory.
Since tmpfs lives completely in the page cache and on swap, all tmpfs
pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
An alternative to tmpfs and ramfs is to use brd to create RAM disks
(/dev/ram*), which allows you to simulate a block device disk in physical RAM.
To write data you would just then need to create an regular filesystem on top
this ramdisk. As with ramfs, brd ramdisks cannot swap. brd ramdisks are also
configured in size at initialization and you cannot dynamically resize them.
Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the
block layer at all.
Since tmpfs lives completely in the page cache and optionally on swap,
all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
free(1). Notice that these counters also include shared memory
(shmem, see ipcs(1)). The most reliable way to get the count is
using df(1) and du(1).
@ -72,6 +84,8 @@ nr_inodes The maximum number of inodes for this instance. The default
is half of the number of your physical RAM pages, or (on a
machine with highmem) the number of lowmem RAM pages,
whichever is the lower.
noswap Disables swap. Remounts must respect the original settings.
By default swap is enabled.
========= ============================================================
These parameters accept a suffix k, m or g for kilo, mega and giga and
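
A sketch of the new option from C, equivalent to `mount -t tmpfs -o noswap
tmpfs /mnt' (the size value here is arbitrary):

#include <sys/mount.h>

/* Mount a tmpfs instance that will never fall back to swap. */
int mount_tmpfs_noswap(const char *target)
{
	return mount("tmpfs", target, "tmpfs", 0, "noswap,size=1g");
}
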
@ -85,6 +99,36 @@ mount with such options, since it allows any user with write access to
use up all the memory on the machine; but enhances the scalability of
that instance in a system with many CPUs making intensive use of it.
tmpfs also supports Transparent Huge Pages which requires a kernel
configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge supported for
your system (has_transparent_hugepage(), which is architecture specific).
The mount options for this are:
====== ============================================================
huge=0 never: disables huge pages for the mount
huge=1 always: enables huge pages for the mount
huge=2 within_size: only allocate huge pages if the page will be
fully within i_size, also respect fadvise()/madvise() hints.
huge=3 advise: only allocate huge pages if requested with
fadvise()/madvise()
====== ============================================================
There is a sysfs file which you can also use to control system wide THP
configuration for all tmpfs mounts, the file is:
/sys/kernel/mm/transparent_hugepage/shmem_enabled
This sysfs file is placed on top of THP sysfs directory and so is registered
by THP code. It is however only used to control all tmpfs mounts with one
single knob. Since it controls all tmpfs mounts it should only be used either
for emergency or testing purposes. The values you can set for shmem_enabled are:
== ============================================================
-1 deny: disables huge on shm_mnt and all mounts, for
emergency use
-2 force: enables huge on shm_mnt and all mounts, w/o needing
option, for testing
== ============================================================
tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be


@ -2,6 +2,12 @@
Active MM
=========
Note, the mm_count refcount may no longer include the "lazy" users
(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
helpers, which abstract this config option.
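
A condensed sketch of the pattern, mirroring the powerpc exit_lazy_flush_tlb()
hunk further down (mm is the borrowed lazy-tlb mm being released):

	mmgrab_lazy_tlb(&init_mm);		/* take the new lazy-tlb reference */
	current->active_mm = &init_mm;
	switch_mm_irqs_off(mm, &init_mm, current);
	mmdrop_lazy_tlb(mm);			/* drop the old lazy-tlb reference */
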
::
List: linux-kernel


@ -214,7 +214,7 @@ HugeTLB Page Table Helpers
+---------------------------+--------------------------------------------------+
| pte_huge | Tests a HugeTLB |
+---------------------------+--------------------------------------------------+
| pte_mkhuge | Creates a HugeTLB |
| arch_make_huge_pte | Creates a HugeTLB |
+---------------------------+--------------------------------------------------+
| huge_pte_dirty | Tests a dirty HugeTLB |
+---------------------------+--------------------------------------------------+


@ -103,7 +103,8 @@ moving across tiers only involves atomic operations on
``folio->flags`` and therefore has a negligible cost. A feedback loop
modeled after the PID controller monitors refaults over all the tiers
from anon and file types and decides which tiers from which types to
evict or protect.
evict or protect. The desired effect is to balance refault percentages
between anon and file types proportional to the swappiness level.
There are two conceptually independent procedures: the aging and the
eviction. They form a closed-loop system, i.e., the page reclaim.
@ -156,6 +157,27 @@ This time-based approach has the following advantages:
and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.
``mm_struct`` list
------------------
An ``mm_struct`` list is maintained for each memcg, and an
``mm_struct`` follows its owner task to the new memcg when this task
is migrated.
A page table walker iterates ``lruvec_memcg()->mm_list`` and calls
``walk_page_range()`` with each ``mm_struct`` on this list to scan
PTEs. When multiple page table walkers iterate the same list, each of
them gets a unique ``mm_struct``, and therefore they can run in
parallel.
Page table walkers ignore any misplaced pages, e.g., if an
``mm_struct`` was migrated, pages left in the previous memcg will be
ignored when the current memcg is under reclaim. Similarly, page table
walkers will ignore pages from nodes other than the one under reclaim.
This infrastructure also tracks the usage of ``mm_struct`` between
context switches so that page table walkers can skip processes that
have been sleeping since the last iteration.
Rmap/PT walk feedback
---------------------
Searching the rmap for PTEs mapping each page on an LRU list (to test
@ -170,7 +192,7 @@ promotes hot pages. If the scan was done cacheline efficiently, it
adds the PMD entry pointing to the PTE table to the Bloom filter. This
forms a feedback loop between the eviction and the aging.
Bloom Filters
Bloom filters
-------------
Bloom filters are a space and memory efficient data structure for set
membership test, i.e., test if an element is not in the set or may be
@ -186,6 +208,18 @@ is false positive, the cost is an additional scan of a range of PTEs,
which may yield hot pages anyway. Parameters of the filter itself can
control the false positive rate in the limit.
PID controller
--------------
A feedback loop modeled after the Proportional-Integral-Derivative
(PID) controller monitors refaults over anon and file types and
decides which type to evict when both types are available from the
same generation.
The PID controller uses generations rather than the wall clock as the
time domain because a CPU can scan pages at different rates under
varying memory pressure. It calculates a moving average for each new
generation to avoid being permanently locked in a suboptimal state.
Memcg LRU
---------
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
@ -223,9 +257,9 @@ parts:
* Generations
* Rmap walks
* Page table walks
* Bloom filters
* PID controller
* Page table walks via ``mm_struct`` list
* Bloom filters for rmap/PT walk feedback
* PID controller for refault feedback
The aging and the eviction form a producer-consumer model;
specifically, the latter drives the former by the sliding window over


@ -42,6 +42,8 @@ The unevictable list addresses the following classes of unevictable pages:
* Those owned by ramfs.
* Those owned by tmpfs with the noswap mount option.
* Those mapped into SHM_LOCK'd shared memory regions.
* Those mapped into VM_LOCKED [mlock()ed] VMAs.


@ -13457,13 +13457,14 @@ F: arch/powerpc/include/asm/membarrier.h
F: include/uapi/linux/membarrier.h
F: kernel/sched/membarrier.c
MEMBLOCK
MEMBLOCK AND MEMORY MANAGEMENT INITIALIZATION
M: Mike Rapoport <rppt@kernel.org>
L: linux-mm@kvack.org
S: Maintained
F: Documentation/core-api/boot-time-mm.rst
F: include/linux/memblock.h
F: mm/memblock.c
F: mm/mm_init.c
F: tools/testing/memblock/
MEMORY CONTROLLER DRIVERS
@ -13498,6 +13499,7 @@ F: include/linux/memory_hotplug.h
F: include/linux/mm.h
F: include/linux/mmzone.h
F: include/linux/pagewalk.h
F: include/trace/events/ksm.h
F: mm/
F: tools/mm/
F: tools/testing/selftests/mm/
@ -13506,6 +13508,7 @@ VMALLOC
M: Andrew Morton <akpm@linux-foundation.org>
R: Uladzislau Rezki <urezki@gmail.com>
R: Christoph Hellwig <hch@infradead.org>
R: Lorenzo Stoakes <lstoakes@gmail.com>
L: linux-mm@kvack.org
S: Maintained
W: http://www.linux-mm.org


@ -465,6 +465,38 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
irqs disabled over activate_mm. Architectures that do IPI based TLB
shootdowns should enable this.
# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
# to/from kernel threads when the same mm is running on a lot of CPUs (a large
# multi-threaded application), by reducing contention on the mm refcount.
#
# This can be disabled if the architecture ensures no CPUs are using an mm as a
# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
# final exit(2) TLB flush, for example.
#
# To implement this, an arch *must*:
# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when manipulating
# the lazy tlb reference of a kthread's ->active_mm (non-arch code has been
# converted already).
config MMU_LAZY_TLB_REFCOUNT
def_bool y
depends on !MMU_LAZY_TLB_SHOOTDOWN
# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
# mm as a lazy tlb beyond its last reference count, by shooting down these
# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
# be using the mm as a lazy tlb, so that they may switch themselves to using
# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
# may be using mm as a lazy tlb mm.
#
# To implement this, an arch *must*:
# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
# at least all possible CPUs in which the mm is lazy.
# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
config MMU_LAZY_TLB_SHOOTDOWN
bool
config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool


@ -556,7 +556,7 @@ endmenu # "ARC Architecture Configuration"
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
default "12" if ARC_HUGEPAGE_16M
default "11"
default "11" if ARC_HUGEPAGE_16M
default "10"
source "kernel/power/Kconfig"


@ -74,11 +74,6 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
base, TO_MB(size), !in_use ? "Not used":"");
}
bool arch_has_descending_max_zone_pfns(void)
{
return !IS_ENABLED(CONFIG_ARC_HAS_PAE40);
}
/*
* First memory setup routine called from setup_arch()
* 1. setup swapper's mm @init_mm


@ -1352,20 +1352,19 @@ config ARM_MODULE_PLTS
configurations. If unsure, say y.
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
default "12" if SOC_AM33XX
default "9" if SA1111
default "11"
int "Order of maximal physically contiguous allocations"
default "11" if SOC_AM33XX
default "8" if SA1111
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
increase this value.
The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.
This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
Don't change if unsure.
config ALIGNMENT_TRAP
def_bool CPU_CP15_MMU


@ -31,7 +31,7 @@ CONFIG_SOC_VF610=y
CONFIG_SMP=y
CONFIG_ARM_PSCI=y
CONFIG_HIGHMEM=y
CONFIG_ARCH_FORCE_MAX_ORDER=14
CONFIG_ARCH_FORCE_MAX_ORDER=13
CONFIG_CMDLINE="noinitrd console=ttymxc0,115200"
CONFIG_KEXEC=y
CONFIG_CPU_FREQ=y


@ -26,7 +26,7 @@ CONFIG_THUMB2_KERNEL=y
# CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11 is not set
# CONFIG_ARM_PATCH_IDIV is not set
CONFIG_HIGHMEM=y
CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_ARCH_FORCE_MAX_ORDER=11
CONFIG_SECCOMP=y
CONFIG_KEXEC=y
CONFIG_EFI=y


@ -20,7 +20,7 @@ CONFIG_PXA_SHARPSL=y
CONFIG_MACH_AKITA=y
CONFIG_MACH_BORZOI=y
CONFIG_AEABI=y
CONFIG_ARCH_FORCE_MAX_ORDER=9
CONFIG_ARCH_FORCE_MAX_ORDER=8
CONFIG_CMDLINE="root=/dev/ram0 ro"
CONFIG_KEXEC=y
CONFIG_CPU_FREQ=y


@ -19,7 +19,7 @@ CONFIG_ATMEL_CLOCKSOURCE_TCB=y
# CONFIG_CACHE_L2X0 is not set
# CONFIG_ARM_PATCH_IDIV is not set
# CONFIG_CPU_SW_DOMAIN_PAN is not set
CONFIG_ARCH_FORCE_MAX_ORDER=15
CONFIG_ARCH_FORCE_MAX_ORDER=14
CONFIG_UACCESS_WITH_MEMCPY=y
# CONFIG_ATAGS is not set
CONFIG_CMDLINE="console=ttyS0,115200 earlyprintk ignore_loglevel"


@ -17,7 +17,7 @@ CONFIG_ARCH_SUNPLUS=y
# CONFIG_VDSO is not set
CONFIG_SMP=y
CONFIG_THUMB2_KERNEL=y
CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_ARCH_FORCE_MAX_ORDER=11
CONFIG_VFP=y
CONFIG_NEON=y
CONFIG_MODULES=y


@ -253,7 +253,7 @@ static int ecard_init_mm(void)
current->mm = mm;
current->active_mm = mm;
activate_mm(active_mm, mm);
mmdrop(active_mm);
mmdrop_lazy_tlb(active_mm);
ecard_init_pgtables(mm);
return 0;
}


@ -95,6 +95,7 @@ config ARM64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
@ -1505,39 +1506,34 @@ config XEN
# include/linux/mmzone.h requires the following to be true:
#
# MAX_ORDER - 1 + PAGE_SHIFT <= SECTION_SIZE_BITS
# MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
#
# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS + 1 - PAGE_SHIFT:
# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT:
#
# | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_ORDER | default MAX_ORDER |
# ----+-------------------+--------------+-----------------+--------------------+
# 4K | 27 | 12 | 16 | 11 |
# 16K | 27 | 14 | 14 | 12 |
# 64K | 29 | 16 | 14 | 14 |
# 4K | 27 | 12 | 15 | 10 |
# 16K | 27 | 14 | 13 | 11 |
# 64K | 29 | 16 | 13 | 13 |
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order" if ARM64_4K_PAGES || ARM64_16K_PAGES
default "14" if ARM64_64K_PAGES
range 12 14 if ARM64_16K_PAGES
default "12" if ARM64_16K_PAGES
range 11 16 if ARM64_4K_PAGES
default "11"
int "Order of maximal physically contiguous allocations" if EXPERT && (ARM64_4K_PAGES || ARM64_16K_PAGES)
default "13" if ARM64_64K_PAGES
default "11" if ARM64_16K_PAGES
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
increase this value.
The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.
This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
The maximal size of allocation cannot exceed the size of the
section, so the value of MAX_ORDER should satisfy
We make sure that we can allocate up to a HugePage size for each configuration.
Hence we have :
MAX_ORDER = (PMD_SHIFT - PAGE_SHIFT) + 1 => PAGE_SHIFT - 2
MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
However for 4K, we choose a higher default value, 11 as opposed to 10, giving us
4M allocations matching the default size used by generic code.
Don't change if unsure.
config UNMAP_KERNEL_AT_EL0
bool "Unmap kernel when running in userspace (aka \"KAISER\")" if EXPERT


@ -261,9 +261,11 @@ static inline const void *__tag_set(const void *addr, u8 tag)
}
#ifdef CONFIG_KASAN_HW_TAGS
#define arch_enable_tagging_sync() mte_enable_kernel_sync()
#define arch_enable_tagging_async() mte_enable_kernel_async()
#define arch_enable_tagging_asymm() mte_enable_kernel_asymm()
#define arch_enable_tag_checks_sync() mte_enable_kernel_sync()
#define arch_enable_tag_checks_async() mte_enable_kernel_async()
#define arch_enable_tag_checks_asymm() mte_enable_kernel_asymm()
#define arch_suppress_tag_checks_start() mte_enable_tco()
#define arch_suppress_tag_checks_stop() mte_disable_tco()
#define arch_force_async_tag_fault() mte_check_tfsr_exit()
#define arch_get_random_tag() mte_get_random_tag()
#define arch_get_mem_tag(addr) mte_get_mem_tag(addr)


@ -13,8 +13,73 @@
#include <linux/types.h>
#ifdef CONFIG_KASAN_HW_TAGS
/* Whether the MTE asynchronous mode is enabled. */
DECLARE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
static inline bool system_uses_mte_async_or_asymm_mode(void)
{
return static_branch_unlikely(&mte_async_or_asymm_mode);
}
#else /* CONFIG_KASAN_HW_TAGS */
static inline bool system_uses_mte_async_or_asymm_mode(void)
{
return false;
}
#endif /* CONFIG_KASAN_HW_TAGS */
#ifdef CONFIG_ARM64_MTE
/*
* The Tag Check Flag (TCF) mode for MTE is per EL, hence TCF0
* affects EL0 and TCF affects EL1 irrespective of which TTBR is
* used.
* The kernel accesses TTBR0 usually with LDTR/STTR instructions
* when UAO is available, so these would act as EL0 accesses using
* TCF0.
* However futex.h code uses exclusives which would be executed as
* EL1, this can potentially cause a tag check fault even if the
* user disables TCF0.
*
* To address the problem we set the PSTATE.TCO bit in uaccess_enable()
* and reset it in uaccess_disable().
*
* The Tag check override (TCO) bit disables temporarily the tag checking
* preventing the issue.
*/
static inline void mte_disable_tco(void)
{
asm volatile(ALTERNATIVE("nop", SET_PSTATE_TCO(0),
ARM64_MTE, CONFIG_KASAN_HW_TAGS));
}
static inline void mte_enable_tco(void)
{
asm volatile(ALTERNATIVE("nop", SET_PSTATE_TCO(1),
ARM64_MTE, CONFIG_KASAN_HW_TAGS));
}
/*
* These functions disable tag checking only if in MTE async mode
* since the sync mode generates exceptions synchronously and the
* nofault or load_unaligned_zeropad can handle them.
*/
static inline void __mte_disable_tco_async(void)
{
if (system_uses_mte_async_or_asymm_mode())
mte_disable_tco();
}
static inline void __mte_enable_tco_async(void)
{
if (system_uses_mte_async_or_asymm_mode())
mte_enable_tco();
}
/*
* These functions are meant to be only used from KASAN runtime through
* the arch_*() interface defined in asm/memory.h.
@ -138,6 +203,22 @@ void mte_enable_kernel_asymm(void);
#else /* CONFIG_ARM64_MTE */
static inline void mte_disable_tco(void)
{
}
static inline void mte_enable_tco(void)
{
}
static inline void __mte_disable_tco_async(void)
{
}
static inline void __mte_enable_tco_async(void)
{
}
static inline u8 mte_get_ptr_tag(void *ptr)
{
return 0xFF;


@ -178,14 +178,6 @@ static inline void mte_disable_tco_entry(struct task_struct *task)
}
#ifdef CONFIG_KASAN_HW_TAGS
/* Whether the MTE asynchronous mode is enabled. */
DECLARE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
static inline bool system_uses_mte_async_or_asymm_mode(void)
{
return static_branch_unlikely(&mte_async_or_asymm_mode);
}
void mte_check_tfsr_el1(void);
static inline void mte_check_tfsr_entry(void)
@ -212,10 +204,6 @@ static inline void mte_check_tfsr_exit(void)
mte_check_tfsr_el1();
}
#else
static inline bool system_uses_mte_async_or_asymm_mode(void)
{
return false;
}
static inline void mte_check_tfsr_el1(void)
{
}


@ -57,7 +57,7 @@ static inline bool arch_thp_swp_supported(void)
* fault on one CPU which has been handled concurrently by another CPU
* does not need to perform additional invalidation.
*/
#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
/*
* ZERO_PAGE is a global shared page that is always zero: used


@ -10,7 +10,7 @@
/*
* Section size must be at least 512MB for 64K base
* page size config. Otherwise it will be less than
* (MAX_ORDER - 1) and the build process will fail.
* MAX_ORDER and the build process will fail.
*/
#ifdef CONFIG_ARM64_64K_PAGES
#define SECTION_SIZE_BITS 29


@ -136,55 +136,9 @@ static inline void __uaccess_enable_hw_pan(void)
CONFIG_ARM64_PAN));
}
/*
* The Tag Check Flag (TCF) mode for MTE is per EL, hence TCF0
* affects EL0 and TCF affects EL1 irrespective of which TTBR is
* used.
* The kernel accesses TTBR0 usually with LDTR/STTR instructions
* when UAO is available, so these would act as EL0 accesses using
* TCF0.
* However futex.h code uses exclusives which would be executed as
* EL1, this can potentially cause a tag check fault even if the
* user disables TCF0.
*
* To address the problem we set the PSTATE.TCO bit in uaccess_enable()
* and reset it in uaccess_disable().
*
* The Tag check override (TCO) bit disables temporarily the tag checking
* preventing the issue.
*/
static inline void __uaccess_disable_tco(void)
{
asm volatile(ALTERNATIVE("nop", SET_PSTATE_TCO(0),
ARM64_MTE, CONFIG_KASAN_HW_TAGS));
}
static inline void __uaccess_enable_tco(void)
{
asm volatile(ALTERNATIVE("nop", SET_PSTATE_TCO(1),
ARM64_MTE, CONFIG_KASAN_HW_TAGS));
}
/*
* These functions disable tag checking only if in MTE async mode
* since the sync mode generates exceptions synchronously and the
* nofault or load_unaligned_zeropad can handle them.
*/
static inline void __uaccess_disable_tco_async(void)
{
if (system_uses_mte_async_or_asymm_mode())
__uaccess_disable_tco();
}
static inline void __uaccess_enable_tco_async(void)
{
if (system_uses_mte_async_or_asymm_mode())
__uaccess_enable_tco();
}
static inline void uaccess_disable_privileged(void)
{
__uaccess_disable_tco();
mte_disable_tco();
if (uaccess_ttbr0_disable())
return;
@ -194,7 +148,7 @@ static inline void uaccess_disable_privileged(void)
static inline void uaccess_enable_privileged(void)
{
__uaccess_enable_tco();
mte_enable_tco();
if (uaccess_ttbr0_enable())
return;
@ -302,8 +256,8 @@ do { \
#define get_user __get_user
/*
* We must not call into the scheduler between __uaccess_enable_tco_async() and
* __uaccess_disable_tco_async(). As `dst` and `src` may contain blocking
* We must not call into the scheduler between __mte_enable_tco_async() and
* __mte_disable_tco_async(). As `dst` and `src` may contain blocking
* functions, we must evaluate these outside of the critical section.
*/
#define __get_kernel_nofault(dst, src, type, err_label) \
@ -312,10 +266,10 @@ do { \
__typeof__(src) __gkn_src = (src); \
int __gkn_err = 0; \
\
__uaccess_enable_tco_async(); \
__mte_enable_tco_async(); \
__raw_get_mem("ldr", *((type *)(__gkn_dst)), \
(__force type *)(__gkn_src), __gkn_err, K); \
__uaccess_disable_tco_async(); \
__mte_disable_tco_async(); \
\
if (unlikely(__gkn_err)) \
goto err_label; \
@ -388,8 +342,8 @@ do { \
#define put_user __put_user
/*
* We must not call into the scheduler between __uaccess_enable_tco_async() and
* __uaccess_disable_tco_async(). As `dst` and `src` may contain blocking
* We must not call into the scheduler between __mte_enable_tco_async() and
* __mte_disable_tco_async(). As `dst` and `src` may contain blocking
* functions, we must evaluate these outside of the critical section.
*/
#define __put_kernel_nofault(dst, src, type, err_label) \
@ -398,10 +352,10 @@ do { \
__typeof__(src) __pkn_src = (src); \
int __pkn_err = 0; \
\
__uaccess_enable_tco_async(); \
__mte_enable_tco_async(); \
__raw_put_mem("str", *((type *)(__pkn_src)), \
(__force type *)(__pkn_dst), __pkn_err, K); \
__uaccess_disable_tco_async(); \
__mte_disable_tco_async(); \
\
if (unlikely(__pkn_err)) \
goto err_label; \


@ -55,7 +55,7 @@ static inline unsigned long load_unaligned_zeropad(const void *addr)
{
unsigned long ret;
__uaccess_enable_tco_async();
__mte_enable_tco_async();
/* Load word from unaligned pointer addr */
asm(
@ -65,7 +65,7 @@ static inline unsigned long load_unaligned_zeropad(const void *addr)
: "=&r" (ret)
: "r" (addr), "Q" (*(unsigned long *)addr));
__uaccess_disable_tco_async();
__mte_disable_tco_async();
return ret;
}


@ -16,7 +16,7 @@ struct hyp_pool {
* API at EL2.
*/
hyp_spinlock_t lock;
struct list_head free_area[MAX_ORDER];
struct list_head free_area[MAX_ORDER + 1];
phys_addr_t range_start;
phys_addr_t range_end;
unsigned short max_order;


@ -110,7 +110,7 @@ static void __hyp_attach_page(struct hyp_pool *pool,
* after coalescing, so make sure to mark it HYP_NO_ORDER proactively.
*/
p->order = HYP_NO_ORDER;
for (; (order + 1) < pool->max_order; order++) {
for (; (order + 1) <= pool->max_order; order++) {
buddy = __find_buddy_avail(pool, p, order);
if (!buddy)
break;
@ -203,9 +203,9 @@ void *hyp_alloc_pages(struct hyp_pool *pool, unsigned short order)
hyp_spin_lock(&pool->lock);
/* Look for a high-enough-order page */
while (i < pool->max_order && list_empty(&pool->free_area[i]))
while (i <= pool->max_order && list_empty(&pool->free_area[i]))
i++;
if (i >= pool->max_order) {
if (i > pool->max_order) {
hyp_spin_unlock(&pool->lock);
return NULL;
}
@ -228,8 +228,8 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
int i;
hyp_spin_lock_init(&pool->lock);
pool->max_order = min(MAX_ORDER, get_order((nr_pages + 1) << PAGE_SHIFT));
for (i = 0; i < pool->max_order; i++)
pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
for (i = 0; i <= pool->max_order; i++)
INIT_LIST_HEAD(&pool->free_area[i]);
pool->range_start = phys;
pool->range_end = phys + (nr_pages << PAGE_SHIFT);


@ -535,6 +535,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
unsigned long vm_flags;
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
unsigned long addr = untagged_addr(far);
#ifdef CONFIG_PER_VMA_LOCK
struct vm_area_struct *vma;
#endif
if (kprobe_page_fault(regs, esr))
return 0;
@ -585,6 +588,36 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
#ifdef CONFIG_PER_VMA_LOCK
if (!(mm_flags & FAULT_FLAG_USER))
goto lock_mmap;
vma = lock_vma_under_rcu(mm, addr);
if (!vma)
goto lock_mmap;
if (!(vma->vm_flags & vm_flags)) {
vma_end_read(vma);
goto lock_mmap;
}
fault = handle_mm_fault(vma, addr & PAGE_MASK,
mm_flags | FAULT_FLAG_VMA_LOCK, regs);
vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
goto done;
}
count_vm_vma_lock_event(VMA_LOCK_RETRY);
/* Quick path to respond to signals */
if (fault_signal_pending(fault, regs)) {
if (!user_mode(regs))
goto no_context;
return 0;
}
lock_mmap:
#endif /* CONFIG_PER_VMA_LOCK */
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
@ -628,6 +661,9 @@ retry:
}
mmap_read_unlock(mm);
#ifdef CONFIG_PER_VMA_LOCK
done:
#endif
/*
* Handle the "normal" (no error) case first.
*/


@ -332,10 +332,6 @@ config HIGHMEM
select KMAP_LOCAL
default y
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
default "11"
config DRAM_BASE
hex "DRAM start addr (the same with memory-section in dts)"
default 0x0


@ -203,10 +203,9 @@ config IA64_CYCLONE
If you're unsure, answer N.
config ARCH_FORCE_MAX_ORDER
int "MAX_ORDER (11 - 17)" if !HUGETLB_PAGE
range 11 17 if !HUGETLB_PAGE
default "17" if HUGETLB_PAGE
default "11"
int
default "16" if HUGETLB_PAGE
default "10"
config SMP
bool "Symmetric multi-processing support"


@ -12,9 +12,9 @@
#define SECTION_SIZE_BITS (30)
#define MAX_PHYSMEM_BITS (50)
#ifdef CONFIG_ARCH_FORCE_MAX_ORDER
#if ((CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS)
#if (CONFIG_ARCH_FORCE_MAX_ORDER + PAGE_SHIFT > SECTION_SIZE_BITS)
#undef SECTION_SIZE_BITS
#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT)
#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER + PAGE_SHIFT)
#endif
#endif


@ -170,7 +170,7 @@ static int __init hugetlb_setup_sz(char *str)
size = memparse(str, &str);
if (*str || !is_power_of_2(size) || !(tr_pages & size) ||
size <= PAGE_SIZE ||
size >= (1UL << PAGE_SHIFT << MAX_ORDER)) {
size > (1UL << PAGE_SHIFT << MAX_ORDER)) {
printk(KERN_WARNING "Invalid huge page size specified\n");
return 1;
}


@ -53,8 +53,8 @@ config LOONGARCH
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANT_OPTIMIZE_VMEMMAP
select ARCH_WANTS_NO_INSTR
select BUILDTIME_TABLE_SORT
select COMMON_CLK
@ -421,12 +421,9 @@ config NODES_SHIFT
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 14 64 if PAGE_SIZE_64KB
default "14" if PAGE_SIZE_64KB
range 12 64 if PAGE_SIZE_16KB
default "12" if PAGE_SIZE_16KB
range 11 64
default "11"
default "13" if PAGE_SIZE_64KB
default "11" if PAGE_SIZE_16KB
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
@ -435,9 +432,6 @@ config ARCH_FORCE_MAX_ORDER
blocks of physically contiguous memory, then you may need to
increase this value.
This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
The page size is not necessarily 4KB. Keep this in mind
when choosing a value for this option.


@ -397,23 +397,22 @@ config SINGLE_MEMORY_CHUNK
Say N if not sure.
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order" if ADVANCED
int "Order of maximal physically contiguous allocations" if ADVANCED
depends on !SINGLE_MEMORY_CHUNK
default "11"
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
increase this value.
The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.
For systems that have holes in their physical address space this
value also defines the minimal size of the hole that allows
freeing unused memory map.
This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
Don't change if unsure.
config 060_WRITETHROUGH
bool "Use write-through caching for 68060 supervisor accesses"


@ -46,7 +46,7 @@
#define _CACHEMASK040 (~0x060)
#define _PAGE_GLOBAL040 0x400 /* 68040 global bit, used for kva descs */
/* We borrow bit 24 to store the exclusive marker in swap PTEs. */
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE CF_PAGE_NOCACHE
/*


@ -2099,14 +2099,10 @@ endchoice
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 14 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
default "14" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
range 13 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB
default "13" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB
range 12 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB
default "12" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB
range 0 64
default "11"
default "13" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
default "12" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB
default "11" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
@ -2115,9 +2111,6 @@ config ARCH_FORCE_MAX_ORDER
blocks of physically contiguous memory, then you may need to
increase this value.
This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
The page size is not necessarily 4KB. Keep this in mind
when choosing a value for this option.


@ -70,7 +70,7 @@ enum fixed_addresses {
#include <asm-generic/fixmap.h>
/*
* Called from pgtable_init()
* Called from pagetable_init()
*/
extern void fixrange_init(unsigned long start, unsigned long end,
pgd_t *pgd_base);


@ -469,7 +469,8 @@ static inline pgprot_t pgprot_writecombine(pgprot_t _prot)
}
static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
unsigned long address)
unsigned long address,
pte_t *ptep)
{
}


@ -45,19 +45,17 @@ menu "Kernel features"
source "kernel/Kconfig.hz"
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 9 20
default "11"
int "Order of maximal physically contiguous allocations"
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
increase this value.
The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.
This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
Don't change if unsure.
endmenu


@ -267,6 +267,7 @@ config PPC
select MMU_GATHER_PAGE_SIZE
select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
select MMU_LAZY_TLB_SHOOTDOWN if PPC_BOOK3S_64
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE if PPC64 || NOT_COHERENT_CACHE
select NEED_PER_CPU_EMBED_FIRST_CHUNK if PPC64
@ -896,34 +897,27 @@ config DATA_SHIFT
8M pages will be pinned.
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 8 9 if PPC64 && PPC_64K_PAGES
default "9" if PPC64 && PPC_64K_PAGES
range 13 13 if PPC64 && !PPC_64K_PAGES
default "13" if PPC64 && !PPC_64K_PAGES
range 9 64 if PPC32 && PPC_16K_PAGES
default "9" if PPC32 && PPC_16K_PAGES
range 7 64 if PPC32 && PPC_64K_PAGES
default "7" if PPC32 && PPC_64K_PAGES
range 5 64 if PPC32 && PPC_256K_PAGES
default "5" if PPC32 && PPC_256K_PAGES
range 11 64
default "11"
int "Order of maximal physically contiguous allocations"
default "8" if PPC64 && PPC_64K_PAGES
default "12" if PPC64 && !PPC_64K_PAGES
default "8" if PPC32 && PPC_16K_PAGES
default "6" if PPC32 && PPC_64K_PAGES
default "4" if PPC32 && PPC_256K_PAGES
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
increase this value.
This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.
The page size is not necessarily 4KB. For example, on 64-bit
systems, 64KB pages can be enabled via CONFIG_PPC_64K_PAGES. Keep
this in mind when choosing a value for this option.
Don't change if unsure.
config PPC_SUBPAGE_PROT
bool "Support setting protections for 4k subpages (subpage_prot syscall)"
default n


@ -30,7 +30,7 @@ CONFIG_PREEMPT=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_BINFMT_MISC=m
CONFIG_MATH_EMULATION=y
CONFIG_ARCH_FORCE_MAX_ORDER=17
CONFIG_ARCH_FORCE_MAX_ORDER=16
CONFIG_PCI=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCI_MSI=y


@ -41,7 +41,7 @@ CONFIG_FIXED_PHY=y
CONFIG_FONT_8x16=y
CONFIG_FONT_8x8=y
CONFIG_FONTS=y
CONFIG_ARCH_FORCE_MAX_ORDER=13
CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAME_WARN=1024
CONFIG_FTL=y


@ -121,7 +121,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
unsigned long address)
unsigned long address,
pte_t *ptep)
{
/*
* Book3S 64 does not require spurious fault flushes because the PTE


@ -1611,7 +1611,7 @@ void start_secondary(void *unused)
if (IS_ENABLED(CONFIG_PPC32))
setup_kup();
mmgrab(&init_mm);
mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;
smp_store_cpu_info(cpu);


@ -97,7 +97,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
}
mmap_read_lock(mm);
chunk = (1UL << (PAGE_SHIFT + MAX_ORDER - 1)) /
chunk = (1UL << (PAGE_SHIFT + MAX_ORDER)) /
sizeof(struct vm_area_struct *);
chunk = min(chunk, entries);
for (entry = 0; entry < entries; entry += chunk) {


@ -797,10 +797,10 @@ void exit_lazy_flush_tlb(struct mm_struct *mm, bool always_flush)
if (current->active_mm == mm) {
WARN_ON_ONCE(current->mm != NULL);
/* Is a kernel thread and is using mm as the lazy tlb */
mmgrab(&init_mm);
mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;
switch_mm_irqs_off(mm, &init_mm, current);
mmdrop(mm);
mmdrop_lazy_tlb(mm);
}
/*


@ -474,6 +474,40 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
#ifdef CONFIG_PER_VMA_LOCK
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
vma = lock_vma_under_rcu(mm, address);
if (!vma)
goto lock_mmap;
if (unlikely(access_pkey_error(is_write, is_exec,
(error_code & DSISR_KEYFAULT), vma))) {
vma_end_read(vma);
goto lock_mmap;
}
if (unlikely(access_error(is_write, is_exec, vma))) {
vma_end_read(vma);
goto lock_mmap;
}
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
goto done;
}
count_vm_vma_lock_event(VMA_LOCK_RETRY);
if (fault_signal_pending(fault, regs))
return user_mode(regs) ? 0 : SIGBUS;
lock_mmap:
#endif /* CONFIG_PER_VMA_LOCK */
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@ -550,6 +584,9 @@ retry:
mmap_read_unlock(current->mm);
#ifdef CONFIG_PER_VMA_LOCK
done:
#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
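    s390 and x86 gain the same wiring later in this diff. Stripped of the
    architecture-specific access checks and error plumbing, the per-VMA-lock
    fast path grafted onto a fault handler has roughly this shape (a condensed
    sketch of the hunk above, not a drop-in implementation):

        #ifdef CONFIG_PER_VMA_LOCK
                if (!(flags & FAULT_FLAG_USER))
                        goto lock_mmap;                 /* kernel faults: old path */
                vma = lock_vma_under_rcu(mm, address);  /* per-VMA lock, no mmap_lock */
                if (!vma)
                        goto lock_mmap;                 /* fall back to mmap_lock */
                if (unlikely(access_error(is_write, is_exec, vma))) {
                        vma_end_read(vma);
                        goto lock_mmap;
                }
                fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
                vma_end_read(vma);
                if (!(fault & VM_FAULT_RETRY)) {
                        count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
                        goto done;                      /* handled without mmap_lock */
                }
                count_vm_vma_lock_event(VMA_LOCK_RETRY);
                /* pending signals and the retry case drop through to the
                 * traditional mmap_lock slow path below */
        lock_mmap:
        #endif /* CONFIG_PER_VMA_LOCK */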


@ -615,7 +615,7 @@ void __init gigantic_hugetlb_cma_reserve(void)
order = mmu_psize_to_shift(MMU_PAGE_16G) - PAGE_SHIFT;
if (order) {
VM_WARN_ON(order < MAX_ORDER);
VM_WARN_ON(order <= MAX_ORDER);
hugetlb_cma_reserve(order);
}
}


@ -16,6 +16,7 @@ config PPC_POWERNV
select PPC_DOORBELL
select MMU_NOTIFIER
select FORCE_SMP
select ARCH_SUPPORTS_PER_VMA_LOCK
default y
config OPAL_PRD


@ -1740,7 +1740,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
* DMA window can be larger than available memory, which will
* cause errors later.
*/
const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER - 1);
const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER);
/*
* We create the default window as big as we can. The constraint is


@ -22,6 +22,7 @@ config PPC_PSERIES
select HOTPLUG_CPU
select FORCE_SMP
select SWIOTLB
select ARCH_SUPPORTS_PER_VMA_LOCK
default y
config PARAVIRT


@ -120,13 +120,14 @@ config S390
select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_HUGETLBFS
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANTS_NO_INSTR
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WANT_OPTIMIZE_VMEMMAP
select BUILDTIME_TABLE_SORT
select CLONE_BACKWARDS2
select DMA_OPS if PCI


@ -1239,7 +1239,8 @@ static inline int pte_allow_rdp(pte_t old, pte_t new)
}
static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
unsigned long address)
unsigned long address,
pte_t *ptep)
{
/*
* RDP might not have propagated the PTE protection reset to all CPUs,
@ -1247,11 +1248,12 @@ static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
* NOTE: This will also be called when a racing pagetable update on
* another thread already installed the correct PTE. Both cases cannot
* really be distinguished.
* Therefore, only do the local TLB flush when RDP can be used, to avoid
* unnecessary overhead.
* Therefore, only do the local TLB flush when RDP can be used, and the
* PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
* A local RDP can be used to do the flush.
*/
if (MACHINE_HAS_RDP)
asm volatile("ptlb" : : : "memory");
if (MACHINE_HAS_RDP && !(pte_val(*ptep) & _PAGE_PROTECT))
__ptep_rdp(address, ptep, 0, 0, 1);
}
#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault


@ -407,6 +407,30 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
access = VM_WRITE;
if (access == VM_WRITE)
flags |= FAULT_FLAG_WRITE;
#ifdef CONFIG_PER_VMA_LOCK
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
vma = lock_vma_under_rcu(mm, address);
if (!vma)
goto lock_mmap;
if (!(vma->vm_flags & access)) {
vma_end_read(vma);
goto lock_mmap;
}
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
goto out;
}
count_vm_vma_lock_event(VMA_LOCK_RETRY);
/* Quick path to respond to signals */
if (fault_signal_pending(fault, regs)) {
fault = VM_FAULT_SIGNAL;
goto out;
}
lock_mmap:
#endif /* CONFIG_PER_VMA_LOCK */
mmap_read_lock(mm);
gmap = NULL;


@ -2591,6 +2591,13 @@ int gmap_mark_unmergeable(void)
int ret;
VMA_ITERATOR(vmi, mm, 0);
/*
* Make sure to disable KSM (if enabled for the whole process or
* individual VMAs). Note that nothing currently hinders user space
* from re-enabling it.
*/
clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
for_each_vma(vmi, vma) {
/* Copy vm_flags to avoid partial modifications in ksm_madvise */
vm_flags = vma->vm_flags;
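    The MMF_VM_MERGE_ANY flag cleared above is the per-process KSM switch that
    user space sets via prctl(). A hypothetical user-space snippet (the
    PR_SET_MEMORY_MERGE name and the fallback value 67 are assumptions based on
    the new per-process KSM interface; older uapi headers will not define it):

        #include <stdio.h>
        #include <sys/prctl.h>

        #ifndef PR_SET_MEMORY_MERGE
        #define PR_SET_MEMORY_MERGE 67  /* assumed value; absent from older headers */
        #endif

        int main(void)
        {
                /* Opt the whole process into KSM. As the comment above notes,
                 * nothing stops user space from re-enabling merging after the
                 * kernel (here, s390's gmap code) has cleared the flag. */
                if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
                        perror("PR_SET_MEMORY_MERGE");
                return 0;
        }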


@ -273,7 +273,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = max(PAGE_SIZE, mmap_min_addr);
info.low_limit = PAGE_SIZE;
info.high_limit = current->mm->mmap_base;
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;


@ -136,7 +136,7 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long ad
info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = max(PAGE_SIZE, mmap_min_addr);
info.low_limit = PAGE_SIZE;
info.high_limit = mm->mmap_base;
if (filp || (flags & MAP_SHARED))
info.align_mask = MMAP_ALIGN_MASK << PAGE_SHIFT;


@ -8,7 +8,7 @@ CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_BLK_DEV_BSG is not set
CONFIG_CPU_SUBTYPE_SH7724=y
CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_ARCH_FORCE_MAX_ORDER=11
CONFIG_MEMORY_SIZE=0x10000000
CONFIG_FLATMEM_MANUAL=y
CONFIG_SH_ECOVEC=y


@ -19,28 +19,24 @@ config PAGE_OFFSET
default "0x00000000"
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 9 64 if PAGE_SIZE_16KB
default "9" if PAGE_SIZE_16KB
range 7 64 if PAGE_SIZE_64KB
default "7" if PAGE_SIZE_64KB
range 11 64
default "14" if !MMU
default "11"
int "Order of maximal physically contiguous allocations"
default "8" if PAGE_SIZE_16KB
default "6" if PAGE_SIZE_64KB
default "13" if !MMU
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
increase this value.
This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.
The page size is not necessarily 4KB. Keep this in mind when
choosing a value for this option.
Don't change if unsure.
config MEMORY_START
hex "Physical memory start address"
default "0x08000000"


@ -271,18 +271,17 @@ config ARCH_SPARSEMEM_DEFAULT
def_bool y if SPARC64
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
default "13"
int "Order of maximal physically contiguous allocations"
default "12"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
increase this value.
The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.
This config option is actually maximum order plus one. For example,
a value of 13 means that the largest free memory block is 2^12 pages.
Don't change if unsure.
if SPARC64 || COMPILE_TEST
source "kernel/power/Kconfig"


@ -357,6 +357,42 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
*/
#define pgprot_noncached pgprot_noncached
static inline unsigned long pte_dirty(pte_t pte)
{
unsigned long mask;
__asm__ __volatile__(
"\n661: mov %1, %0\n"
" nop\n"
" .section .sun4v_2insn_patch, \"ax\"\n"
" .word 661b\n"
" sethi %%uhi(%2), %0\n"
" sllx %0, 32, %0\n"
" .previous\n"
: "=r" (mask)
: "i" (_PAGE_MODIFIED_4U), "i" (_PAGE_MODIFIED_4V));
return (pte_val(pte) & mask);
}
static inline unsigned long pte_write(pte_t pte)
{
unsigned long mask;
__asm__ __volatile__(
"\n661: mov %1, %0\n"
" nop\n"
" .section .sun4v_2insn_patch, \"ax\"\n"
" .word 661b\n"
" sethi %%uhi(%2), %0\n"
" sllx %0, 32, %0\n"
" .previous\n"
: "=r" (mask)
: "i" (_PAGE_WRITE_4U), "i" (_PAGE_WRITE_4V));
return (pte_val(pte) & mask);
}
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);
#define arch_make_huge_pte arch_make_huge_pte
@ -418,28 +454,43 @@ static inline bool is_hugetlb_pte(pte_t pte)
}
#endif
static inline pte_t __pte_mkhwwrite(pte_t pte)
{
unsigned long val = pte_val(pte);
/*
* Note: we only want to set the HW writable bit if the SW writable bit
* and the SW dirty bit are set.
*/
__asm__ __volatile__(
"\n661: or %0, %2, %0\n"
" .section .sun4v_1insn_patch, \"ax\"\n"
" .word 661b\n"
" or %0, %3, %0\n"
" .previous\n"
: "=r" (val)
: "0" (val), "i" (_PAGE_W_4U), "i" (_PAGE_W_4V));
return __pte(val);
}
static inline pte_t pte_mkdirty(pte_t pte)
{
unsigned long val = pte_val(pte), tmp;
unsigned long val = pte_val(pte), mask;
__asm__ __volatile__(
"\n661: or %0, %3, %0\n"
" nop\n"
"\n662: nop\n"
"\n661: mov %1, %0\n"
" nop\n"
" .section .sun4v_2insn_patch, \"ax\"\n"
" .word 661b\n"
" sethi %%uhi(%4), %1\n"
" sllx %1, 32, %1\n"
" .word 662b\n"
" or %1, %%lo(%4), %1\n"
" or %0, %1, %0\n"
" sethi %%uhi(%2), %0\n"
" sllx %0, 32, %0\n"
" .previous\n"
: "=r" (val), "=r" (tmp)
: "0" (val), "i" (_PAGE_MODIFIED_4U | _PAGE_W_4U),
"i" (_PAGE_MODIFIED_4V | _PAGE_W_4V));
: "=r" (mask)
: "i" (_PAGE_MODIFIED_4U), "i" (_PAGE_MODIFIED_4V));
return __pte(val);
pte = __pte(val | mask);
return pte_write(pte) ? __pte_mkhwwrite(pte) : pte;
}
static inline pte_t pte_mkclean(pte_t pte)
@ -481,7 +532,8 @@ static inline pte_t pte_mkwrite(pte_t pte)
: "=r" (mask)
: "i" (_PAGE_WRITE_4U), "i" (_PAGE_WRITE_4V));
return __pte(val | mask);
pte = __pte(val | mask);
return pte_dirty(pte) ? __pte_mkhwwrite(pte) : pte;
}
static inline pte_t pte_wrprotect(pte_t pte)
@ -584,42 +636,6 @@ static inline unsigned long pte_young(pte_t pte)
return (pte_val(pte) & mask);
}
static inline unsigned long pte_dirty(pte_t pte)
{
unsigned long mask;
__asm__ __volatile__(
"\n661: mov %1, %0\n"
" nop\n"
" .section .sun4v_2insn_patch, \"ax\"\n"
" .word 661b\n"
" sethi %%uhi(%2), %0\n"
" sllx %0, 32, %0\n"
" .previous\n"
: "=r" (mask)
: "i" (_PAGE_MODIFIED_4U), "i" (_PAGE_MODIFIED_4V));
return (pte_val(pte) & mask);
}
static inline unsigned long pte_write(pte_t pte)
{
unsigned long mask;
__asm__ __volatile__(
"\n661: mov %1, %0\n"
" nop\n"
" .section .sun4v_2insn_patch, \"ax\"\n"
" .word 661b\n"
" sethi %%uhi(%2), %0\n"
" sllx %0, 32, %0\n"
" .previous\n"
: "=r" (mask)
: "i" (_PAGE_WRITE_4U), "i" (_PAGE_WRITE_4V));
return (pte_val(pte) & mask);
}
static inline unsigned long pte_exec(pte_t pte)
{
unsigned long mask;


@ -193,7 +193,7 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
size = IO_PAGE_ALIGN(size);
order = get_order(size);
if (unlikely(order >= MAX_ORDER))
if (unlikely(order > MAX_ORDER))
return NULL;
npages = size >> IO_PAGE_SHIFT;
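    Worked example of the relaxed bound above, under the same assumed 4 KiB /
    MAX_ORDER = 10 defaults as earlier: get_order(4 MiB) is 10, which now
    equals MAX_ORDER and is accepted, while get_order(8 MiB) is 11 and still
    fails the order > MAX_ORDER check. Loops over all valid orders become
    inclusive in the same way (order <= MAX_ORDER), as in the
    cheetah_ecache_flush_init() hunk that follows.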


@ -897,7 +897,7 @@ void __init cheetah_ecache_flush_init(void)
/* Now allocate error trap reporting scoreboard. */
sz = NR_CPUS * (2 * sizeof(struct cheetah_err_info));
for (order = 0; order < MAX_ORDER; order++) {
for (order = 0; order <= MAX_ORDER; order++) {
if ((PAGE_SIZE << order) >= sz)
break;
}


@ -402,8 +402,8 @@ void tsb_grow(struct mm_struct *mm, unsigned long tsb_index, unsigned long rss)
unsigned long new_rss_limit;
gfp_t gfp_flags;
if (max_tsb_size > (PAGE_SIZE << MAX_ORDER))
max_tsb_size = (PAGE_SIZE << MAX_ORDER);
if (max_tsb_size > PAGE_SIZE << MAX_ORDER)
max_tsb_size = PAGE_SIZE << MAX_ORDER;
new_cache_index = 0;
for (new_size = 8192; new_size < max_tsb_size; new_size <<= 1UL) {


@ -27,6 +27,7 @@ config X86_64
# Options that are inherently 64-bit kernel only:
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
@ -125,8 +126,8 @@ config X86
select ARCH_WANTS_NO_INSTR
select ARCH_WANT_GENERAL_HUGETLB
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if X86_64
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANT_OPTIMIZE_VMEMMAP if X86_64
select ARCH_WANTS_THP_SWAP if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT


@ -1097,7 +1097,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
}
#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot))


@ -15,24 +15,18 @@
#endif
#define __HAVE_ARCH_MEMCPY 1
#if defined(__SANITIZE_MEMORY__) && defined(__NO_FORTIFY)
#undef memcpy
#define memcpy __msan_memcpy
#else
extern void *memcpy(void *to, const void *from, size_t len);
#endif
extern void *__memcpy(void *to, const void *from, size_t len);
#define __HAVE_ARCH_MEMSET
#if defined(__SANITIZE_MEMORY__) && defined(__NO_FORTIFY)
extern void *__msan_memset(void *s, int c, size_t n);
#undef memset
#define memset __msan_memset
#else
void *memset(void *s, int c, size_t n);
#endif
void *__memset(void *s, int c, size_t n);
/*
* KMSAN needs to instrument as much code as possible. Use C versions of
* memsetXX() from lib/string.c under KMSAN.
*/
#if !defined(CONFIG_KMSAN)
#define __HAVE_ARCH_MEMSET16
static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
{
@ -68,15 +62,10 @@ static inline void *memset64(uint64_t *s, uint64_t v, size_t n)
: "memory");
return s;
}
#endif
#define __HAVE_ARCH_MEMMOVE
#if defined(__SANITIZE_MEMORY__) && defined(__NO_FORTIFY)
#undef memmove
void *__msan_memmove(void *dest, const void *src, size_t len);
#define memmove __msan_memmove
#else
void *memmove(void *dest, const void *src, size_t count);
#endif
void *__memmove(void *dest, const void *src, size_t count);
int memcmp(const void *cs, const void *ct, size_t count);


@ -19,6 +19,7 @@
#include <linux/uaccess.h> /* faulthandler_disabled() */
#include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
#include <linux/mm_types.h>
#include <linux/mm.h> /* find_and_lock_vma() */
#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs,
}
#endif
#ifdef CONFIG_PER_VMA_LOCK
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
vma = lock_vma_under_rcu(mm, address);
if (!vma)
goto lock_mmap;
if (unlikely(access_error(error_code, vma))) {
vma_end_read(vma);
goto lock_mmap;
}
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
goto done;
}
count_vm_vma_lock_event(VMA_LOCK_RETRY);
/* Quick path to respond to signals */
if (fault_signal_pending(fault, regs)) {
if (!user_mode(regs))
kernelmode_fixup_or_oops(regs, error_code, address,
SIGBUS, BUS_ADRERR,
ARCH_DEFAULT_PKEY);
return;
}
lock_mmap:
#endif /* CONFIG_PER_VMA_LOCK */
/*
* Kernel-mode access to the user address space should only occur
* on well-defined single instructions listed in the exception
@ -1433,6 +1466,9 @@ good_area:
}
mmap_read_unlock(mm);
#ifdef CONFIG_PER_VMA_LOCK
done:
#endif
if (likely(!(fault & VM_FAULT_ERROR)))
return;


@ -1073,11 +1073,15 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
}
/*
* untrack_pfn_moved is called, while mremapping a pfnmap for a new region,
* with the old vma after its pfnmap page table has been removed. The new
* vma has a new pfnmap to the same pfn & cache type with VM_PAT set.
* untrack_pfn_clear is called if the following situation fits:
*
* 1) while mremapping a pfnmap for a new region, with the old vma after
* its pfnmap page table has been removed. The new vma has a new pfnmap
* to the same pfn & cache type with VM_PAT set.
* 2) while duplicating vm area, the new vma fails to copy the pgtable from
* old vma.
*/
void untrack_pfn_moved(struct vm_area_struct *vma)
void untrack_pfn_clear(struct vm_area_struct *vma)
{
vm_flags_clear(vma, VM_PAT);
}


@ -772,18 +772,17 @@ config HIGHMEM
If unsure, say Y.
config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
default "11"
int "Order of maximal physically contiguous allocations"
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
increase this value.
The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.
This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
Don't change if unsure.
endmenu


@ -226,8 +226,8 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
if (*ppos < 0 || !count)
return -EINVAL;
if (count > (PAGE_SIZE << (MAX_ORDER - 1)))
count = PAGE_SIZE << (MAX_ORDER - 1);
if (count > (PAGE_SIZE << MAX_ORDER))
count = PAGE_SIZE << MAX_ORDER;
buf = kmalloc(count, GFP_KERNEL);
if (!buf)
@ -373,8 +373,8 @@ static ssize_t regmap_reg_ranges_read_file(struct file *file,
if (*ppos < 0 || !count)
return -EINVAL;
if (count > (PAGE_SIZE << (MAX_ORDER - 1)))
count = PAGE_SIZE << (MAX_ORDER - 1);
if (count > (PAGE_SIZE << MAX_ORDER))
count = PAGE_SIZE << MAX_ORDER;
buf = kmalloc(count, GFP_KERNEL);
if (!buf)


@ -3108,7 +3108,7 @@ loop:
ptr->resultcode = 0;
if (ptr->flags & (FD_RAW_READ | FD_RAW_WRITE)) {
if (ptr->length <= 0 || ptr->length >= MAX_LEN)
if (ptr->length <= 0 || ptr->length > MAX_LEN)
return -EINVAL;
ptr->kernel_data = (char *)fd_dma_mem_alloc(ptr->length);
fallback_on_nodma_alloc(&ptr->kernel_data, ptr->length);


@ -54,9 +54,8 @@ static size_t huge_class_size;
static const struct block_device_operations zram_devops;
static void zram_free_page(struct zram *zram, size_t index);
static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset, struct bio *bio);
static int zram_read_page(struct zram *zram, struct page *page, u32 index,
struct bio *parent);
static int zram_slot_trylock(struct zram *zram, u32 index)
{
@ -148,6 +147,7 @@ static inline bool is_partial_io(struct bio_vec *bvec)
{
return bvec->bv_len != PAGE_SIZE;
}
#define ZRAM_PARTIAL_IO 1
#else
static inline bool is_partial_io(struct bio_vec *bvec)
{
@ -174,36 +174,6 @@ static inline u32 zram_get_priority(struct zram *zram, u32 index)
return prio & ZRAM_COMP_PRIORITY_MASK;
}
/*
* Check if request is within bounds and aligned on zram logical blocks.
*/
static inline bool valid_io_request(struct zram *zram,
sector_t start, unsigned int size)
{
u64 end, bound;
/* unaligned request */
if (unlikely(start & (ZRAM_SECTOR_PER_LOGICAL_BLOCK - 1)))
return false;
if (unlikely(size & (ZRAM_LOGICAL_BLOCK_SIZE - 1)))
return false;
end = start + (size >> SECTOR_SHIFT);
bound = zram->disksize >> SECTOR_SHIFT;
/* out of range */
if (unlikely(start >= bound || end > bound || start > end))
return false;
/* I/O request is valid */
return true;
}
static void update_position(u32 *index, int *offset, struct bio_vec *bvec)
{
*index += (*offset + bvec->bv_len) / PAGE_SIZE;
*offset = (*offset + bvec->bv_len) % PAGE_SIZE;
}
static inline void update_used_max(struct zram *zram,
const unsigned long pages)
{
@ -606,41 +576,16 @@ static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
atomic64_dec(&zram->stats.bd_count);
}
static void zram_page_end_io(struct bio *bio)
{
struct page *page = bio_first_page_all(bio);
page_endio(page, op_is_write(bio_op(bio)),
blk_status_to_errno(bio->bi_status));
bio_put(bio);
}
/*
* Returns 1 if the submission is successful.
*/
static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec,
static void read_from_bdev_async(struct zram *zram, struct page *page,
unsigned long entry, struct bio *parent)
{
struct bio *bio;
bio = bio_alloc(zram->bdev, 1, parent ? parent->bi_opf : REQ_OP_READ,
GFP_NOIO);
if (!bio)
return -ENOMEM;
bio = bio_alloc(zram->bdev, 1, parent->bi_opf, GFP_NOIO);
bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len, bvec->bv_offset)) {
bio_put(bio);
return -EIO;
}
if (!parent)
bio->bi_end_io = zram_page_end_io;
else
bio_chain(bio, parent);
__bio_add_page(bio, page, PAGE_SIZE, 0);
bio_chain(bio, parent);
submit_bio(bio);
return 1;
}
#define PAGE_WB_SIG "page_index="
@ -701,10 +646,6 @@ static ssize_t writeback_store(struct device *dev,
}
for (; nr_pages != 0; index++, nr_pages--) {
struct bio_vec bvec;
bvec_set_page(&bvec, page, PAGE_SIZE, 0);
spin_lock(&zram->wb_limit_lock);
if (zram->wb_limit_enable && !zram->bd_wb_limit) {
spin_unlock(&zram->wb_limit_lock);
@ -748,7 +689,7 @@ static ssize_t writeback_store(struct device *dev,
/* Need for hugepage writeback racing */
zram_set_flag(zram, index, ZRAM_IDLE);
zram_slot_unlock(zram, index);
if (zram_bvec_read(zram, &bvec, index, 0, NULL)) {
if (zram_read_page(zram, page, index, NULL)) {
zram_slot_lock(zram, index);
zram_clear_flag(zram, index, ZRAM_UNDER_WB);
zram_clear_flag(zram, index, ZRAM_IDLE);
@ -759,9 +700,8 @@ static ssize_t writeback_store(struct device *dev,
bio_init(&bio, zram->bdev, &bio_vec, 1,
REQ_OP_WRITE | REQ_SYNC);
bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
bio_add_page(&bio, page, PAGE_SIZE, 0);
bio_add_page(&bio, bvec.bv_page, bvec.bv_len,
bvec.bv_offset);
/*
* XXX: A single page IO would be inefficient for write
* but it would be not bad as starter.
@ -829,19 +769,20 @@ struct zram_work {
struct work_struct work;
struct zram *zram;
unsigned long entry;
struct bio *bio;
struct bio_vec bvec;
struct page *page;
int error;
};
#if PAGE_SIZE != 4096
static void zram_sync_read(struct work_struct *work)
{
struct zram_work *zw = container_of(work, struct zram_work, work);
struct zram *zram = zw->zram;
unsigned long entry = zw->entry;
struct bio *bio = zw->bio;
struct bio_vec bv;
struct bio bio;
read_from_bdev_async(zram, &zw->bvec, entry, bio);
bio_init(&bio, zw->zram->bdev, &bv, 1, REQ_OP_READ);
bio.bi_iter.bi_sector = zw->entry * (PAGE_SIZE >> 9);
__bio_add_page(&bio, zw->page, PAGE_SIZE, 0);
zw->error = submit_bio_wait(&bio);
}
/*
@ -849,45 +790,39 @@ static void zram_sync_read(struct work_struct *work)
* chained IO with parent IO in same context, it's a deadlock. To avoid that,
* use a worker thread context.
*/
static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *bio)
static int read_from_bdev_sync(struct zram *zram, struct page *page,
unsigned long entry)
{
struct zram_work work;
work.bvec = *bvec;
work.page = page;
work.zram = zram;
work.entry = entry;
work.bio = bio;
INIT_WORK_ONSTACK(&work.work, zram_sync_read);
queue_work(system_unbound_wq, &work.work);
flush_work(&work.work);
destroy_work_on_stack(&work.work);
return 1;
return work.error;
}
#else
static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *bio)
{
WARN_ON(1);
return -EIO;
}
#endif
static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
static int read_from_bdev(struct zram *zram, struct page *page,
unsigned long entry, struct bio *parent)
{
atomic64_inc(&zram->stats.bd_reads);
if (sync)
return read_from_bdev_sync(zram, bvec, entry, parent);
else
return read_from_bdev_async(zram, bvec, entry, parent);
if (!parent) {
if (WARN_ON_ONCE(!IS_ENABLED(ZRAM_PARTIAL_IO)))
return -EIO;
return read_from_bdev_sync(zram, page, entry);
}
read_from_bdev_async(zram, page, entry, parent);
return 0;
}
#else
static inline void reset_bdev(struct zram *zram) {};
static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
static int read_from_bdev(struct zram *zram, struct page *page,
unsigned long entry, struct bio *parent)
{
return -EIO;
}
@ -1190,10 +1125,9 @@ static ssize_t io_stat_show(struct device *dev,
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
"%8llu %8llu %8llu %8llu\n",
"%8llu %8llu 0 %8llu\n",
(u64)atomic64_read(&zram->stats.failed_reads),
(u64)atomic64_read(&zram->stats.failed_writes),
(u64)atomic64_read(&zram->stats.invalid_io),
(u64)atomic64_read(&zram->stats.notify_free));
up_read(&zram->init_lock);
@ -1371,20 +1305,6 @@ out:
~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB));
}
/*
* Reads a page from the writeback devices. Corresponding ZRAM slot
* should be unlocked.
*/
static int zram_bvec_read_from_bdev(struct zram *zram, struct page *page,
u32 index, struct bio *bio, bool partial_io)
{
struct bio_vec bvec;
bvec_set_page(&bvec, page, PAGE_SIZE, 0);
return read_from_bdev(zram, &bvec, zram_get_element(zram, index), bio,
partial_io);
}
/*
* Reads (decompresses if needed) a page from zspool (zsmalloc).
* Corresponding ZRAM slot should be locked.
@ -1434,8 +1354,8 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
return ret;
}
static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
struct bio *bio, bool partial_io)
static int zram_read_page(struct zram *zram, struct page *page, u32 index,
struct bio *parent)
{
int ret;
@ -1445,11 +1365,14 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
ret = zram_read_from_zspool(zram, page, index);
zram_slot_unlock(zram, index);
} else {
/* Slot should be unlocked before the function call */
/*
* The slot should be unlocked before reading from the backing
* device.
*/
zram_slot_unlock(zram, index);
ret = zram_bvec_read_from_bdev(zram, page, index, bio,
partial_io);
ret = read_from_bdev(zram, page, zram_get_element(zram, index),
parent);
}
/* Should NEVER happen. Return bio error if it does. */
@ -1459,39 +1382,34 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
return ret;
}
static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset, struct bio *bio)
/*
* Use a temporary buffer to decompress the page, as the decompressor
* always expects a full page for the output.
*/
static int zram_bvec_read_partial(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset)
{
struct page *page = alloc_page(GFP_NOIO);
int ret;
struct page *page;
page = bvec->bv_page;
if (is_partial_io(bvec)) {
/* Use a temporary buffer to decompress the page */
page = alloc_page(GFP_NOIO|__GFP_HIGHMEM);
if (!page)
return -ENOMEM;
}
ret = __zram_bvec_read(zram, page, index, bio, is_partial_io(bvec));
if (unlikely(ret))
goto out;
if (is_partial_io(bvec)) {
void *src = kmap_atomic(page);
memcpy_to_bvec(bvec, src + offset);
kunmap_atomic(src);
}
out:
if (is_partial_io(bvec))
__free_page(page);
if (!page)
return -ENOMEM;
ret = zram_read_page(zram, page, index, NULL);
if (likely(!ret))
memcpy_to_bvec(bvec, page_address(page) + offset);
__free_page(page);
return ret;
}
static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
u32 index, struct bio *bio)
static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset, struct bio *bio)
{
if (is_partial_io(bvec))
return zram_bvec_read_partial(zram, bvec, index, offset);
return zram_read_page(zram, bvec->bv_page, index, bio);
}
static int zram_write_page(struct zram *zram, struct page *page, u32 index)
{
int ret = 0;
unsigned long alloced_pages;
@ -1499,7 +1417,6 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
unsigned int comp_len = 0;
void *src, *dst, *mem;
struct zcomp_strm *zstrm;
struct page *page = bvec->bv_page;
unsigned long element = 0;
enum zram_pageflags flags = 0;
@ -1617,42 +1534,35 @@ out:
return ret;
}
static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset, struct bio *bio)
/*
* This is a partial IO. Read the full page before writing the changes.
*/
static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset, struct bio *bio)
{
struct page *page = alloc_page(GFP_NOIO);
int ret;
struct page *page = NULL;
struct bio_vec vec;
vec = *bvec;
if (is_partial_io(bvec)) {
void *dst;
/*
* This is a partial IO. We need to read the full page
* before to write the changes.
*/
page = alloc_page(GFP_NOIO|__GFP_HIGHMEM);
if (!page)
return -ENOMEM;
if (!page)
return -ENOMEM;
ret = __zram_bvec_read(zram, page, index, bio, true);
if (ret)
goto out;
dst = kmap_atomic(page);
memcpy_from_bvec(dst + offset, bvec);
kunmap_atomic(dst);
bvec_set_page(&vec, page, PAGE_SIZE, 0);
ret = zram_read_page(zram, page, index, bio);
if (!ret) {
memcpy_from_bvec(page_address(page) + offset, bvec);
ret = zram_write_page(zram, page, index);
}
ret = __zram_bvec_write(zram, &vec, index, bio);
out:
if (is_partial_io(bvec))
__free_page(page);
__free_page(page);
return ret;
}
static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
u32 index, int offset, struct bio *bio)
{
if (is_partial_io(bvec))
return zram_bvec_write_partial(zram, bvec, index, offset, bio);
return zram_write_page(zram, bvec->bv_page, index);
}
#ifdef CONFIG_ZRAM_MULTI_COMP
/*
* This function will decompress (unless it's ZRAM_HUGE) the page and then
@ -1761,7 +1671,7 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
/*
* No direct reclaim (slow path) for handle allocation and no
* re-compression attempt (unlike in __zram_bvec_write()) since
* re-compression attempt (unlike in zram_write_bvec()) since
* we already have stored that object in zsmalloc. If we cannot
* alloc memory for recompressed object then we bail out and
* simply keep the old (existing) object in zsmalloc.
@ -1921,15 +1831,12 @@ release_init_lock:
}
#endif
/*
* zram_bio_discard - handler on discard request
* @index: physical block index in PAGE_SIZE units
* @offset: byte offset within physical block
*/
static void zram_bio_discard(struct zram *zram, u32 index,
int offset, struct bio *bio)
static void zram_bio_discard(struct zram *zram, struct bio *bio)
{
size_t n = bio->bi_iter.bi_size;
u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
u32 offset = (bio->bi_iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
SECTOR_SHIFT;
/*
* zram manages data in physical block size units. Because logical block
@ -1957,80 +1864,58 @@ static void zram_bio_discard(struct zram *zram, u32 index,
index++;
n -= PAGE_SIZE;
}
bio_endio(bio);
}
/*
* Returns errno if it has some problem. Otherwise return 0 or 1.
* Returns 0 if IO request was done synchronously
* Returns 1 if IO request was successfully submitted.
*/
static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index,
int offset, enum req_op op, struct bio *bio)
static void zram_bio_read(struct zram *zram, struct bio *bio)
{
int ret;
if (!op_is_write(op)) {
ret = zram_bvec_read(zram, bvec, index, offset, bio);
flush_dcache_page(bvec->bv_page);
} else {
ret = zram_bvec_write(zram, bvec, index, offset, bio);
}
zram_slot_lock(zram, index);
zram_accessed(zram, index);
zram_slot_unlock(zram, index);
if (unlikely(ret < 0)) {
if (!op_is_write(op))
atomic64_inc(&zram->stats.failed_reads);
else
atomic64_inc(&zram->stats.failed_writes);
}
return ret;
}
static void __zram_make_request(struct zram *zram, struct bio *bio)
{
int offset;
u32 index;
struct bio_vec bvec;
struct bvec_iter iter;
struct bio_vec bv;
unsigned long start_time;
index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
offset = (bio->bi_iter.bi_sector &
(SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
start_time = bio_start_io_acct(bio);
bio_for_each_segment(bv, bio, iter) {
u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
SECTOR_SHIFT;
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
case REQ_OP_WRITE_ZEROES:
zram_bio_discard(zram, index, offset, bio);
bio_endio(bio);
return;
default:
break;
if (zram_bvec_read(zram, &bv, index, offset, bio) < 0) {
atomic64_inc(&zram->stats.failed_reads);
bio->bi_status = BLK_STS_IOERR;
break;
}
flush_dcache_page(bv.bv_page);
zram_slot_lock(zram, index);
zram_accessed(zram, index);
zram_slot_unlock(zram, index);
}
bio_end_io_acct(bio, start_time);
bio_endio(bio);
}
static void zram_bio_write(struct zram *zram, struct bio *bio)
{
struct bvec_iter iter;
struct bio_vec bv;
unsigned long start_time;
start_time = bio_start_io_acct(bio);
bio_for_each_segment(bvec, bio, iter) {
struct bio_vec bv = bvec;
unsigned int unwritten = bvec.bv_len;
bio_for_each_segment(bv, bio, iter) {
u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
SECTOR_SHIFT;
do {
bv.bv_len = min_t(unsigned int, PAGE_SIZE - offset,
unwritten);
if (zram_bvec_rw(zram, &bv, index, offset,
bio_op(bio), bio) < 0) {
bio->bi_status = BLK_STS_IOERR;
break;
}
if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
atomic64_inc(&zram->stats.failed_writes);
bio->bi_status = BLK_STS_IOERR;
break;
}
bv.bv_offset += bv.bv_len;
unwritten -= bv.bv_len;
update_position(&index, &offset, &bv);
} while (unwritten);
zram_slot_lock(zram, index);
zram_accessed(zram, index);
zram_slot_unlock(zram, index);
}
bio_end_io_acct(bio, start_time);
bio_endio(bio);
@ -2043,14 +1928,21 @@ static void zram_submit_bio(struct bio *bio)
{
struct zram *zram = bio->bi_bdev->bd_disk->private_data;
if (!valid_io_request(zram, bio->bi_iter.bi_sector,
bio->bi_iter.bi_size)) {
atomic64_inc(&zram->stats.invalid_io);
bio_io_error(bio);
return;
switch (bio_op(bio)) {
case REQ_OP_READ:
zram_bio_read(zram, bio);
break;
case REQ_OP_WRITE:
zram_bio_write(zram, bio);
break;
case REQ_OP_DISCARD:
case REQ_OP_WRITE_ZEROES:
zram_bio_discard(zram, bio);
break;
default:
WARN_ON_ONCE(1);
bio_endio(bio);
}
__zram_make_request(zram, bio);
}
static void zram_slot_free_notify(struct block_device *bdev,
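    The index/offset arithmetic used by zram_bio_read() and zram_bio_write()
    above maps a bio segment's starting sector onto a page slot plus a byte
    offset within it. A stand-alone illustration of that mapping (assuming
    4 KiB pages, so SECTORS_PER_PAGE is 8):

        #include <stdio.h>

        #define SECTOR_SHIFT            9
        #define PAGE_SHIFT              12  /* assumed 4 KiB pages */
        #define SECTORS_PER_PAGE_SHIFT  (PAGE_SHIFT - SECTOR_SHIFT)
        #define SECTORS_PER_PAGE        (1 << SECTORS_PER_PAGE_SHIFT)

        int main(void)
        {
                unsigned long long sector = 9;  /* 512 bytes into the second page */
                unsigned int index  = sector >> SECTORS_PER_PAGE_SHIFT;
                unsigned int offset = (sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;

                printf("index=%u offset=%u\n", index, offset);  /* index=1 offset=512 */
                return 0;
        }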


@ -78,7 +78,6 @@ struct zram_stats {
atomic64_t compr_data_size; /* compressed size of pages stored */
atomic64_t failed_reads; /* can happen when memory is too low */
atomic64_t failed_writes; /* can happen when memory is too low */
atomic64_t invalid_io; /* non-page-aligned I/O requests */
atomic64_t notify_free; /* no. of swap slot free notifications */
atomic64_t same_pages; /* no. of same element filled pages */
atomic64_t huge_pages; /* no. of huge pages */


@ -892,7 +892,7 @@ static int sev_ioctl_do_get_id2(struct sev_issue_cmd *argp)
/*
* The length of the ID shouldn't be assumed by software since
* it may change in the future. The allocation size is limited
* to 1 << (PAGE_SHIFT + MAX_ORDER - 1) by the page allocator.
* to 1 << (PAGE_SHIFT + MAX_ORDER) by the page allocator.
* If the allocation fails, simply return ENOMEM rather than
* warning in the kernel log.
*/


@ -70,11 +70,11 @@ struct hisi_acc_sgl_pool *hisi_acc_create_sgl_pool(struct device *dev,
HISI_ACC_SGL_ALIGN_SIZE);
/*
* the pool may allocate a block of memory of size PAGE_SIZE * 2^(MAX_ORDER - 1),
* the pool may allocate a block of memory of size PAGE_SIZE * 2^MAX_ORDER,
* block size may exceed 2^31 on ia64, so the max of block size is 2^31
*/
block_size = 1 << (PAGE_SHIFT + MAX_ORDER <= 32 ?
PAGE_SHIFT + MAX_ORDER - 1 : 31);
block_size = 1 << (PAGE_SHIFT + MAX_ORDER < 32 ?
PAGE_SHIFT + MAX_ORDER : 31);
sgl_num_per_block = block_size / sgl_size;
block_num = count / sgl_num_per_block;
remain_sgl = count % sgl_num_per_block;
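    Putting numbers on the comment above: with 4 KiB pages and MAX_ORDER = 10
    (assumed defaults), PAGE_SHIFT + MAX_ORDER is 22, well under 32, so
    block_size is 1 << 22 = 4 MiB per block. The 2^31 cap only matters on
    configurations with large base pages and a large MAX_ORDER, where the sum
    would reach 32 or more.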


@ -41,12 +41,11 @@ struct dma_heap_attachment {
bool mapped;
};
#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
#define MID_ORDER_GFP (LOW_ORDER_GFP | __GFP_NOWARN)
#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO)
#define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
| __GFP_NORETRY) & ~__GFP_RECLAIM) \
| __GFP_COMP)
static gfp_t order_flags[] = {HIGH_ORDER_GFP, MID_ORDER_GFP, LOW_ORDER_GFP};
static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
/*
* The selection of the orders used for allocation (1MB, 64K, 4K) is designed
* to match with the sizes often found in IOMMUs. Using order 4 pages instead
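    For reference, the 1 MB / 64 K / 4 K sizes named in that comment correspond
    to orders 8, 4 and 0 with 4 KiB base pages (2^8, 2^4 and 2^0 pages). After
    the change above, both of the higher orders use the HIGH_ORDER_GFP flags
    (no direct reclaim, no warning on failure), and only the order-0 fallback
    keeps the plain LOW_ORDER_GFP flags.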


@ -115,7 +115,7 @@ static int get_huge_pages(struct drm_i915_gem_object *obj)
do {
struct page *page;
GEM_BUG_ON(order >= MAX_ORDER);
GEM_BUG_ON(order > MAX_ORDER);
page = alloc_pages(GFP | __GFP_ZERO, order);
if (!page)
goto err;


@ -261,7 +261,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
* encryption bits. This is because the exact location of the
* data may not be known at mmap() time and may also change
* at arbitrary times while the data is mmap'ed.
* See vmf_insert_mixed_prot() for a discussion.
* See vmf_insert_pfn_prot() for a discussion.
*/
ret = vmf_insert_pfn_prot(vma, address, pfn, prot);


@ -65,11 +65,11 @@ module_param(page_pool_size, ulong, 0644);
static atomic_long_t allocated_pages;
static struct ttm_pool_type global_write_combined[MAX_ORDER];
static struct ttm_pool_type global_uncached[MAX_ORDER];
static struct ttm_pool_type global_write_combined[MAX_ORDER + 1];
static struct ttm_pool_type global_uncached[MAX_ORDER + 1];
static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER];
static struct ttm_pool_type global_dma32_uncached[MAX_ORDER];
static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER + 1];
static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1];
static spinlock_t shrinker_lock;
static struct list_head shrinker_list;
@ -444,7 +444,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
else
gfp_flags |= GFP_HIGHUSER;
for (order = min_t(unsigned int, MAX_ORDER - 1, __fls(num_pages));
for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
num_pages;
order = min_t(unsigned int, order, __fls(num_pages))) {
struct ttm_pool_type *pt;
@ -563,7 +563,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
if (use_dma_alloc) {
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
for (j = 0; j < MAX_ORDER; ++j)
for (j = 0; j <= MAX_ORDER; ++j)
ttm_pool_type_init(&pool->caching[i].orders[j],
pool, i, j);
}
@ -583,7 +583,7 @@ void ttm_pool_fini(struct ttm_pool *pool)
if (pool->use_dma_alloc) {
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
for (j = 0; j < MAX_ORDER; ++j)
for (j = 0; j <= MAX_ORDER; ++j)
ttm_pool_type_fini(&pool->caching[i].orders[j]);
}
@ -637,7 +637,7 @@ static void ttm_pool_debugfs_header(struct seq_file *m)
unsigned int i;
seq_puts(m, "\t ");
for (i = 0; i < MAX_ORDER; ++i)
for (i = 0; i <= MAX_ORDER; ++i)
seq_printf(m, " ---%2u---", i);
seq_puts(m, "\n");
}
@ -648,7 +648,7 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt,
{
unsigned int i;
for (i = 0; i < MAX_ORDER; ++i)
for (i = 0; i <= MAX_ORDER; ++i)
seq_printf(m, " %8u", ttm_pool_type_count(&pt[i]));
seq_puts(m, "\n");
}
@ -757,7 +757,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
spin_lock_init(&shrinker_lock);
INIT_LIST_HEAD(&shrinker_list);
for (i = 0; i < MAX_ORDER; ++i) {
for (i = 0; i <= MAX_ORDER; ++i) {
ttm_pool_type_init(&global_write_combined[i], NULL,
ttm_write_combined, i);
ttm_pool_type_init(&global_uncached[i], NULL, ttm_uncached, i);
@ -790,7 +790,7 @@ void ttm_pool_mgr_fini(void)
{
unsigned int i;
for (i = 0; i < MAX_ORDER; ++i) {
for (i = 0; i <= MAX_ORDER; ++i) {
ttm_pool_type_fini(&global_write_combined[i]);
ttm_pool_type_fini(&global_uncached[i]);


@ -182,7 +182,7 @@
#ifdef CONFIG_CMA_ALIGNMENT
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + CONFIG_CMA_ALIGNMENT)
#else
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_ORDER - 1)
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_ORDER)
#endif
/*


@ -736,7 +736,7 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
struct page **pages;
unsigned int i = 0, nid = dev_to_node(dev);
order_mask &= (2U << MAX_ORDER) - 1;
order_mask &= GENMASK(MAX_ORDER, 0);
if (!order_mask)
return NULL;
@ -756,7 +756,7 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
* than a necessity, hence using __GFP_NORETRY until
* falling back to minimum-order allocations.
*/
for (order_mask &= (2U << __fls(count)) - 1;
for (order_mask &= GENMASK(__fls(count), 0);
order_mask; order_mask &= ~order_size) {
unsigned int order = __fls(order_mask);
gfp_t alloc_flags = gfp;


@ -2445,8 +2445,8 @@ static bool its_parse_indirect_baser(struct its_node *its,
* feature is not supported by hardware.
*/
new_order = max_t(u32, get_order(esz << ids), new_order);
if (new_order >= MAX_ORDER) {
new_order = MAX_ORDER - 1;
if (new_order > MAX_ORDER) {
new_order = MAX_ORDER;
ids = ilog2(PAGE_ORDER_TO_SIZE(new_order) / (int)esz);
pr_warn("ITS@%pa: %s Table too large, reduce ids %llu->%u\n",
&its->phys_base, its_base_type_string[type],


@ -1134,7 +1134,7 @@ static void __cache_size_refresh(void)
* If the allocation may fail we use __get_free_pages. Memory fragmentation
* won't have a fatal effect here, but it just causes flushes of some other
* buffers and more I/O will be performed. Don't use __get_free_pages if it
* always fails (i.e. order >= MAX_ORDER).
* always fails (i.e. order > MAX_ORDER).
*
* If the allocation shouldn't fail we use __vmalloc. This is only for the
* initial reserve allocation, so there's no risk of wasting all vmalloc
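    A sketch of the allocation strategy that comment describes, using the new
    inclusive bound (illustrative only; may_fail, block_size, data and gfp_mask
    stand in for dm-bufio's own state, and the real code adds slab caches and
    accounting around this):

        /* May-fail buffers: use the page allocator as long as the request
         * fits within the buddy limit; must-succeed (reserve) buffers, or
         * anything larger than order MAX_ORDER, fall back to vmalloc. */
        unsigned int order = get_order(block_size);

        if (may_fail && order <= MAX_ORDER)     /* order > MAX_ORDER always fails */
                data = (void *)__get_free_pages(gfp_mask, order);
        else
                data = __vmalloc(block_size, gfp_mask);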


@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs cmd root_lock).
* - must take shrinker_rwsem without holding cmd->root_lock
* - must take shrinker_mutex without holding cmd->root_lock
*/
new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
CACHE_MAX_CONCURRENT_LOCKS);


@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
* Replacement block manager (new_bm) is created and old_bm destroyed outside of
* pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
* shrinker associated with the block manager's bufio client vs pmd root_lock).
* - must take shrinker_rwsem without holding pmd->root_lock
* - must take shrinker_mutex without holding pmd->root_lock
*/
new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
THIN_MAX_CONCURRENT_LOCKS);


@ -210,7 +210,7 @@ u32 genwqe_crc32(u8 *buff, size_t len, u32 init)
void *__genwqe_alloc_consistent(struct genwqe_dev *cd, size_t size,
dma_addr_t *dma_handle)
{
if (get_order(size) >= MAX_ORDER)
if (get_order(size) > MAX_ORDER)
return NULL;
return dma_alloc_coherent(&cd->pci_dev->dev, size, dma_handle,


@ -1040,7 +1040,7 @@ static void hns3_init_tx_spare_buffer(struct hns3_enet_ring *ring)
return;
order = get_order(alloc_size);
if (order >= MAX_ORDER) {
if (order > MAX_ORDER) {
if (net_ratelimit())
dev_warn(ring_to_dev(ring), "failed to allocate tx spare buffer, exceed to max order\n");
return;


@ -75,7 +75,7 @@
* pool for the 4MB. Thus the 16 Rx and Tx queues require 32 * 5 = 160
* plus 16 for the TSO pools for a total of 176 LTB mappings per VNIC.
*/
#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << (MAX_ORDER - 1)) * PAGE_SIZE))
#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << MAX_ORDER) * PAGE_SIZE))
#define IBMVNIC_ONE_LTB_SIZE min((u32)(8 << 20), IBMVNIC_ONE_LTB_MAX)
#define IBMVNIC_LTB_SET_SIZE (38 << 20)


@ -946,7 +946,7 @@ static phys_addr_t hvfb_get_phymem(struct hv_device *hdev,
if (request_size == 0)
return -1;
if (order < MAX_ORDER) {
if (order <= MAX_ORDER) {
/* Call alloc_pages if the size is less than 2^MAX_ORDER */
page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
if (!page)
@ -977,7 +977,7 @@ static void hvfb_release_phymem(struct hv_device *hdev,
{
unsigned int order = get_order(size);
if (order < MAX_ORDER)
if (order <= MAX_ORDER)
__free_pages(pfn_to_page(paddr >> PAGE_SHIFT), order);
else
dma_free_coherent(&hdev->device,


@ -197,7 +197,7 @@ static int vmlfb_alloc_vram(struct vml_info *vinfo,
va = &vinfo->vram[i];
order = 0;
while (requested > (PAGE_SIZE << order) && order < MAX_ORDER)
while (requested > (PAGE_SIZE << order) && order <= MAX_ORDER)
order++;
err = vmlfb_alloc_vram_area(va, order, 0);


@ -33,7 +33,7 @@
#define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
__GFP_NOMEMALLOC)
/* The order of free page blocks to report to host */
#define VIRTIO_BALLOON_HINT_BLOCK_ORDER (MAX_ORDER - 1)
#define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_ORDER
/* The size of a free page block in bytes */
#define VIRTIO_BALLOON_HINT_BLOCK_BYTES \
(1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT))
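    With 4 KiB pages and MAX_ORDER = 10 (assumed defaults), each free-page
    block reported to the host is therefore 1 << (10 + 12) bytes = 4 MiB. The
    size itself is unchanged by this series: under the old exclusive
    definition, MAX_ORDER - 1 was the same order; only the spelling of the
    constant changes.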


@ -1120,13 +1120,13 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
*/
static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
{
unsigned long order = MAX_ORDER - 1;
unsigned long order = MAX_ORDER;
unsigned long i;
/*
* We might get called for ranges that don't cover properly aligned
* MAX_ORDER - 1 pages; however, we can only online properly aligned
* pages with an order of MAX_ORDER - 1 at maximum.
* MAX_ORDER pages; however, we can only online properly aligned
* pages with an order of MAX_ORDER at maximum.
*/
while (!IS_ALIGNED(pfn | nr_pages, 1 << order))
order--;
@ -1237,9 +1237,9 @@ static void virtio_mem_online_page(struct virtio_mem *vm,
bool do_online;
/*
* We can get called with any order up to MAX_ORDER - 1. If our
* subblock size is smaller than that and we have a mixture of plugged
* and unplugged subblocks within such a page, we have to process in
* We can get called with any order up to MAX_ORDER. If our subblock
* size is smaller than that and we have a mixture of plugged and
* unplugged subblocks within such a page, we have to process in
* smaller granularity. In that case we'll adjust the order exactly once
* within the loop.
*/
