-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the performance of switching from a user process to a kernel thread.
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 - zsmalloc performance improvements from Sergey Senozhatsky.
 - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables.
 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful
 - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'.
 - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits.
 - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n).
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd, permitting userspace to write-protect unpopulated (none) ptes in anonymous memory.
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning.
 - Axel Rasmussen gives userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte.
 - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness.
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c.
 - Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more.
 - Lorenzo has also converted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases.
 - Lorenzo has also contributed further cleanups of vma_merge().
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_pages(), a step towards RCUification of pagefaults.
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking.
 - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads.
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic.
 - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used.
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem.
 - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths.
 - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing.
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 - Peter Xu contributed some rationalizations of the userfaultfd selftests.
 - Yosry Ahmed has fixed an issue around memcg's page reclaim accounting.
 - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code.
 - Longlong Xia has improved the way in which KSM handles hwpoisoned pages.
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
commit 7fa8a8ee94
@@ -51,3 +51,11 @@ Description:	Control merging pages across different NUMA nodes.
 		When it is set to 0 only pages from the same node are merged,
 		otherwise pages from all nodes can be merged together (default).
+
+What:		/sys/kernel/mm/ksm/general_profit
+Date:		April 2023
+KernelVersion:	6.4
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Measure how effective KSM is.
+		general_profit: how effective is KSM. The formula for the
+		calculation is in Documentation/admin-guide/mm/ksm.rst.
@@ -172,7 +172,7 @@ variables.
 Offset of the free_list's member. This value is used to compute the number
 of free pages.
 
-Each zone has a free_area structure array called free_area[MAX_ORDER].
+Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
 The free_list represents a linked list of free page blocks.
 
 (list_head, next|prev)

@@ -189,8 +189,8 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
 information. Makedumpfile gets the start address of the vmalloc region
 from this.
 
-(zone.free_area, MAX_ORDER)
----------------------------
+(zone.free_area, MAX_ORDER + 1)
+-------------------------------
 
 Free areas descriptor. User-space tools use this value to iterate the
 free_area ranges. MAX_ORDER is used by the zone buddy allocator.
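
To make the new inclusive meaning concrete, here is a small worked sketch (not
part of the patch; it assumes the common 4 KiB page size and a MAX_ORDER of
10)::

  #include <stdio.h>

  #define PAGE_SHIFT 12   /* 4 KiB pages (assumption for illustration) */
  #define MAX_ORDER  10   /* the new default on many configurations */

  int main(void)
  {
          /* The largest buddy allocation is now 2^MAX_ORDER pages... */
          unsigned long max_block = 1UL << (MAX_ORDER + PAGE_SHIFT);

          /* ...and a zone's free_area[] needs MAX_ORDER + 1 entries,
           * covering orders 0 through MAX_ORDER inclusive. */
          int free_area_entries = MAX_ORDER + 1;

          printf("largest block: %lu bytes, free_area entries: %d\n",
                 max_block, free_area_entries);
          return 0;
  }
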
@@ -4012,7 +4012,7 @@
 			[KNL] Minimal page reporting order
 			Format: <integer>
 			Adjust the minimal page reporting order. The page
-			reporting is disabled when it exceeds (MAX_ORDER-1).
+			reporting is disabled when it exceeds MAX_ORDER.
 
 	panic=		[KNL] Kernel behaviour on panic: delay <timeout>
 			timeout > 0: seconds before rebooting
@@ -157,6 +157,8 @@ stable_node_chains_prune_millisecs
 
 The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
 
+general_profit
+        how effective is KSM. The calculation is explained below.
 pages_shared
         how many shared pages are being used
 pages_sharing

@@ -207,7 +209,8 @@ several times, which are unprofitable memory consumed.
 	ksm_rmap_items * sizeof(rmap_item).
 
 where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
-and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``.
+and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
+is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
 
 From the perspective of application, a high ratio of ``ksm_rmap_items`` to
 ``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
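
To see the new knob in practice, a userspace sketch (not part of this patch;
the file names are those documented above, and the gross-saving figure is only
an upper bound before metadata costs are subtracted)::

  #include <stdio.h>
  #include <unistd.h>

  static long read_long(const char *path)
  {
          FILE *f = fopen(path, "r");
          long v = -1;

          if (f && fscanf(f, "%ld", &v) != 1)
                  v = -1;
          if (f)
                  fclose(f);
          return v;
  }

  int main(void)
  {
          long page = sysconf(_SC_PAGESIZE);
          long sharing = read_long("/sys/kernel/mm/ksm/pages_sharing");
          long profit = read_long("/sys/kernel/mm/ksm/general_profit");

          /* pages_sharing * page size approximates the gross saving;
           * general_profit already accounts for rmap_item overhead per
           * the formula in this document. */
          printf("gross saving: %ld bytes, general_profit: %ld\n",
                 sharing * page, profit);
          return 0;
  }
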
@ -219,6 +219,31 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
|
|||
you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
|
||||
used.
|
||||
|
||||
Userfaultfd write-protect mode currently behave differently on none ptes
|
||||
(when e.g. page is missing) over different types of memories.
|
||||
|
||||
For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
|
||||
(e.g. when pages are missing and not populated). For file-backed memories
|
||||
like shmem and hugetlbfs, none ptes will be write protected just like a
|
||||
present pte. In other words, there will be a userfaultfd write fault
|
||||
message generated when writing to a missing page on file typed memories,
|
||||
as long as the page range was write-protected before. Such a message will
|
||||
not be generated on anonymous memories by default.
|
||||
|
||||
If the application wants to be able to write protect none ptes on anonymous
|
||||
memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On
|
||||
newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED
|
||||
and set the feature bit in advance to make sure none ptes will also be
|
||||
write protected even upon anonymous memory.
|
||||
|
||||
When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either
|
||||
``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when
|
||||
resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE``
|
||||
respectively, it may be desirable for the new page / mapping to be
|
||||
write-protected (so future writes will also result in a WP fault). These ioctls
|
||||
support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
|
||||
respectively) to configure the mapping this way.
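
A minimal userspace sketch of the flow described above (assumes a kernel that
advertises ``UFFD_FEATURE_WP_UNPOPULATED``; the mapping size is arbitrary and
error handling is omitted for brevity)::

  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
          long page = sysconf(_SC_PAGESIZE);
          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

          /* Opt in to write-protecting none (unpopulated) ptes on anon memory. */
          struct uffdio_api api = {
                  .api = UFFD_API,
                  .features = UFFD_FEATURE_WP_UNPOPULATED,
          };
          ioctl(uffd, UFFDIO_API, &api);

          char *area = mmap(NULL, 16 * page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* Register the range for write-protect tracking... */
          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)area, .len = 16 * page },
                  .mode = UFFDIO_REGISTER_MODE_WP,
          };
          ioctl(uffd, UFFDIO_REGISTER, &reg);

          /* ...and write-protect it.  With WP_UNPOPULATED the pages that were
           * never faulted in are covered too, so no MADV_POPULATE_READ pass
           * is needed first. */
          struct uffdio_writeprotect wp = {
                  .range = { .start = (unsigned long)area, .len = 16 * page },
                  .mode = UFFDIO_WRITEPROTECT_MODE_WP,
          };
          ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

          /* A later write anywhere in "area" now produces a userfaultfd
           * message with UFFD_PAGEFAULT_FLAG_WP set. */
          return 0;
  }
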
|
||||
|
||||
QEMU/KVM
|
||||
========
|
||||
|
||||
|
|
|
@ -575,20 +575,26 @@ The field width is passed by value, the bitmap is passed by reference.
|
|||
Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease
|
||||
printing cpumask and nodemask.
|
||||
|
||||
Flags bitfields such as page flags, gfp_flags
|
||||
---------------------------------------------
|
||||
Flags bitfields such as page flags, page_type, gfp_flags
|
||||
--------------------------------------------------------
|
||||
|
||||
::
|
||||
|
||||
%pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff)
|
||||
%pGt 0xffffff7f(buddy)
|
||||
%pGg GFP_USER|GFP_DMA32|GFP_NOWARN
|
||||
%pGv read|exec|mayread|maywrite|mayexec|denywrite
|
||||
|
||||
For printing flags bitfields as a collection of symbolic constants that
|
||||
would construct the value. The type of flags is given by the third
|
||||
character. Currently supported are [p]age flags, [v]ma_flags (both
|
||||
expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag
|
||||
names and print order depends on the particular type.
|
||||
character. Currently supported are:
|
||||
|
||||
- p - [p]age flags, expects value of type (``unsigned long *``)
|
||||
- t - page [t]ype, expects value of type (``unsigned int *``)
|
||||
- v - [v]ma_flags, expects value of type (``unsigned long *``)
|
||||
- g - [g]fp_flags, expects value of type (``gfp_t *``)
|
||||
|
||||
The flag names and print order depends on the particular type.
|
||||
|
||||
Note that this format should not be used directly in the
|
||||
:c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags()
|
||||
|
|
|
@ -645,7 +645,7 @@ ops mmap_lock PageLocked(page)
|
|||
open: yes
|
||||
close: yes
|
||||
fault: yes can return with page locked
|
||||
map_pages: yes
|
||||
map_pages: read
|
||||
page_mkwrite: yes can return with page locked
|
||||
pfn_mkwrite: yes
|
||||
access: yes
|
||||
|
@ -661,7 +661,7 @@ locked. The VM will unlock the page.
|
|||
|
||||
->map_pages() is called when VM asks to map easy accessible pages.
|
||||
Filesystem should find and map pages associated with offsets from "start_pgoff"
|
||||
till "end_pgoff". ->map_pages() is called with page table locked and must
|
||||
till "end_pgoff". ->map_pages() is called with the RCU lock held and must
|
||||
not block. If it's not possible to reach a page without blocking,
|
||||
filesystem should skip it. Filesystem should use do_set_pte() to setup
|
||||
page table entry. Pointer to entry associated with the page is passed in
|
||||
|
|
|
@ -996,6 +996,7 @@ Example output. You may not have all of these fields.
|
|||
VmallocUsed: 40444 kB
|
||||
VmallocChunk: 0 kB
|
||||
Percpu: 29312 kB
|
||||
EarlyMemtestBad: 0 kB
|
||||
HardwareCorrupted: 0 kB
|
||||
AnonHugePages: 4149248 kB
|
||||
ShmemHugePages: 0 kB
|
||||
|
@ -1146,6 +1147,13 @@ VmallocChunk
|
|||
Percpu
|
||||
Memory allocated to the percpu allocator used to back percpu
|
||||
allocations. This stat excludes the cost of metadata.
|
||||
EarlyMemtestBad
|
||||
The amount of RAM/memory in kB, that was identified as corrupted
|
||||
by early memtest. If memtest was not run, this field will not
|
||||
be displayed at all. Size is never rounded down to 0 kB.
|
||||
That means if 0 kB is reported, you can safely assume
|
||||
there was at least one pass of memtest and none of the passes
|
||||
found a single faulty byte of RAM.
|
||||
HardwareCorrupted
|
||||
The amount of RAM/memory in KB, the kernel identifies as
|
||||
corrupted.
|
||||
|
|
|
@ -13,17 +13,29 @@ everything stored therein is lost.
|
|||
|
||||
tmpfs puts everything into the kernel internal caches and grows and
|
||||
shrinks to accommodate the files it contains and is able to swap
|
||||
unneeded pages out to swap space. It has maximum size limits which can
|
||||
be adjusted on the fly via 'mount -o remount ...'
|
||||
unneeded pages out to swap space, if swap was enabled for the tmpfs
|
||||
mount. tmpfs also supports THP.
|
||||
|
||||
If you compare it to ramfs (which was the template to create tmpfs)
|
||||
you gain swapping and limit checking. Another similar thing is the RAM
|
||||
disk (/dev/ram*), which simulates a fixed size hard disk in physical
|
||||
RAM, where you have to create an ordinary filesystem on top. Ramdisks
|
||||
cannot swap and you do not have the possibility to resize them.
|
||||
tmpfs extends ramfs with a few userspace configurable options listed and
|
||||
explained further below, some of which can be reconfigured dynamically on the
|
||||
fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
|
||||
filesystem can be resized but it cannot be resized to a size below its current
|
||||
usage. tmpfs also supports POSIX ACLs, and extended attributes for the
|
||||
trusted.* and security.* namespaces. ramfs does not use swap and you cannot
|
||||
modify any parameter for a ramfs filesystem. The size limit of a ramfs
|
||||
filesystem is how much memory you have available, and so care must be taken if
|
||||
used so to not run out of memory.
|
||||
|
||||
Since tmpfs lives completely in the page cache and on swap, all tmpfs
|
||||
pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
|
||||
An alternative to tmpfs and ramfs is to use brd to create RAM disks
|
||||
(/dev/ram*), which allows you to simulate a block device disk in physical RAM.
|
||||
To write data you would just then need to create an regular filesystem on top
|
||||
this ramdisk. As with ramfs, brd ramdisks cannot swap. brd ramdisks are also
|
||||
configured in size at initialization and you cannot dynamically resize them.
|
||||
Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the
|
||||
block layer at all.
|
||||
|
||||
Since tmpfs lives completely in the page cache and optionally on swap,
|
||||
all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
|
||||
free(1). Notice that these counters also include shared memory
|
||||
(shmem, see ipcs(1)). The most reliable way to get the count is
|
||||
using df(1) and du(1).
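
As a concrete example of swap being optional, the noswap mount option added by
this series can be used as follows (a sketch, not from the patch; the mount
point and size are arbitrary and the call needs privilege)::

  #include <stdio.h>
  #include <sys/mount.h>

  int main(void)
  {
          /* Shell equivalent: mount -t tmpfs -o size=64m,noswap tmpfs /mnt/tmp */
          if (mount("tmpfs", "/mnt/tmp", "tmpfs", 0, "size=64m,noswap") != 0) {
                  perror("mount");
                  return 1;
          }
          puts("tmpfs mounted; its pages will never be written to swap");
          return 0;
  }
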
|
||||
|
@ -72,6 +84,8 @@ nr_inodes The maximum number of inodes for this instance. The default
|
|||
is half of the number of your physical RAM pages, or (on a
|
||||
machine with highmem) the number of lowmem RAM pages,
|
||||
whichever is the lower.
|
||||
noswap Disables swap. Remounts must respect the original settings.
|
||||
By default swap is enabled.
|
||||
========= ============================================================
|
||||
|
||||
These parameters accept a suffix k, m or g for kilo, mega and giga and
|
||||
|
@ -85,6 +99,36 @@ mount with such options, since it allows any user with write access to
|
|||
use up all the memory on the machine; but enhances the scalability of
|
||||
that instance in a system with many CPUs making intensive use of it.
|
||||
|
||||
tmpfs also supports Transparent Huge Pages which requires a kernel
|
||||
configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge supported for
|
||||
your system (has_transparent_hugepage(), which is architecture specific).
|
||||
The mount options for this are:
|
||||
|
||||
====== ============================================================
|
||||
huge=0 never: disables huge pages for the mount
|
||||
huge=1 always: enables huge pages for the mount
|
||||
huge=2 within_size: only allocate huge pages if the page will be
|
||||
fully within i_size, also respect fadvise()/madvise() hints.
|
||||
huge=3 advise: only allocate huge pages if requested with
|
||||
fadvise()/madvise()
|
||||
====== ============================================================
|
||||
|
||||
There is a sysfs file which you can also use to control system wide THP
|
||||
configuration for all tmpfs mounts, the file is:
|
||||
|
||||
/sys/kernel/mm/transparent_hugepage/shmem_enabled
|
||||
|
||||
This sysfs file is placed on top of THP sysfs directory and so is registered
|
||||
by THP code. It is however only used to control all tmpfs mounts with one
|
||||
single knob. Since it controls all tmpfs mounts it should only be used either
|
||||
for emergency or testing purposes. The values you can set for shmem_enabled are:
|
||||
|
||||
== ============================================================
|
||||
-1 deny: disables huge on shm_mnt and all mounts, for
|
||||
emergency use
|
||||
-2 force: enables huge on shm_mnt and all mounts, w/o needing
|
||||
option, for testing
|
||||
== ============================================================
|
||||
|
||||
tmpfs has a mount option to set the NUMA memory allocation policy for
|
||||
all files in that instance (if CONFIG_NUMA is enabled) - which can be
|
||||
|
|
|
@ -2,6 +2,12 @@
|
|||
Active MM
|
||||
=========
|
||||
|
||||
Note, the mm_count refcount may no longer include the "lazy" users
|
||||
(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
|
||||
with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
|
||||
references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
|
||||
helpers, which abstract this config option.
|
||||
|
||||
::
|
||||
|
||||
List: linux-kernel
|
||||
|
|
|
@ -214,7 +214,7 @@ HugeTLB Page Table Helpers
|
|||
+---------------------------+--------------------------------------------------+
|
||||
| pte_huge | Tests a HugeTLB |
|
||||
+---------------------------+--------------------------------------------------+
|
||||
| pte_mkhuge | Creates a HugeTLB |
|
||||
| arch_make_huge_pte | Creates a HugeTLB |
|
||||
+---------------------------+--------------------------------------------------+
|
||||
| huge_pte_dirty | Tests a dirty HugeTLB |
|
||||
+---------------------------+--------------------------------------------------+
|
||||
|
|
|
@ -103,7 +103,8 @@ moving across tiers only involves atomic operations on
|
|||
``folio->flags`` and therefore has a negligible cost. A feedback loop
|
||||
modeled after the PID controller monitors refaults over all the tiers
|
||||
from anon and file types and decides which tiers from which types to
|
||||
evict or protect.
|
||||
evict or protect. The desired effect is to balance refault percentages
|
||||
between anon and file types proportional to the swappiness level.
|
||||
|
||||
There are two conceptually independent procedures: the aging and the
|
||||
eviction. They form a closed-loop system, i.e., the page reclaim.
|
||||
|
@ -156,6 +157,27 @@ This time-based approach has the following advantages:
|
|||
and memory sizes.
|
||||
2. It is more reliable because it is directly wired to the OOM killer.
|
||||
|
||||
``mm_struct`` list
|
||||
------------------
|
||||
An ``mm_struct`` list is maintained for each memcg, and an
|
||||
``mm_struct`` follows its owner task to the new memcg when this task
|
||||
is migrated.
|
||||
|
||||
A page table walker iterates ``lruvec_memcg()->mm_list`` and calls
|
||||
``walk_page_range()`` with each ``mm_struct`` on this list to scan
|
||||
PTEs. When multiple page table walkers iterate the same list, each of
|
||||
them gets a unique ``mm_struct``, and therefore they can run in
|
||||
parallel.
|
||||
|
||||
Page table walkers ignore any misplaced pages, e.g., if an
|
||||
``mm_struct`` was migrated, pages left in the previous memcg will be
|
||||
ignored when the current memcg is under reclaim. Similarly, page table
|
||||
walkers will ignore pages from nodes other than the one under reclaim.
|
||||
|
||||
This infrastructure also tracks the usage of ``mm_struct`` between
|
||||
context switches so that page table walkers can skip processes that
|
||||
have been sleeping since the last iteration.
|
||||
|
||||
Rmap/PT walk feedback
|
||||
---------------------
|
||||
Searching the rmap for PTEs mapping each page on an LRU list (to test
|
||||
|
@ -170,7 +192,7 @@ promotes hot pages. If the scan was done cacheline efficiently, it
|
|||
adds the PMD entry pointing to the PTE table to the Bloom filter. This
|
||||
forms a feedback loop between the eviction and the aging.
|
||||
|
||||
Bloom Filters
|
||||
Bloom filters
|
||||
-------------
|
||||
Bloom filters are a space and memory efficient data structure for set
|
||||
membership test, i.e., test if an element is not in the set or may be
|
||||
|
@ -186,6 +208,18 @@ is false positive, the cost is an additional scan of a range of PTEs,
|
|||
which may yield hot pages anyway. Parameters of the filter itself can
|
||||
control the false positive rate in the limit.
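
As a generic illustration of the data structure (a toy sketch with arbitrary
sizes and hash constants, not the kernel's implementation)::

  #include <stdbool.h>
  #include <stdint.h>

  #define BLOOM_BITS (1u << 15)            /* 32768 bits, arbitrary */

  static uint64_t bloom[BLOOM_BITS / 64];

  /* Two cheap hashes; the constants are just large odd multipliers, and the
   * shift keeps the 15 most significant bits so indices stay in range. */
  static uint32_t hash1(uint64_t key) { return (key * 0x9E3779B97F4A7C15ull) >> 49; }
  static uint32_t hash2(uint64_t key) { return (key * 0xC2B2AE3D27D4EB4Full) >> 49; }

  static void bloom_set(uint64_t key)
  {
          uint32_t a = hash1(key), b = hash2(key);

          bloom[a / 64] |= 1ull << (a % 64);
          bloom[b / 64] |= 1ull << (b % 64);
  }

  /* false: definitely never inserted; true: possibly inserted.  A false
   * positive here only costs an extra scan of a range of PTEs. */
  static bool bloom_test(uint64_t key)
  {
          uint32_t a = hash1(key), b = hash2(key);

          return (bloom[a / 64] >> (a % 64) & 1) &&
                 (bloom[b / 64] >> (b % 64) & 1);
  }
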
|
||||
|
||||
PID controller
|
||||
--------------
|
||||
A feedback loop modeled after the Proportional-Integral-Derivative
|
||||
(PID) controller monitors refaults over anon and file types and
|
||||
decides which type to evict when both types are available from the
|
||||
same generation.
|
||||
|
||||
The PID controller uses generations rather than the wall clock as the
|
||||
time domain because a CPU can scan pages at different rates under
|
||||
varying memory pressure. It calculates a moving average for each new
|
||||
generation to avoid being permanently locked in a suboptimal state.
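
A deliberately simplified model of that decision (illustrative only; the
kernel's state, gains and swappiness handling differ)::

  #include <stdio.h>

  struct refault_avg {
          double anon;
          double file;
  };

  /* Fold the refault rate observed in the latest generation into a moving
   * average, so one noisy generation cannot lock the choice in place. */
  static void update(struct refault_avg *avg, double anon_rate, double file_rate)
  {
          const double gain = 0.5;        /* arbitrary smoothing factor */

          avg->anon = gain * anon_rate + (1 - gain) * avg->anon;
          avg->file = gain * file_rate + (1 - gain) * avg->file;
  }

  /* Weight the averages by swappiness (0..200): higher swappiness tolerates
   * proportionally more anon refaults before file pages are preferred. */
  static int evict_anon(const struct refault_avg *avg, int swappiness)
  {
          return avg->anon * (200 - swappiness) < avg->file * swappiness;
  }

  int main(void)
  {
          struct refault_avg avg = { 0.0, 0.0 };

          update(&avg, 0.02, 0.10);       /* anon refaults 2%, file 10% */
          printf("evict %s first\n", evict_anon(&avg, 60) ? "anon" : "file");
          return 0;
  }
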
|
||||
|
||||
Memcg LRU
|
||||
---------
|
||||
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
|
||||
|
@ -223,9 +257,9 @@ parts:
|
|||
|
||||
* Generations
|
||||
* Rmap walks
|
||||
* Page table walks
|
||||
* Bloom filters
|
||||
* PID controller
|
||||
* Page table walks via ``mm_struct`` list
|
||||
* Bloom filters for rmap/PT walk feedback
|
||||
* PID controller for refault feedback
|
||||
|
||||
The aging and the eviction form a producer-consumer model;
|
||||
specifically, the latter drives the former by the sliding window over
|
||||
|
|
|
@ -42,6 +42,8 @@ The unevictable list addresses the following classes of unevictable pages:
|
|||
|
||||
* Those owned by ramfs.
|
||||
|
||||
* Those owned by tmpfs with the noswap mount option.
|
||||
|
||||
* Those mapped into SHM_LOCK'd shared memory regions.
|
||||
|
||||
* Those mapped into VM_LOCKED [mlock()ed] VMAs.
|
||||
|
|
|
@ -13457,13 +13457,14 @@ F: arch/powerpc/include/asm/membarrier.h
|
|||
F: include/uapi/linux/membarrier.h
|
||||
F: kernel/sched/membarrier.c
|
||||
|
||||
MEMBLOCK
|
||||
MEMBLOCK AND MEMORY MANAGEMENT INITIALIZATION
|
||||
M: Mike Rapoport <rppt@kernel.org>
|
||||
L: linux-mm@kvack.org
|
||||
S: Maintained
|
||||
F: Documentation/core-api/boot-time-mm.rst
|
||||
F: include/linux/memblock.h
|
||||
F: mm/memblock.c
|
||||
F: mm/mm_init.c
|
||||
F: tools/testing/memblock/
|
||||
|
||||
MEMORY CONTROLLER DRIVERS
|
||||
|
@ -13498,6 +13499,7 @@ F: include/linux/memory_hotplug.h
|
|||
F: include/linux/mm.h
|
||||
F: include/linux/mmzone.h
|
||||
F: include/linux/pagewalk.h
|
||||
F: include/trace/events/ksm.h
|
||||
F: mm/
|
||||
F: tools/mm/
|
||||
F: tools/testing/selftests/mm/
|
||||
|
@ -13506,6 +13508,7 @@ VMALLOC
|
|||
M: Andrew Morton <akpm@linux-foundation.org>
|
||||
R: Uladzislau Rezki <urezki@gmail.com>
|
||||
R: Christoph Hellwig <hch@infradead.org>
|
||||
R: Lorenzo Stoakes <lstoakes@gmail.com>
|
||||
L: linux-mm@kvack.org
|
||||
S: Maintained
|
||||
W: http://www.linux-mm.org
|
||||
|
|
32	arch/Kconfig
|
@ -465,6 +465,38 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
|
|||
irqs disabled over activate_mm. Architectures that do IPI based TLB
|
||||
shootdowns should enable this.
|
||||
|
||||
# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
|
||||
# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
|
||||
# to/from kernel threads when the same mm is running on a lot of CPUs (a large
|
||||
# multi-threaded application), by reducing contention on the mm refcount.
|
||||
#
|
||||
# This can be disabled if the architecture ensures no CPUs are using an mm as a
|
||||
# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
|
||||
# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
|
||||
# final exit(2) TLB flush, for example.
|
||||
#
|
||||
# To implement this, an arch *must*:
|
||||
# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when manipulating
|
||||
# the lazy tlb reference of a kthread's ->active_mm (non-arch code has been
|
||||
# converted already).
|
||||
config MMU_LAZY_TLB_REFCOUNT
|
||||
def_bool y
|
||||
depends on !MMU_LAZY_TLB_SHOOTDOWN
|
||||
|
||||
# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
|
||||
# mm as a lazy tlb beyond its last reference count, by shooting down these
|
||||
# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
|
||||
# be using the mm as a lazy tlb, so that they may switch themselves to using
|
||||
# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
|
||||
# may be using mm as a lazy tlb mm.
|
||||
#
|
||||
# To implement this, an arch *must*:
|
||||
# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
|
||||
# at least all possible CPUs in which the mm is lazy.
|
||||
# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
|
||||
config MMU_LAZY_TLB_SHOOTDOWN
|
||||
bool
|
||||
|
||||
config ARCH_HAVE_NMI_SAFE_CMPXCHG
|
||||
bool
|
||||
|
||||
|
|
|
@ -556,7 +556,7 @@ endmenu # "ARC Architecture Configuration"
|
|||
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int "Maximum zone order"
|
||||
default "12" if ARC_HUGEPAGE_16M
|
||||
default "11"
|
||||
default "11" if ARC_HUGEPAGE_16M
|
||||
default "10"
|
||||
|
||||
source "kernel/power/Kconfig"
|
||||
|
|
|
@ -74,11 +74,6 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
|
|||
base, TO_MB(size), !in_use ? "Not used":"");
|
||||
}
|
||||
|
||||
bool arch_has_descending_max_zone_pfns(void)
|
||||
{
|
||||
return !IS_ENABLED(CONFIG_ARC_HAS_PAE40);
|
||||
}
|
||||
|
||||
/*
|
||||
* First memory setup routine called from setup_arch()
|
||||
* 1. setup swapper's mm @init_mm
|
||||
|
|
|
@ -1352,20 +1352,19 @@ config ARM_MODULE_PLTS
|
|||
configurations. If unsure, say y.
|
||||
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int "Maximum zone order"
|
||||
default "12" if SOC_AM33XX
|
||||
default "9" if SA1111
|
||||
default "11"
|
||||
int "Order of maximal physically contiguous allocations"
|
||||
default "11" if SOC_AM33XX
|
||||
default "8" if SA1111
|
||||
default "10"
|
||||
help
|
||||
The kernel memory allocator divides physically contiguous memory
|
||||
blocks into "zones", where each zone is a power of two number of
|
||||
pages. This option selects the largest power of two that the kernel
|
||||
keeps in the memory allocator. If you need to allocate very large
|
||||
blocks of physically contiguous memory, then you may need to
|
||||
increase this value.
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
large blocks of physically contiguous memory is required.
|
||||
|
||||
This config option is actually maximum order plus one. For example,
|
||||
a value of 11 means that the largest free memory block is 2^10 pages.
|
||||
Don't change if unsure.
|
||||
|
||||
config ALIGNMENT_TRAP
|
||||
def_bool CPU_CP15_MMU
|
||||
|
|
|
@ -31,7 +31,7 @@ CONFIG_SOC_VF610=y
|
|||
CONFIG_SMP=y
|
||||
CONFIG_ARM_PSCI=y
|
||||
CONFIG_HIGHMEM=y
|
||||
CONFIG_ARCH_FORCE_MAX_ORDER=14
|
||||
CONFIG_ARCH_FORCE_MAX_ORDER=13
|
||||
CONFIG_CMDLINE="noinitrd console=ttymxc0,115200"
|
||||
CONFIG_KEXEC=y
|
||||
CONFIG_CPU_FREQ=y
|
||||
|
|
|
@ -26,7 +26,7 @@ CONFIG_THUMB2_KERNEL=y
|
|||
# CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11 is not set
|
||||
# CONFIG_ARM_PATCH_IDIV is not set
|
||||
CONFIG_HIGHMEM=y
|
||||
CONFIG_ARCH_FORCE_MAX_ORDER=12
|
||||
CONFIG_ARCH_FORCE_MAX_ORDER=11
|
||||
CONFIG_SECCOMP=y
|
||||
CONFIG_KEXEC=y
|
||||
CONFIG_EFI=y
|
||||
|
|
|
@ -20,7 +20,7 @@ CONFIG_PXA_SHARPSL=y
|
|||
CONFIG_MACH_AKITA=y
|
||||
CONFIG_MACH_BORZOI=y
|
||||
CONFIG_AEABI=y
|
||||
CONFIG_ARCH_FORCE_MAX_ORDER=9
|
||||
CONFIG_ARCH_FORCE_MAX_ORDER=8
|
||||
CONFIG_CMDLINE="root=/dev/ram0 ro"
|
||||
CONFIG_KEXEC=y
|
||||
CONFIG_CPU_FREQ=y
|
||||
|
|
|
@ -19,7 +19,7 @@ CONFIG_ATMEL_CLOCKSOURCE_TCB=y
|
|||
# CONFIG_CACHE_L2X0 is not set
|
||||
# CONFIG_ARM_PATCH_IDIV is not set
|
||||
# CONFIG_CPU_SW_DOMAIN_PAN is not set
|
||||
CONFIG_ARCH_FORCE_MAX_ORDER=15
|
||||
CONFIG_ARCH_FORCE_MAX_ORDER=14
|
||||
CONFIG_UACCESS_WITH_MEMCPY=y
|
||||
# CONFIG_ATAGS is not set
|
||||
CONFIG_CMDLINE="console=ttyS0,115200 earlyprintk ignore_loglevel"
|
||||
|
|
|
@ -17,7 +17,7 @@ CONFIG_ARCH_SUNPLUS=y
|
|||
# CONFIG_VDSO is not set
|
||||
CONFIG_SMP=y
|
||||
CONFIG_THUMB2_KERNEL=y
|
||||
CONFIG_ARCH_FORCE_MAX_ORDER=12
|
||||
CONFIG_ARCH_FORCE_MAX_ORDER=11
|
||||
CONFIG_VFP=y
|
||||
CONFIG_NEON=y
|
||||
CONFIG_MODULES=y
|
||||
|
|
|
@ -253,7 +253,7 @@ static int ecard_init_mm(void)
|
|||
current->mm = mm;
|
||||
current->active_mm = mm;
|
||||
activate_mm(active_mm, mm);
|
||||
mmdrop(active_mm);
|
||||
mmdrop_lazy_tlb(active_mm);
|
||||
ecard_init_pgtables(mm);
|
||||
return 0;
|
||||
}
|
||||
|
|
|
@ -95,6 +95,7 @@ config ARM64
|
|||
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
|
||||
select ARCH_SUPPORTS_NUMA_BALANCING
|
||||
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
|
||||
select ARCH_SUPPORTS_PER_VMA_LOCK
|
||||
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
|
||||
select ARCH_WANT_DEFAULT_BPF_JIT
|
||||
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
|
||||
|
@ -1505,39 +1506,34 @@ config XEN
|
|||
|
||||
# include/linux/mmzone.h requires the following to be true:
|
||||
#
|
||||
# MAX_ORDER - 1 + PAGE_SHIFT <= SECTION_SIZE_BITS
|
||||
# MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
|
||||
#
|
||||
# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS + 1 - PAGE_SHIFT:
|
||||
# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT:
|
||||
#
|
||||
# | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_ORDER | default MAX_ORDER |
|
||||
# ----+-------------------+--------------+-----------------+--------------------+
|
||||
# 4K | 27 | 12 | 16 | 11 |
|
||||
# 16K | 27 | 14 | 14 | 12 |
|
||||
# 64K | 29 | 16 | 14 | 14 |
|
||||
# 4K | 27 | 12 | 15 | 10 |
|
||||
# 16K | 27 | 14 | 13 | 11 |
|
||||
# 64K | 29 | 16 | 13 | 13 |
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int "Maximum zone order" if ARM64_4K_PAGES || ARM64_16K_PAGES
|
||||
default "14" if ARM64_64K_PAGES
|
||||
range 12 14 if ARM64_16K_PAGES
|
||||
default "12" if ARM64_16K_PAGES
|
||||
range 11 16 if ARM64_4K_PAGES
|
||||
default "11"
|
||||
int "Order of maximal physically contiguous allocations" if EXPERT && (ARM64_4K_PAGES || ARM64_16K_PAGES)
|
||||
default "13" if ARM64_64K_PAGES
|
||||
default "11" if ARM64_16K_PAGES
|
||||
default "10"
|
||||
help
|
||||
The kernel memory allocator divides physically contiguous memory
|
||||
blocks into "zones", where each zone is a power of two number of
|
||||
pages. This option selects the largest power of two that the kernel
|
||||
keeps in the memory allocator. If you need to allocate very large
|
||||
blocks of physically contiguous memory, then you may need to
|
||||
increase this value.
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
large blocks of physically contiguous memory is required.
|
||||
|
||||
This config option is actually maximum order plus one. For example,
|
||||
a value of 11 means that the largest free memory block is 2^10 pages.
|
||||
The maximal size of allocation cannot exceed the size of the
|
||||
section, so the value of MAX_ORDER should satisfy
|
||||
|
||||
We make sure that we can allocate up to a HugePage size for each configuration.
|
||||
Hence we have :
|
||||
MAX_ORDER = (PMD_SHIFT - PAGE_SHIFT) + 1 => PAGE_SHIFT - 2
|
||||
MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
|
||||
|
||||
However for 4K, we choose a higher default value, 11 as opposed to 10, giving us
|
||||
4M allocations matching the default size used by generic code.
|
||||
Don't change if unsure.
|
||||
|
||||
config UNMAP_KERNEL_AT_EL0
|
||||
bool "Unmap kernel when running in userspace (aka \"KAISER\")" if EXPERT
|
||||
|
|
|
@ -261,9 +261,11 @@ static inline const void *__tag_set(const void *addr, u8 tag)
|
|||
}
|
||||
|
||||
#ifdef CONFIG_KASAN_HW_TAGS
|
||||
#define arch_enable_tagging_sync() mte_enable_kernel_sync()
|
||||
#define arch_enable_tagging_async() mte_enable_kernel_async()
|
||||
#define arch_enable_tagging_asymm() mte_enable_kernel_asymm()
|
||||
#define arch_enable_tag_checks_sync() mte_enable_kernel_sync()
|
||||
#define arch_enable_tag_checks_async() mte_enable_kernel_async()
|
||||
#define arch_enable_tag_checks_asymm() mte_enable_kernel_asymm()
|
||||
#define arch_suppress_tag_checks_start() mte_enable_tco()
|
||||
#define arch_suppress_tag_checks_stop() mte_disable_tco()
|
||||
#define arch_force_async_tag_fault() mte_check_tfsr_exit()
|
||||
#define arch_get_random_tag() mte_get_random_tag()
|
||||
#define arch_get_mem_tag(addr) mte_get_mem_tag(addr)
|
||||
|
|
|
@ -13,8 +13,73 @@
|
|||
|
||||
#include <linux/types.h>
|
||||
|
||||
#ifdef CONFIG_KASAN_HW_TAGS
|
||||
|
||||
/* Whether the MTE asynchronous mode is enabled. */
|
||||
DECLARE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
|
||||
|
||||
static inline bool system_uses_mte_async_or_asymm_mode(void)
|
||||
{
|
||||
return static_branch_unlikely(&mte_async_or_asymm_mode);
|
||||
}
|
||||
|
||||
#else /* CONFIG_KASAN_HW_TAGS */
|
||||
|
||||
static inline bool system_uses_mte_async_or_asymm_mode(void)
|
||||
{
|
||||
return false;
|
||||
}
|
||||
|
||||
#endif /* CONFIG_KASAN_HW_TAGS */
|
||||
|
||||
#ifdef CONFIG_ARM64_MTE
|
||||
|
||||
/*
|
||||
* The Tag Check Flag (TCF) mode for MTE is per EL, hence TCF0
|
||||
* affects EL0 and TCF affects EL1 irrespective of which TTBR is
|
||||
* used.
|
||||
* The kernel accesses TTBR0 usually with LDTR/STTR instructions
|
||||
* when UAO is available, so these would act as EL0 accesses using
|
||||
* TCF0.
|
||||
* However futex.h code uses exclusives which would be executed as
|
||||
* EL1, this can potentially cause a tag check fault even if the
|
||||
* user disables TCF0.
|
||||
*
|
||||
* To address the problem we set the PSTATE.TCO bit in uaccess_enable()
|
||||
* and reset it in uaccess_disable().
|
||||
*
|
||||
* The Tag check override (TCO) bit disables temporarily the tag checking
|
||||
* preventing the issue.
|
||||
*/
|
||||
static inline void mte_disable_tco(void)
|
||||
{
|
||||
asm volatile(ALTERNATIVE("nop", SET_PSTATE_TCO(0),
|
||||
ARM64_MTE, CONFIG_KASAN_HW_TAGS));
|
||||
}
|
||||
|
||||
static inline void mte_enable_tco(void)
|
||||
{
|
||||
asm volatile(ALTERNATIVE("nop", SET_PSTATE_TCO(1),
|
||||
ARM64_MTE, CONFIG_KASAN_HW_TAGS));
|
||||
}
|
||||
|
||||
/*
|
||||
* These functions disable tag checking only if in MTE async mode
|
||||
* since the sync mode generates exceptions synchronously and the
|
||||
* nofault or load_unaligned_zeropad can handle them.
|
||||
*/
|
||||
static inline void __mte_disable_tco_async(void)
|
||||
{
|
||||
if (system_uses_mte_async_or_asymm_mode())
|
||||
mte_disable_tco();
|
||||
}
|
||||
|
||||
static inline void __mte_enable_tco_async(void)
|
||||
{
|
||||
if (system_uses_mte_async_or_asymm_mode())
|
||||
mte_enable_tco();
|
||||
}
|
||||
|
||||
/*
|
||||
* These functions are meant to be only used from KASAN runtime through
|
||||
* the arch_*() interface defined in asm/memory.h.
|
||||
|
@ -138,6 +203,22 @@ void mte_enable_kernel_asymm(void);
|
|||
|
||||
#else /* CONFIG_ARM64_MTE */
|
||||
|
||||
static inline void mte_disable_tco(void)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void mte_enable_tco(void)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void __mte_disable_tco_async(void)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void __mte_enable_tco_async(void)
|
||||
{
|
||||
}
|
||||
|
||||
static inline u8 mte_get_ptr_tag(void *ptr)
|
||||
{
|
||||
return 0xFF;
|
||||
|
|
|
@ -178,14 +178,6 @@ static inline void mte_disable_tco_entry(struct task_struct *task)
|
|||
}
|
||||
|
||||
#ifdef CONFIG_KASAN_HW_TAGS
|
||||
/* Whether the MTE asynchronous mode is enabled. */
|
||||
DECLARE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
|
||||
|
||||
static inline bool system_uses_mte_async_or_asymm_mode(void)
|
||||
{
|
||||
return static_branch_unlikely(&mte_async_or_asymm_mode);
|
||||
}
|
||||
|
||||
void mte_check_tfsr_el1(void);
|
||||
|
||||
static inline void mte_check_tfsr_entry(void)
|
||||
|
@ -212,10 +204,6 @@ static inline void mte_check_tfsr_exit(void)
|
|||
mte_check_tfsr_el1();
|
||||
}
|
||||
#else
|
||||
static inline bool system_uses_mte_async_or_asymm_mode(void)
|
||||
{
|
||||
return false;
|
||||
}
|
||||
static inline void mte_check_tfsr_el1(void)
|
||||
{
|
||||
}
|
||||
|
|
|
@ -57,7 +57,7 @@ static inline bool arch_thp_swp_supported(void)
|
|||
* fault on one CPU which has been handled concurrently by another CPU
|
||||
* does not need to perform additional invalidation.
|
||||
*/
|
||||
#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
|
||||
#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
|
||||
|
||||
/*
|
||||
* ZERO_PAGE is a global shared page that is always zero: used
|
||||
|
|
|
@ -10,7 +10,7 @@
|
|||
/*
|
||||
* Section size must be at least 512MB for 64K base
|
||||
* page size config. Otherwise it will be less than
|
||||
* (MAX_ORDER - 1) and the build process will fail.
|
||||
* MAX_ORDER and the build process will fail.
|
||||
*/
|
||||
#ifdef CONFIG_ARM64_64K_PAGES
|
||||
#define SECTION_SIZE_BITS 29
|
||||
|
|
|
@ -136,55 +136,9 @@ static inline void __uaccess_enable_hw_pan(void)
|
|||
CONFIG_ARM64_PAN));
|
||||
}
|
||||
|
||||
/*
|
||||
* The Tag Check Flag (TCF) mode for MTE is per EL, hence TCF0
|
||||
* affects EL0 and TCF affects EL1 irrespective of which TTBR is
|
||||
* used.
|
||||
* The kernel accesses TTBR0 usually with LDTR/STTR instructions
|
||||
* when UAO is available, so these would act as EL0 accesses using
|
||||
* TCF0.
|
||||
* However futex.h code uses exclusives which would be executed as
|
||||
* EL1, this can potentially cause a tag check fault even if the
|
||||
* user disables TCF0.
|
||||
*
|
||||
* To address the problem we set the PSTATE.TCO bit in uaccess_enable()
|
||||
* and reset it in uaccess_disable().
|
||||
*
|
||||
* The Tag check override (TCO) bit disables temporarily the tag checking
|
||||
* preventing the issue.
|
||||
*/
|
||||
static inline void __uaccess_disable_tco(void)
|
||||
{
|
||||
asm volatile(ALTERNATIVE("nop", SET_PSTATE_TCO(0),
|
||||
ARM64_MTE, CONFIG_KASAN_HW_TAGS));
|
||||
}
|
||||
|
||||
static inline void __uaccess_enable_tco(void)
|
||||
{
|
||||
asm volatile(ALTERNATIVE("nop", SET_PSTATE_TCO(1),
|
||||
ARM64_MTE, CONFIG_KASAN_HW_TAGS));
|
||||
}
|
||||
|
||||
/*
|
||||
* These functions disable tag checking only if in MTE async mode
|
||||
* since the sync mode generates exceptions synchronously and the
|
||||
* nofault or load_unaligned_zeropad can handle them.
|
||||
*/
|
||||
static inline void __uaccess_disable_tco_async(void)
|
||||
{
|
||||
if (system_uses_mte_async_or_asymm_mode())
|
||||
__uaccess_disable_tco();
|
||||
}
|
||||
|
||||
static inline void __uaccess_enable_tco_async(void)
|
||||
{
|
||||
if (system_uses_mte_async_or_asymm_mode())
|
||||
__uaccess_enable_tco();
|
||||
}
|
||||
|
||||
static inline void uaccess_disable_privileged(void)
|
||||
{
|
||||
__uaccess_disable_tco();
|
||||
mte_disable_tco();
|
||||
|
||||
if (uaccess_ttbr0_disable())
|
||||
return;
|
||||
|
@ -194,7 +148,7 @@ static inline void uaccess_disable_privileged(void)
|
|||
|
||||
static inline void uaccess_enable_privileged(void)
|
||||
{
|
||||
__uaccess_enable_tco();
|
||||
mte_enable_tco();
|
||||
|
||||
if (uaccess_ttbr0_enable())
|
||||
return;
|
||||
|
@ -302,8 +256,8 @@ do { \
|
|||
#define get_user __get_user
|
||||
|
||||
/*
|
||||
* We must not call into the scheduler between __uaccess_enable_tco_async() and
|
||||
* __uaccess_disable_tco_async(). As `dst` and `src` may contain blocking
|
||||
* We must not call into the scheduler between __mte_enable_tco_async() and
|
||||
* __mte_disable_tco_async(). As `dst` and `src` may contain blocking
|
||||
* functions, we must evaluate these outside of the critical section.
|
||||
*/
|
||||
#define __get_kernel_nofault(dst, src, type, err_label) \
|
||||
|
@ -312,10 +266,10 @@ do { \
|
|||
__typeof__(src) __gkn_src = (src); \
|
||||
int __gkn_err = 0; \
|
||||
\
|
||||
__uaccess_enable_tco_async(); \
|
||||
__mte_enable_tco_async(); \
|
||||
__raw_get_mem("ldr", *((type *)(__gkn_dst)), \
|
||||
(__force type *)(__gkn_src), __gkn_err, K); \
|
||||
__uaccess_disable_tco_async(); \
|
||||
__mte_disable_tco_async(); \
|
||||
\
|
||||
if (unlikely(__gkn_err)) \
|
||||
goto err_label; \
|
||||
|
@ -388,8 +342,8 @@ do { \
|
|||
#define put_user __put_user
|
||||
|
||||
/*
|
||||
* We must not call into the scheduler between __uaccess_enable_tco_async() and
|
||||
* __uaccess_disable_tco_async(). As `dst` and `src` may contain blocking
|
||||
* We must not call into the scheduler between __mte_enable_tco_async() and
|
||||
* __mte_disable_tco_async(). As `dst` and `src` may contain blocking
|
||||
* functions, we must evaluate these outside of the critical section.
|
||||
*/
|
||||
#define __put_kernel_nofault(dst, src, type, err_label) \
|
||||
|
@ -398,10 +352,10 @@ do { \
|
|||
__typeof__(src) __pkn_src = (src); \
|
||||
int __pkn_err = 0; \
|
||||
\
|
||||
__uaccess_enable_tco_async(); \
|
||||
__mte_enable_tco_async(); \
|
||||
__raw_put_mem("str", *((type *)(__pkn_src)), \
|
||||
(__force type *)(__pkn_dst), __pkn_err, K); \
|
||||
__uaccess_disable_tco_async(); \
|
||||
__mte_disable_tco_async(); \
|
||||
\
|
||||
if (unlikely(__pkn_err)) \
|
||||
goto err_label; \
|
||||
|
|
|
@ -55,7 +55,7 @@ static inline unsigned long load_unaligned_zeropad(const void *addr)
|
|||
{
|
||||
unsigned long ret;
|
||||
|
||||
__uaccess_enable_tco_async();
|
||||
__mte_enable_tco_async();
|
||||
|
||||
/* Load word from unaligned pointer addr */
|
||||
asm(
|
||||
|
@ -65,7 +65,7 @@ static inline unsigned long load_unaligned_zeropad(const void *addr)
|
|||
: "=&r" (ret)
|
||||
: "r" (addr), "Q" (*(unsigned long *)addr));
|
||||
|
||||
__uaccess_disable_tco_async();
|
||||
__mte_disable_tco_async();
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
|
|
@ -16,7 +16,7 @@ struct hyp_pool {
|
|||
* API at EL2.
|
||||
*/
|
||||
hyp_spinlock_t lock;
|
||||
struct list_head free_area[MAX_ORDER];
|
||||
struct list_head free_area[MAX_ORDER + 1];
|
||||
phys_addr_t range_start;
|
||||
phys_addr_t range_end;
|
||||
unsigned short max_order;
|
||||
|
|
|
@ -110,7 +110,7 @@ static void __hyp_attach_page(struct hyp_pool *pool,
|
|||
* after coalescing, so make sure to mark it HYP_NO_ORDER proactively.
|
||||
*/
|
||||
p->order = HYP_NO_ORDER;
|
||||
for (; (order + 1) < pool->max_order; order++) {
|
||||
for (; (order + 1) <= pool->max_order; order++) {
|
||||
buddy = __find_buddy_avail(pool, p, order);
|
||||
if (!buddy)
|
||||
break;
|
||||
|
@ -203,9 +203,9 @@ void *hyp_alloc_pages(struct hyp_pool *pool, unsigned short order)
|
|||
hyp_spin_lock(&pool->lock);
|
||||
|
||||
/* Look for a high-enough-order page */
|
||||
while (i < pool->max_order && list_empty(&pool->free_area[i]))
|
||||
while (i <= pool->max_order && list_empty(&pool->free_area[i]))
|
||||
i++;
|
||||
if (i >= pool->max_order) {
|
||||
if (i > pool->max_order) {
|
||||
hyp_spin_unlock(&pool->lock);
|
||||
return NULL;
|
||||
}
|
||||
|
@ -228,8 +228,8 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
|
|||
int i;
|
||||
|
||||
hyp_spin_lock_init(&pool->lock);
|
||||
pool->max_order = min(MAX_ORDER, get_order((nr_pages + 1) << PAGE_SHIFT));
|
||||
for (i = 0; i < pool->max_order; i++)
|
||||
pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
|
||||
for (i = 0; i <= pool->max_order; i++)
|
||||
INIT_LIST_HEAD(&pool->free_area[i]);
|
||||
pool->range_start = phys;
|
||||
pool->range_end = phys + (nr_pages << PAGE_SHIFT);
|
||||
|
|
|
@ -535,6 +535,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
|
|||
unsigned long vm_flags;
|
||||
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
|
||||
unsigned long addr = untagged_addr(far);
|
||||
#ifdef CONFIG_PER_VMA_LOCK
|
||||
struct vm_area_struct *vma;
|
||||
#endif
|
||||
|
||||
if (kprobe_page_fault(regs, esr))
|
||||
return 0;
|
||||
|
@ -585,6 +588,36 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
|
|||
|
||||
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
|
||||
|
||||
#ifdef CONFIG_PER_VMA_LOCK
|
||||
if (!(mm_flags & FAULT_FLAG_USER))
|
||||
goto lock_mmap;
|
||||
|
||||
vma = lock_vma_under_rcu(mm, addr);
|
||||
if (!vma)
|
||||
goto lock_mmap;
|
||||
|
||||
if (!(vma->vm_flags & vm_flags)) {
|
||||
vma_end_read(vma);
|
||||
goto lock_mmap;
|
||||
}
|
||||
fault = handle_mm_fault(vma, addr & PAGE_MASK,
|
||||
mm_flags | FAULT_FLAG_VMA_LOCK, regs);
|
||||
vma_end_read(vma);
|
||||
|
||||
if (!(fault & VM_FAULT_RETRY)) {
|
||||
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
|
||||
goto done;
|
||||
}
|
||||
count_vm_vma_lock_event(VMA_LOCK_RETRY);
|
||||
|
||||
/* Quick path to respond to signals */
|
||||
if (fault_signal_pending(fault, regs)) {
|
||||
if (!user_mode(regs))
|
||||
goto no_context;
|
||||
return 0;
|
||||
}
|
||||
lock_mmap:
|
||||
#endif /* CONFIG_PER_VMA_LOCK */
|
||||
/*
|
||||
* As per x86, we may deadlock here. However, since the kernel only
|
||||
* validly references user space from well defined areas of the code,
|
||||
|
@ -628,6 +661,9 @@ retry:
|
|||
}
|
||||
mmap_read_unlock(mm);
|
||||
|
||||
#ifdef CONFIG_PER_VMA_LOCK
|
||||
done:
|
||||
#endif
|
||||
/*
|
||||
* Handle the "normal" (no error) case first.
|
||||
*/
|
||||
|
|
|
@ -332,10 +332,6 @@ config HIGHMEM
|
|||
select KMAP_LOCAL
|
||||
default y
|
||||
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int "Maximum zone order"
|
||||
default "11"
|
||||
|
||||
config DRAM_BASE
|
||||
hex "DRAM start addr (the same with memory-section in dts)"
|
||||
default 0x0
|
||||
|
|
|
@ -203,10 +203,9 @@ config IA64_CYCLONE
|
|||
If you're unsure, answer N.
|
||||
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int "MAX_ORDER (11 - 17)" if !HUGETLB_PAGE
|
||||
range 11 17 if !HUGETLB_PAGE
|
||||
default "17" if HUGETLB_PAGE
|
||||
default "11"
|
||||
int
|
||||
default "16" if HUGETLB_PAGE
|
||||
default "10"
|
||||
|
||||
config SMP
|
||||
bool "Symmetric multi-processing support"
|
||||
|
|
|
@ -12,9 +12,9 @@
|
|||
#define SECTION_SIZE_BITS (30)
|
||||
#define MAX_PHYSMEM_BITS (50)
|
||||
#ifdef CONFIG_ARCH_FORCE_MAX_ORDER
|
||||
#if ((CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS)
|
||||
#if (CONFIG_ARCH_FORCE_MAX_ORDER + PAGE_SHIFT > SECTION_SIZE_BITS)
|
||||
#undef SECTION_SIZE_BITS
|
||||
#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER - 1 + PAGE_SHIFT)
|
||||
#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER + PAGE_SHIFT)
|
||||
#endif
|
||||
#endif
|
||||
|
||||
|
|
|
@ -170,7 +170,7 @@ static int __init hugetlb_setup_sz(char *str)
|
|||
size = memparse(str, &str);
|
||||
if (*str || !is_power_of_2(size) || !(tr_pages & size) ||
|
||||
size <= PAGE_SIZE ||
|
||||
size >= (1UL << PAGE_SHIFT << MAX_ORDER)) {
|
||||
size > (1UL << PAGE_SHIFT << MAX_ORDER)) {
|
||||
printk(KERN_WARNING "Invalid huge page size specified\n");
|
||||
return 1;
|
||||
}
|
||||
|
|
|
@ -53,8 +53,8 @@ config LOONGARCH
|
|||
select ARCH_USE_QUEUED_RWLOCKS
|
||||
select ARCH_USE_QUEUED_SPINLOCKS
|
||||
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
|
||||
select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
|
||||
select ARCH_WANT_LD_ORPHAN_WARN
|
||||
select ARCH_WANT_OPTIMIZE_VMEMMAP
|
||||
select ARCH_WANTS_NO_INSTR
|
||||
select BUILDTIME_TABLE_SORT
|
||||
select COMMON_CLK
|
||||
|
@ -421,12 +421,9 @@ config NODES_SHIFT
|
|||
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int "Maximum zone order"
|
||||
range 14 64 if PAGE_SIZE_64KB
|
||||
default "14" if PAGE_SIZE_64KB
|
||||
range 12 64 if PAGE_SIZE_16KB
|
||||
default "12" if PAGE_SIZE_16KB
|
||||
range 11 64
|
||||
default "11"
|
||||
default "13" if PAGE_SIZE_64KB
|
||||
default "11" if PAGE_SIZE_16KB
|
||||
default "10"
|
||||
help
|
||||
The kernel memory allocator divides physically contiguous memory
|
||||
blocks into "zones", where each zone is a power of two number of
|
||||
|
@ -435,9 +432,6 @@ config ARCH_FORCE_MAX_ORDER
|
|||
blocks of physically contiguous memory, then you may need to
|
||||
increase this value.
|
||||
|
||||
This config option is actually maximum order plus one. For example,
|
||||
a value of 11 means that the largest free memory block is 2^10 pages.
|
||||
|
||||
The page size is not necessarily 4KB. Keep this in mind
|
||||
when choosing a value for this option.
|
||||
|
||||
|
|
|
@ -397,23 +397,22 @@ config SINGLE_MEMORY_CHUNK
|
|||
Say N if not sure.
|
||||
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int "Maximum zone order" if ADVANCED
|
||||
int "Order of maximal physically contiguous allocations" if ADVANCED
|
||||
depends on !SINGLE_MEMORY_CHUNK
|
||||
default "11"
|
||||
default "10"
|
||||
help
|
||||
The kernel memory allocator divides physically contiguous memory
|
||||
blocks into "zones", where each zone is a power of two number of
|
||||
pages. This option selects the largest power of two that the kernel
|
||||
keeps in the memory allocator. If you need to allocate very large
|
||||
blocks of physically contiguous memory, then you may need to
|
||||
increase this value.
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
large blocks of physically contiguous memory is required.
|
||||
|
||||
For systems that have holes in their physical address space this
|
||||
value also defines the minimal size of the hole that allows
|
||||
freeing unused memory map.
|
||||
|
||||
This config option is actually maximum order plus one. For example,
|
||||
a value of 11 means that the largest free memory block is 2^10 pages.
|
||||
Don't change if unsure.
|
||||
|
||||
config 060_WRITETHROUGH
|
||||
bool "Use write-through caching for 68060 supervisor accesses"
|
||||
|
|
|
@ -46,7 +46,7 @@
|
|||
#define _CACHEMASK040 (~0x060)
|
||||
#define _PAGE_GLOBAL040 0x400 /* 68040 global bit, used for kva descs */
|
||||
|
||||
/* We borrow bit 24 to store the exclusive marker in swap PTEs. */
|
||||
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE CF_PAGE_NOCACHE
|
||||
|
||||
/*
|
||||
|
|
|
@@ -2099,14 +2099,10 @@ endchoice

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 14 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
default "14" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
range 13 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB
default "13" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB
range 12 64 if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB
default "12" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB
range 0 64
default "11"
default "13" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_64KB
default "12" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_32KB
default "11" if MIPS_HUGE_TLB_SUPPORT && PAGE_SIZE_16KB
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of

@@ -2115,9 +2111,6 @@ config ARCH_FORCE_MAX_ORDER
blocks of physically contiguous memory, then you may need to
increase this value.

This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.

The page size is not necessarily 4KB. Keep this in mind
when choosing a value for this option.

@@ -70,7 +70,7 @@ enum fixed_addresses {
#include <asm-generic/fixmap.h>

/*
* Called from pgtable_init()
* Called from pagetable_init()
*/
extern void fixrange_init(unsigned long start, unsigned long end,
pgd_t *pgd_base);

@@ -469,7 +469,8 @@ static inline pgprot_t pgprot_writecombine(pgprot_t _prot)
}

static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
unsigned long address)
unsigned long address,
pte_t *ptep)
{
}

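The hunk above (and the s390 one further down) extends the spurious-fault hook with a pte_t pointer. A small stand-alone sketch of why the extra argument is useful, with invented types and bit names:

#include <stdio.h>

typedef unsigned long pte_t;
#define _PAGE_PROTECT (1UL << 1)	/* stand-in protection bit */

/* default implementation: nothing to do */
static void fixup_noop(unsigned long address, pte_t *ptep)
{
	(void)address; (void)ptep;
}

/* an override can now inspect the entry before deciding to flush */
static void fixup_checked(unsigned long address, pte_t *ptep)
{
	if (!(*ptep & _PAGE_PROTECT))
		printf("local flush for %#lx\n", address);
}

int main(void)
{
	pte_t pte = 0;

	fixup_noop(0x1000, &pte);
	fixup_checked(0x2000, &pte);	/* flushes: _PAGE_PROTECT clear */
	pte |= _PAGE_PROTECT;
	fixup_checked(0x3000, &pte);	/* skipped */
	return 0;
}
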
@@ -45,19 +45,17 @@ menu "Kernel features"
source "kernel/Kconfig.hz"

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 9 20
default "11"
int "Order of maximal physically contiguous allocations"
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
increase this value.
The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.

This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
Don't change if unsure.

endmenu

@@ -267,6 +267,7 @@ config PPC
select MMU_GATHER_PAGE_SIZE
select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
select MMU_LAZY_TLB_SHOOTDOWN if PPC_BOOK3S_64
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE if PPC64 || NOT_COHERENT_CACHE
select NEED_PER_CPU_EMBED_FIRST_CHUNK if PPC64
@@ -896,34 +897,27 @@ config DATA_SHIFT
8M pages will be pinned.

config ARCH_FORCE_MAX_ORDER
int "Maximum zone order"
range 8 9 if PPC64 && PPC_64K_PAGES
default "9" if PPC64 && PPC_64K_PAGES
range 13 13 if PPC64 && !PPC_64K_PAGES
default "13" if PPC64 && !PPC_64K_PAGES
range 9 64 if PPC32 && PPC_16K_PAGES
default "9" if PPC32 && PPC_16K_PAGES
range 7 64 if PPC32 && PPC_64K_PAGES
default "7" if PPC32 && PPC_64K_PAGES
range 5 64 if PPC32 && PPC_256K_PAGES
default "5" if PPC32 && PPC_256K_PAGES
range 11 64
default "11"
int "Order of maximal physically contiguous allocations"
default "8" if PPC64 && PPC_64K_PAGES
default "12" if PPC64 && !PPC_64K_PAGES
default "8" if PPC32 && PPC_16K_PAGES
default "6" if PPC32 && PPC_64K_PAGES
default "4" if PPC32 && PPC_256K_PAGES
default "10"
help
The kernel memory allocator divides physically contiguous memory
blocks into "zones", where each zone is a power of two number of
pages. This option selects the largest power of two that the kernel
keeps in the memory allocator. If you need to allocate very large
blocks of physically contiguous memory, then you may need to
increase this value.

This config option is actually maximum order plus one. For example,
a value of 11 means that the largest free memory block is 2^10 pages.
The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it
defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required.

The page size is not necessarily 4KB. For example, on 64-bit
systems, 64KB pages can be enabled via CONFIG_PPC_64K_PAGES. Keep
this in mind when choosing a value for this option.

Don't change if unsure.

config PPC_SUBPAGE_PROT
bool "Support setting protections for 4k subpages (subpage_prot syscall)"
default n

@@ -30,7 +30,7 @@ CONFIG_PREEMPT=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_BINFMT_MISC=m
CONFIG_MATH_EMULATION=y
CONFIG_ARCH_FORCE_MAX_ORDER=17
CONFIG_ARCH_FORCE_MAX_ORDER=16
CONFIG_PCI=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCI_MSI=y

@@ -41,7 +41,7 @@ CONFIG_FIXED_PHY=y
CONFIG_FONT_8x16=y
CONFIG_FONT_8x8=y
CONFIG_FONTS=y
CONFIG_ARCH_FORCE_MAX_ORDER=13
CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAME_WARN=1024
CONFIG_FTL=y

@@ -121,7 +121,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,

#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
unsigned long address)
unsigned long address,
pte_t *ptep)
{
/*
* Book3S 64 does not require spurious fault flushes because the PTE

@@ -1611,7 +1611,7 @@ void start_secondary(void *unused)
if (IS_ENABLED(CONFIG_PPC32))
setup_kup();

mmgrab(&init_mm);
mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;

smp_store_cpu_info(cpu);

@@ -97,7 +97,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
}

mmap_read_lock(mm);
chunk = (1UL << (PAGE_SHIFT + MAX_ORDER - 1)) /
chunk = (1UL << (PAGE_SHIFT + MAX_ORDER)) /
sizeof(struct vm_area_struct *);
chunk = min(chunk, entries);
for (entry = 0; entry < entries; entry += chunk) {

@@ -797,10 +797,10 @@ void exit_lazy_flush_tlb(struct mm_struct *mm, bool always_flush)
if (current->active_mm == mm) {
WARN_ON_ONCE(current->mm != NULL);
/* Is a kernel thread and is using mm as the lazy tlb */
mmgrab(&init_mm);
mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;
switch_mm_irqs_off(mm, &init_mm, current);
mmdrop(mm);
mmdrop_lazy_tlb(mm);
}

/*

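A toy model of the mmgrab()/mmgrab_lazy_tlb() substitution seen in the powerpc hunks above, under the assumption that the lazy-TLB reference can become a no-op when shootdown is used instead of refcounting; all names and types here are invented for illustration:

#include <stdio.h>
#include <stdbool.h>

struct mm { int refs; };

static const bool lazy_tlb_refcounted = false;	/* assume shootdown mode */

static void mmgrab(struct mm *mm) { mm->refs++; }
static void mmdrop(struct mm *mm) { if (--mm->refs == 0) printf("mm freed\n"); }
static void mmgrab_lazy_tlb(struct mm *mm) { if (lazy_tlb_refcounted) mmgrab(mm); }
static void mmdrop_lazy_tlb(struct mm *mm) { if (lazy_tlb_refcounted) mmdrop(mm); }

int main(void)
{
	struct mm init_mm = { .refs = 1 }, user_mm = { .refs = 1 };

	/* a kernel thread switches from a lazily used user_mm back to init_mm */
	mmgrab_lazy_tlb(&init_mm);
	mmdrop_lazy_tlb(&user_mm);
	printf("init_mm refs=%d, user_mm refs=%d\n", init_mm.refs, user_mm.refs);
	return 0;
}
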
@@ -474,6 +474,40 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;

#ifdef CONFIG_PER_VMA_LOCK
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;

vma = lock_vma_under_rcu(mm, address);
if (!vma)
goto lock_mmap;

if (unlikely(access_pkey_error(is_write, is_exec,
(error_code & DSISR_KEYFAULT), vma))) {
vma_end_read(vma);
goto lock_mmap;
}

if (unlikely(access_error(is_write, is_exec, vma))) {
vma_end_read(vma);
goto lock_mmap;
}

fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
vma_end_read(vma);

if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
goto done;
}
count_vm_vma_lock_event(VMA_LOCK_RETRY);

if (fault_signal_pending(fault, regs))
return user_mode(regs) ? 0 : SIGBUS;

lock_mmap:
#endif /* CONFIG_PER_VMA_LOCK */

/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an

@@ -550,6 +584,9 @@ retry:

mmap_read_unlock(current->mm);

#ifdef CONFIG_PER_VMA_LOCK
done:
#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);

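Distilling the structure of the block added above: try the per-VMA read lock first, fall back to the mmap_lock path when the lock cannot be taken or the fast path must bail. A toy, runnable illustration with invented types, not the kernel API:

#include <stdio.h>
#include <stdbool.h>

struct vma { bool contended; bool writable; };

static bool lock_vma_under_rcu(struct vma *v) { return !v->contended; }
static void vma_end_read(struct vma *v) { (void)v; }

static int handle_fault(struct vma *v, bool write)
{
	if (lock_vma_under_rcu(v)) {
		if (write && !v->writable) {
			vma_end_read(v);
			goto lock_mmap;		/* access check failed */
		}
		vma_end_read(v);
		return printf("handled under per-VMA lock\n");
	}
lock_mmap:
	return printf("fell back to mmap_lock\n");
}

int main(void)
{
	struct vma a = { .contended = false, .writable = true };
	struct vma b = { .contended = true,  .writable = true };

	handle_fault(&a, true);
	handle_fault(&b, false);
	return 0;
}
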
@@ -615,7 +615,7 @@ void __init gigantic_hugetlb_cma_reserve(void)
order = mmu_psize_to_shift(MMU_PAGE_16G) - PAGE_SHIFT;

if (order) {
VM_WARN_ON(order < MAX_ORDER);
VM_WARN_ON(order <= MAX_ORDER);
hugetlb_cma_reserve(order);
}
}

@@ -16,6 +16,7 @@ config PPC_POWERNV
select PPC_DOORBELL
select MMU_NOTIFIER
select FORCE_SMP
select ARCH_SUPPORTS_PER_VMA_LOCK
default y

config OPAL_PRD

@@ -1740,7 +1740,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
* DMA window can be larger than available memory, which will
* cause errors later.
*/
const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER - 1);
const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER);

/*
* We create the default window as big as we can. The constraint is

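A quick sanity check that the old and new expressions for maxblock describe the same limit once the configured value drops by one, using example numbers loosely following the PPC64 defaults above (4K pages, 13 before, 12 after):

#include <stdio.h>

int main(void)
{
	unsigned int page_shift = 12;
	unsigned int old_max_order = 13;	/* old value: one past the largest order */
	unsigned int new_max_order = 12;	/* new value: the largest order itself */

	unsigned long long old_block = 1ULL << (page_shift + old_max_order - 1);
	unsigned long long new_block = 1ULL << (page_shift + new_max_order);

	printf("old=%llu new=%llu equal=%d\n", old_block, new_block,
	       old_block == new_block);
	return 0;
}
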
@@ -22,6 +22,7 @@ config PPC_PSERIES
select HOTPLUG_CPU
select FORCE_SMP
select SWIOTLB
select ARCH_SUPPORTS_PER_VMA_LOCK
default y

config PARAVIRT

@@ -120,13 +120,14 @@ config S390
select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_HUGETLBFS
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANTS_NO_INSTR
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WANT_OPTIMIZE_VMEMMAP
select BUILDTIME_TABLE_SORT
select CLONE_BACKWARDS2
select DMA_OPS if PCI

@@ -1239,7 +1239,8 @@ static inline int pte_allow_rdp(pte_t old, pte_t new)
}

static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
unsigned long address)
unsigned long address,
pte_t *ptep)
{
/*
* RDP might not have propagated the PTE protection reset to all CPUs,

@@ -1247,11 +1248,12 @@ static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
* NOTE: This will also be called when a racing pagetable update on
* another thread already installed the correct PTE. Both cases cannot
* really be distinguished.
* Therefore, only do the local TLB flush when RDP can be used, to avoid
* unnecessary overhead.
* Therefore, only do the local TLB flush when RDP can be used, and the
* PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
* A local RDP can be used to do the flush.
*/
if (MACHINE_HAS_RDP)
asm volatile("ptlb" : : : "memory");
if (MACHINE_HAS_RDP && !(pte_val(*ptep) & _PAGE_PROTECT))
__ptep_rdp(address, ptep, 0, 0, 1);
}
#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault

@@ -407,6 +407,30 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
access = VM_WRITE;
if (access == VM_WRITE)
flags |= FAULT_FLAG_WRITE;
#ifdef CONFIG_PER_VMA_LOCK
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
vma = lock_vma_under_rcu(mm, address);
if (!vma)
goto lock_mmap;
if (!(vma->vm_flags & access)) {
vma_end_read(vma);
goto lock_mmap;
}
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
goto out;
}
count_vm_vma_lock_event(VMA_LOCK_RETRY);
/* Quick path to respond to signals */
if (fault_signal_pending(fault, regs)) {
fault = VM_FAULT_SIGNAL;
goto out;
}
lock_mmap:
#endif /* CONFIG_PER_VMA_LOCK */
mmap_read_lock(mm);

gmap = NULL;

@@ -2591,6 +2591,13 @@ int gmap_mark_unmergeable(void)
int ret;
VMA_ITERATOR(vmi, mm, 0);

/*
* Make sure to disable KSM (if enabled for the whole process or
* individual VMAs). Note that nothing currently hinders user space
* from re-enabling it.
*/
clear_bit(MMF_VM_MERGE_ANY, &mm->flags);

for_each_vma(vmi, vma) {
/* Copy vm_flags to avoid partial modifications in ksm_madvise */
vm_flags = vma->vm_flags;

@@ -273,7 +273,7 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,

info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = max(PAGE_SIZE, mmap_min_addr);
info.low_limit = PAGE_SIZE;
info.high_limit = current->mm->mmap_base;
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;

@@ -136,7 +136,7 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp, unsigned long ad

info.flags = VM_UNMAPPED_AREA_TOPDOWN;
info.length = len;
info.low_limit = max(PAGE_SIZE, mmap_min_addr);
info.low_limit = PAGE_SIZE;
info.high_limit = mm->mmap_base;
if (filp || (flags & MAP_SHARED))
info.align_mask = MMAP_ALIGN_MASK << PAGE_SHIFT;

@@ -8,7 +8,7 @@ CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_BLK_DEV_BSG is not set
CONFIG_CPU_SUBTYPE_SH7724=y
CONFIG_ARCH_FORCE_MAX_ORDER=12
CONFIG_ARCH_FORCE_MAX_ORDER=11
CONFIG_MEMORY_SIZE=0x10000000
CONFIG_FLATMEM_MANUAL=y
CONFIG_SH_ECOVEC=y

@ -19,28 +19,24 @@ config PAGE_OFFSET
|
|||
default "0x00000000"
|
||||
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int "Maximum zone order"
|
||||
range 9 64 if PAGE_SIZE_16KB
|
||||
default "9" if PAGE_SIZE_16KB
|
||||
range 7 64 if PAGE_SIZE_64KB
|
||||
default "7" if PAGE_SIZE_64KB
|
||||
range 11 64
|
||||
default "14" if !MMU
|
||||
default "11"
|
||||
int "Order of maximal physically contiguous allocations"
|
||||
default "8" if PAGE_SIZE_16KB
|
||||
default "6" if PAGE_SIZE_64KB
|
||||
default "13" if !MMU
|
||||
default "10"
|
||||
help
|
||||
The kernel memory allocator divides physically contiguous memory
|
||||
blocks into "zones", where each zone is a power of two number of
|
||||
pages. This option selects the largest power of two that the kernel
|
||||
keeps in the memory allocator. If you need to allocate very large
|
||||
blocks of physically contiguous memory, then you may need to
|
||||
increase this value.
|
||||
|
||||
This config option is actually maximum order plus one. For example,
|
||||
a value of 11 means that the largest free memory block is 2^10 pages.
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
large blocks of physically contiguous memory is required.
|
||||
|
||||
The page size is not necessarily 4KB. Keep this in mind when
|
||||
choosing a value for this option.
|
||||
|
||||
Don't change if unsure.
|
||||
|
||||
config MEMORY_START
|
||||
hex "Physical memory start address"
|
||||
default "0x08000000"
|
||||
|
|
|
@ -271,18 +271,17 @@ config ARCH_SPARSEMEM_DEFAULT
|
|||
def_bool y if SPARC64
|
||||
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int "Maximum zone order"
|
||||
default "13"
|
||||
int "Order of maximal physically contiguous allocations"
|
||||
default "12"
|
||||
help
|
||||
The kernel memory allocator divides physically contiguous memory
|
||||
blocks into "zones", where each zone is a power of two number of
|
||||
pages. This option selects the largest power of two that the kernel
|
||||
keeps in the memory allocator. If you need to allocate very large
|
||||
blocks of physically contiguous memory, then you may need to
|
||||
increase this value.
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
large blocks of physically contiguous memory is required.
|
||||
|
||||
This config option is actually maximum order plus one. For example,
|
||||
a value of 13 means that the largest free memory block is 2^12 pages.
|
||||
Don't change if unsure.
|
||||
|
||||
if SPARC64 || COMPILE_TEST
|
||||
source "kernel/power/Kconfig"
|
||||
|
|
|
@ -357,6 +357,42 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot)
|
|||
*/
|
||||
#define pgprot_noncached pgprot_noncached
|
||||
|
||||
static inline unsigned long pte_dirty(pte_t pte)
|
||||
{
|
||||
unsigned long mask;
|
||||
|
||||
__asm__ __volatile__(
|
||||
"\n661: mov %1, %0\n"
|
||||
" nop\n"
|
||||
" .section .sun4v_2insn_patch, \"ax\"\n"
|
||||
" .word 661b\n"
|
||||
" sethi %%uhi(%2), %0\n"
|
||||
" sllx %0, 32, %0\n"
|
||||
" .previous\n"
|
||||
: "=r" (mask)
|
||||
: "i" (_PAGE_MODIFIED_4U), "i" (_PAGE_MODIFIED_4V));
|
||||
|
||||
return (pte_val(pte) & mask);
|
||||
}
|
||||
|
||||
static inline unsigned long pte_write(pte_t pte)
|
||||
{
|
||||
unsigned long mask;
|
||||
|
||||
__asm__ __volatile__(
|
||||
"\n661: mov %1, %0\n"
|
||||
" nop\n"
|
||||
" .section .sun4v_2insn_patch, \"ax\"\n"
|
||||
" .word 661b\n"
|
||||
" sethi %%uhi(%2), %0\n"
|
||||
" sllx %0, 32, %0\n"
|
||||
" .previous\n"
|
||||
: "=r" (mask)
|
||||
: "i" (_PAGE_WRITE_4U), "i" (_PAGE_WRITE_4V));
|
||||
|
||||
return (pte_val(pte) & mask);
|
||||
}
|
||||
|
||||
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
|
||||
pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);
|
||||
#define arch_make_huge_pte arch_make_huge_pte
|
||||
|
@ -418,28 +454,43 @@ static inline bool is_hugetlb_pte(pte_t pte)
|
|||
}
|
||||
#endif
|
||||
|
||||
static inline pte_t __pte_mkhwwrite(pte_t pte)
|
||||
{
|
||||
unsigned long val = pte_val(pte);
|
||||
|
||||
/*
|
||||
* Note: we only want to set the HW writable bit if the SW writable bit
|
||||
* and the SW dirty bit are set.
|
||||
*/
|
||||
__asm__ __volatile__(
|
||||
"\n661: or %0, %2, %0\n"
|
||||
" .section .sun4v_1insn_patch, \"ax\"\n"
|
||||
" .word 661b\n"
|
||||
" or %0, %3, %0\n"
|
||||
" .previous\n"
|
||||
: "=r" (val)
|
||||
: "0" (val), "i" (_PAGE_W_4U), "i" (_PAGE_W_4V));
|
||||
|
||||
return __pte(val);
|
||||
}
|
||||
|
||||
static inline pte_t pte_mkdirty(pte_t pte)
|
||||
{
|
||||
unsigned long val = pte_val(pte), tmp;
|
||||
unsigned long val = pte_val(pte), mask;
|
||||
|
||||
__asm__ __volatile__(
|
||||
"\n661: or %0, %3, %0\n"
|
||||
" nop\n"
|
||||
"\n662: nop\n"
|
||||
"\n661: mov %1, %0\n"
|
||||
" nop\n"
|
||||
" .section .sun4v_2insn_patch, \"ax\"\n"
|
||||
" .word 661b\n"
|
||||
" sethi %%uhi(%4), %1\n"
|
||||
" sllx %1, 32, %1\n"
|
||||
" .word 662b\n"
|
||||
" or %1, %%lo(%4), %1\n"
|
||||
" or %0, %1, %0\n"
|
||||
" sethi %%uhi(%2), %0\n"
|
||||
" sllx %0, 32, %0\n"
|
||||
" .previous\n"
|
||||
: "=r" (val), "=r" (tmp)
|
||||
: "0" (val), "i" (_PAGE_MODIFIED_4U | _PAGE_W_4U),
|
||||
"i" (_PAGE_MODIFIED_4V | _PAGE_W_4V));
|
||||
: "=r" (mask)
|
||||
: "i" (_PAGE_MODIFIED_4U), "i" (_PAGE_MODIFIED_4V));
|
||||
|
||||
return __pte(val);
|
||||
pte = __pte(val | mask);
|
||||
return pte_write(pte) ? __pte_mkhwwrite(pte) : pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_mkclean(pte_t pte)
|
||||
|
@ -481,7 +532,8 @@ static inline pte_t pte_mkwrite(pte_t pte)
|
|||
: "=r" (mask)
|
||||
: "i" (_PAGE_WRITE_4U), "i" (_PAGE_WRITE_4V));
|
||||
|
||||
return __pte(val | mask);
|
||||
pte = __pte(val | mask);
|
||||
return pte_dirty(pte) ? __pte_mkhwwrite(pte) : pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_wrprotect(pte_t pte)
|
||||
|
@ -584,42 +636,6 @@ static inline unsigned long pte_young(pte_t pte)
|
|||
return (pte_val(pte) & mask);
|
||||
}
|
||||
|
||||
static inline unsigned long pte_dirty(pte_t pte)
|
||||
{
|
||||
unsigned long mask;
|
||||
|
||||
__asm__ __volatile__(
|
||||
"\n661: mov %1, %0\n"
|
||||
" nop\n"
|
||||
" .section .sun4v_2insn_patch, \"ax\"\n"
|
||||
" .word 661b\n"
|
||||
" sethi %%uhi(%2), %0\n"
|
||||
" sllx %0, 32, %0\n"
|
||||
" .previous\n"
|
||||
: "=r" (mask)
|
||||
: "i" (_PAGE_MODIFIED_4U), "i" (_PAGE_MODIFIED_4V));
|
||||
|
||||
return (pte_val(pte) & mask);
|
||||
}
|
||||
|
||||
static inline unsigned long pte_write(pte_t pte)
|
||||
{
|
||||
unsigned long mask;
|
||||
|
||||
__asm__ __volatile__(
|
||||
"\n661: mov %1, %0\n"
|
||||
" nop\n"
|
||||
" .section .sun4v_2insn_patch, \"ax\"\n"
|
||||
" .word 661b\n"
|
||||
" sethi %%uhi(%2), %0\n"
|
||||
" sllx %0, 32, %0\n"
|
||||
" .previous\n"
|
||||
: "=r" (mask)
|
||||
: "i" (_PAGE_WRITE_4U), "i" (_PAGE_WRITE_4V));
|
||||
|
||||
return (pte_val(pte) & mask);
|
||||
}
|
||||
|
||||
static inline unsigned long pte_exec(pte_t pte)
|
||||
{
|
||||
unsigned long mask;
|
||||
|
|
|
@@ -193,7 +193,7 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,

size = IO_PAGE_ALIGN(size);
order = get_order(size);
if (unlikely(order >= MAX_ORDER))
if (unlikely(order > MAX_ORDER))
return NULL;

npages = size >> IO_PAGE_SHIFT;

@@ -897,7 +897,7 @@ void __init cheetah_ecache_flush_init(void)

/* Now allocate error trap reporting scoreboard. */
sz = NR_CPUS * (2 * sizeof(struct cheetah_err_info));
for (order = 0; order < MAX_ORDER; order++) {
for (order = 0; order <= MAX_ORDER; order++) {
if ((PAGE_SIZE << order) >= sz)
break;
}

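The loop bound above changes from '<' to '<=' because the maximum order is now itself a valid index. The same search, stand-alone, with example constants:

#include <stdio.h>

#define PAGE_SIZE 4096UL
#define MAX_ORDER 10

int main(void)
{
	unsigned long sz = 3UL * 1024 * 1024;	/* need room for 3 MiB */
	int order;

	for (order = 0; order <= MAX_ORDER; order++) {
		if ((PAGE_SIZE << order) >= sz)
			break;
	}
	printf("order %d gives %lu bytes\n", order, PAGE_SIZE << order);
	return 0;
}
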
@@ -402,8 +402,8 @@ void tsb_grow(struct mm_struct *mm, unsigned long tsb_index, unsigned long rss)
unsigned long new_rss_limit;
gfp_t gfp_flags;

if (max_tsb_size > (PAGE_SIZE << MAX_ORDER))
max_tsb_size = (PAGE_SIZE << MAX_ORDER);
if (max_tsb_size > PAGE_SIZE << MAX_ORDER)
max_tsb_size = PAGE_SIZE << MAX_ORDER;

new_cache_index = 0;
for (new_size = 8192; new_size < max_tsb_size; new_size <<= 1UL) {

@@ -27,6 +27,7 @@ config X86_64
# Options that are inherently 64-bit kernel only:
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA

@@ -125,8 +126,8 @@ config X86
select ARCH_WANTS_NO_INSTR
select ARCH_WANT_GENERAL_HUGETLB
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if X86_64
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANT_OPTIMIZE_VMEMMAP if X86_64
select ARCH_WANTS_THP_SWAP if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT

@@ -1097,7 +1097,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
}

#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)

#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot))

@ -15,24 +15,18 @@
|
|||
#endif
|
||||
|
||||
#define __HAVE_ARCH_MEMCPY 1
|
||||
#if defined(__SANITIZE_MEMORY__) && defined(__NO_FORTIFY)
|
||||
#undef memcpy
|
||||
#define memcpy __msan_memcpy
|
||||
#else
|
||||
extern void *memcpy(void *to, const void *from, size_t len);
|
||||
#endif
|
||||
extern void *__memcpy(void *to, const void *from, size_t len);
|
||||
|
||||
#define __HAVE_ARCH_MEMSET
|
||||
#if defined(__SANITIZE_MEMORY__) && defined(__NO_FORTIFY)
|
||||
extern void *__msan_memset(void *s, int c, size_t n);
|
||||
#undef memset
|
||||
#define memset __msan_memset
|
||||
#else
|
||||
void *memset(void *s, int c, size_t n);
|
||||
#endif
|
||||
void *__memset(void *s, int c, size_t n);
|
||||
|
||||
/*
|
||||
* KMSAN needs to instrument as much code as possible. Use C versions of
|
||||
* memsetXX() from lib/string.c under KMSAN.
|
||||
*/
|
||||
#if !defined(CONFIG_KMSAN)
|
||||
#define __HAVE_ARCH_MEMSET16
|
||||
static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
|
||||
{
|
||||
|
@ -68,15 +62,10 @@ static inline void *memset64(uint64_t *s, uint64_t v, size_t n)
|
|||
: "memory");
|
||||
return s;
|
||||
}
|
||||
#endif
|
||||
|
||||
#define __HAVE_ARCH_MEMMOVE
|
||||
#if defined(__SANITIZE_MEMORY__) && defined(__NO_FORTIFY)
|
||||
#undef memmove
|
||||
void *__msan_memmove(void *dest, const void *src, size_t len);
|
||||
#define memmove __msan_memmove
|
||||
#else
|
||||
void *memmove(void *dest, const void *src, size_t count);
|
||||
#endif
|
||||
void *__memmove(void *dest, const void *src, size_t count);
|
||||
|
||||
int memcmp(const void *cs, const void *ct, size_t count);
|
||||
|
|
|
@ -19,6 +19,7 @@
|
|||
#include <linux/uaccess.h> /* faulthandler_disabled() */
|
||||
#include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
|
||||
#include <linux/mm_types.h>
|
||||
#include <linux/mm.h> /* find_and_lock_vma() */
|
||||
|
||||
#include <asm/cpufeature.h> /* boot_cpu_has, ... */
|
||||
#include <asm/traps.h> /* dotraplinkage, ... */
|
||||
|
@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs,
|
|||
}
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_PER_VMA_LOCK
|
||||
if (!(flags & FAULT_FLAG_USER))
|
||||
goto lock_mmap;
|
||||
|
||||
vma = lock_vma_under_rcu(mm, address);
|
||||
if (!vma)
|
||||
goto lock_mmap;
|
||||
|
||||
if (unlikely(access_error(error_code, vma))) {
|
||||
vma_end_read(vma);
|
||||
goto lock_mmap;
|
||||
}
|
||||
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
|
||||
vma_end_read(vma);
|
||||
|
||||
if (!(fault & VM_FAULT_RETRY)) {
|
||||
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
|
||||
goto done;
|
||||
}
|
||||
count_vm_vma_lock_event(VMA_LOCK_RETRY);
|
||||
|
||||
/* Quick path to respond to signals */
|
||||
if (fault_signal_pending(fault, regs)) {
|
||||
if (!user_mode(regs))
|
||||
kernelmode_fixup_or_oops(regs, error_code, address,
|
||||
SIGBUS, BUS_ADRERR,
|
||||
ARCH_DEFAULT_PKEY);
|
||||
return;
|
||||
}
|
||||
lock_mmap:
|
||||
#endif /* CONFIG_PER_VMA_LOCK */
|
||||
|
||||
/*
|
||||
* Kernel-mode access to the user address space should only occur
|
||||
* on well-defined single instructions listed in the exception
|
||||
|
@ -1433,6 +1466,9 @@ good_area:
|
|||
}
|
||||
|
||||
mmap_read_unlock(mm);
|
||||
#ifdef CONFIG_PER_VMA_LOCK
|
||||
done:
|
||||
#endif
|
||||
if (likely(!(fault & VM_FAULT_ERROR)))
|
||||
return;
|
||||
|
||||
|
|
|
@@ -1073,11 +1073,15 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
}

/*
* untrack_pfn_moved is called, while mremapping a pfnmap for a new region,
* with the old vma after its pfnmap page table has been removed. The new
* vma has a new pfnmap to the same pfn & cache type with VM_PAT set.
* untrack_pfn_clear is called if the following situation fits:
*
* 1) while mremapping a pfnmap for a new region, with the old vma after
* its pfnmap page table has been removed. The new vma has a new pfnmap
* to the same pfn & cache type with VM_PAT set.
* 2) while duplicating vm area, the new vma fails to copy the pgtable from
* old vma.
*/
void untrack_pfn_moved(struct vm_area_struct *vma)
void untrack_pfn_clear(struct vm_area_struct *vma)
{
vm_flags_clear(vma, VM_PAT);
}

@ -772,18 +772,17 @@ config HIGHMEM
|
|||
If unsure, say Y.
|
||||
|
||||
config ARCH_FORCE_MAX_ORDER
|
||||
int "Maximum zone order"
|
||||
default "11"
|
||||
int "Order of maximal physically contiguous allocations"
|
||||
default "10"
|
||||
help
|
||||
The kernel memory allocator divides physically contiguous memory
|
||||
blocks into "zones", where each zone is a power of two number of
|
||||
pages. This option selects the largest power of two that the kernel
|
||||
keeps in the memory allocator. If you need to allocate very large
|
||||
blocks of physically contiguous memory, then you may need to
|
||||
increase this value.
|
||||
The kernel page allocator limits the size of maximal physically
|
||||
contiguous allocations. The limit is called MAX_ORDER and it
|
||||
defines the maximal power of two of number of pages that can be
|
||||
allocated as a single contiguous block. This option allows
|
||||
overriding the default setting when ability to allocate very
|
||||
large blocks of physically contiguous memory is required.
|
||||
|
||||
This config option is actually maximum order plus one. For example,
|
||||
a value of 11 means that the largest free memory block is 2^10 pages.
|
||||
Don't change if unsure.
|
||||
|
||||
endmenu
|
||||
|
||||
|
|
|
@@ -226,8 +226,8 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
if (*ppos < 0 || !count)
return -EINVAL;

if (count > (PAGE_SIZE << (MAX_ORDER - 1)))
count = PAGE_SIZE << (MAX_ORDER - 1);
if (count > (PAGE_SIZE << MAX_ORDER))
count = PAGE_SIZE << MAX_ORDER;

buf = kmalloc(count, GFP_KERNEL);
if (!buf)

@@ -373,8 +373,8 @@ static ssize_t regmap_reg_ranges_read_file(struct file *file,
if (*ppos < 0 || !count)
return -EINVAL;

if (count > (PAGE_SIZE << (MAX_ORDER - 1)))
count = PAGE_SIZE << (MAX_ORDER - 1);
if (count > (PAGE_SIZE << MAX_ORDER))
count = PAGE_SIZE << MAX_ORDER;

buf = kmalloc(count, GFP_KERNEL);
if (!buf)

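The clamp above in isolation, with example constants and malloc() standing in for kmalloc():

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096UL
#define MAX_ORDER 10

int main(void)
{
	size_t count = 64UL * 1024 * 1024;	/* oversized request */

	if (count > (PAGE_SIZE << MAX_ORDER))
		count = PAGE_SIZE << MAX_ORDER;

	char *buf = malloc(count);
	if (!buf)
		return 1;
	memset(buf, 0, count);
	printf("clamped to %zu bytes\n", count);
	free(buf);
	return 0;
}
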
@@ -3108,7 +3108,7 @@ loop:
ptr->resultcode = 0;

if (ptr->flags & (FD_RAW_READ | FD_RAW_WRITE)) {
if (ptr->length <= 0 || ptr->length >= MAX_LEN)
if (ptr->length <= 0 || ptr->length > MAX_LEN)
return -EINVAL;
ptr->kernel_data = (char *)fd_dma_mem_alloc(ptr->length);
fallback_on_nodma_alloc(&ptr->kernel_data, ptr->length);

@ -54,9 +54,8 @@ static size_t huge_class_size;
|
|||
static const struct block_device_operations zram_devops;
|
||||
|
||||
static void zram_free_page(struct zram *zram, size_t index);
|
||||
static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
|
||||
u32 index, int offset, struct bio *bio);
|
||||
|
||||
static int zram_read_page(struct zram *zram, struct page *page, u32 index,
|
||||
struct bio *parent);
|
||||
|
||||
static int zram_slot_trylock(struct zram *zram, u32 index)
|
||||
{
|
||||
|
@ -148,6 +147,7 @@ static inline bool is_partial_io(struct bio_vec *bvec)
|
|||
{
|
||||
return bvec->bv_len != PAGE_SIZE;
|
||||
}
|
||||
#define ZRAM_PARTIAL_IO 1
|
||||
#else
|
||||
static inline bool is_partial_io(struct bio_vec *bvec)
|
||||
{
|
||||
|
@ -174,36 +174,6 @@ static inline u32 zram_get_priority(struct zram *zram, u32 index)
|
|||
return prio & ZRAM_COMP_PRIORITY_MASK;
|
||||
}
|
||||
|
||||
/*
|
||||
* Check if request is within bounds and aligned on zram logical blocks.
|
||||
*/
|
||||
static inline bool valid_io_request(struct zram *zram,
|
||||
sector_t start, unsigned int size)
|
||||
{
|
||||
u64 end, bound;
|
||||
|
||||
/* unaligned request */
|
||||
if (unlikely(start & (ZRAM_SECTOR_PER_LOGICAL_BLOCK - 1)))
|
||||
return false;
|
||||
if (unlikely(size & (ZRAM_LOGICAL_BLOCK_SIZE - 1)))
|
||||
return false;
|
||||
|
||||
end = start + (size >> SECTOR_SHIFT);
|
||||
bound = zram->disksize >> SECTOR_SHIFT;
|
||||
/* out of range */
|
||||
if (unlikely(start >= bound || end > bound || start > end))
|
||||
return false;
|
||||
|
||||
/* I/O request is valid */
|
||||
return true;
|
||||
}
|
||||
|
||||
static void update_position(u32 *index, int *offset, struct bio_vec *bvec)
|
||||
{
|
||||
*index += (*offset + bvec->bv_len) / PAGE_SIZE;
|
||||
*offset = (*offset + bvec->bv_len) % PAGE_SIZE;
|
||||
}
|
||||
|
||||
static inline void update_used_max(struct zram *zram,
|
||||
const unsigned long pages)
|
||||
{
|
||||
|
@ -606,41 +576,16 @@ static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
|
|||
atomic64_dec(&zram->stats.bd_count);
|
||||
}
|
||||
|
||||
static void zram_page_end_io(struct bio *bio)
|
||||
{
|
||||
struct page *page = bio_first_page_all(bio);
|
||||
|
||||
page_endio(page, op_is_write(bio_op(bio)),
|
||||
blk_status_to_errno(bio->bi_status));
|
||||
bio_put(bio);
|
||||
}
|
||||
|
||||
/*
|
||||
* Returns 1 if the submission is successful.
|
||||
*/
|
||||
static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec,
|
||||
static void read_from_bdev_async(struct zram *zram, struct page *page,
|
||||
unsigned long entry, struct bio *parent)
|
||||
{
|
||||
struct bio *bio;
|
||||
|
||||
bio = bio_alloc(zram->bdev, 1, parent ? parent->bi_opf : REQ_OP_READ,
|
||||
GFP_NOIO);
|
||||
if (!bio)
|
||||
return -ENOMEM;
|
||||
|
||||
bio = bio_alloc(zram->bdev, 1, parent->bi_opf, GFP_NOIO);
|
||||
bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
|
||||
if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len, bvec->bv_offset)) {
|
||||
bio_put(bio);
|
||||
return -EIO;
|
||||
}
|
||||
|
||||
if (!parent)
|
||||
bio->bi_end_io = zram_page_end_io;
|
||||
else
|
||||
bio_chain(bio, parent);
|
||||
|
||||
__bio_add_page(bio, page, PAGE_SIZE, 0);
|
||||
bio_chain(bio, parent);
|
||||
submit_bio(bio);
|
||||
return 1;
|
||||
}
|
||||
|
||||
#define PAGE_WB_SIG "page_index="
|
||||
|
@ -701,10 +646,6 @@ static ssize_t writeback_store(struct device *dev,
|
|||
}
|
||||
|
||||
for (; nr_pages != 0; index++, nr_pages--) {
|
||||
struct bio_vec bvec;
|
||||
|
||||
bvec_set_page(&bvec, page, PAGE_SIZE, 0);
|
||||
|
||||
spin_lock(&zram->wb_limit_lock);
|
||||
if (zram->wb_limit_enable && !zram->bd_wb_limit) {
|
||||
spin_unlock(&zram->wb_limit_lock);
|
||||
|
@ -748,7 +689,7 @@ static ssize_t writeback_store(struct device *dev,
|
|||
/* Need for hugepage writeback racing */
|
||||
zram_set_flag(zram, index, ZRAM_IDLE);
|
||||
zram_slot_unlock(zram, index);
|
||||
if (zram_bvec_read(zram, &bvec, index, 0, NULL)) {
|
||||
if (zram_read_page(zram, page, index, NULL)) {
|
||||
zram_slot_lock(zram, index);
|
||||
zram_clear_flag(zram, index, ZRAM_UNDER_WB);
|
||||
zram_clear_flag(zram, index, ZRAM_IDLE);
|
||||
|
@ -759,9 +700,8 @@ static ssize_t writeback_store(struct device *dev,
|
|||
bio_init(&bio, zram->bdev, &bio_vec, 1,
|
||||
REQ_OP_WRITE | REQ_SYNC);
|
||||
bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
|
||||
bio_add_page(&bio, page, PAGE_SIZE, 0);
|
||||
|
||||
bio_add_page(&bio, bvec.bv_page, bvec.bv_len,
|
||||
bvec.bv_offset);
|
||||
/*
|
||||
* XXX: A single page IO would be inefficient for write
|
||||
* but it would be not bad as starter.
|
||||
|
@ -829,19 +769,20 @@ struct zram_work {
|
|||
struct work_struct work;
|
||||
struct zram *zram;
|
||||
unsigned long entry;
|
||||
struct bio *bio;
|
||||
struct bio_vec bvec;
|
||||
struct page *page;
|
||||
int error;
|
||||
};
|
||||
|
||||
#if PAGE_SIZE != 4096
|
||||
static void zram_sync_read(struct work_struct *work)
|
||||
{
|
||||
struct zram_work *zw = container_of(work, struct zram_work, work);
|
||||
struct zram *zram = zw->zram;
|
||||
unsigned long entry = zw->entry;
|
||||
struct bio *bio = zw->bio;
|
||||
struct bio_vec bv;
|
||||
struct bio bio;
|
||||
|
||||
read_from_bdev_async(zram, &zw->bvec, entry, bio);
|
||||
bio_init(&bio, zw->zram->bdev, &bv, 1, REQ_OP_READ);
|
||||
bio.bi_iter.bi_sector = zw->entry * (PAGE_SIZE >> 9);
|
||||
__bio_add_page(&bio, zw->page, PAGE_SIZE, 0);
|
||||
zw->error = submit_bio_wait(&bio);
|
||||
}
|
||||
|
||||
/*
|
||||
|
@ -849,45 +790,39 @@ static void zram_sync_read(struct work_struct *work)
|
|||
* chained IO with parent IO in same context, it's a deadlock. To avoid that,
|
||||
* use a worker thread context.
|
||||
*/
|
||||
static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
|
||||
unsigned long entry, struct bio *bio)
|
||||
static int read_from_bdev_sync(struct zram *zram, struct page *page,
|
||||
unsigned long entry)
|
||||
{
|
||||
struct zram_work work;
|
||||
|
||||
work.bvec = *bvec;
|
||||
work.page = page;
|
||||
work.zram = zram;
|
||||
work.entry = entry;
|
||||
work.bio = bio;
|
||||
|
||||
INIT_WORK_ONSTACK(&work.work, zram_sync_read);
|
||||
queue_work(system_unbound_wq, &work.work);
|
||||
flush_work(&work.work);
|
||||
destroy_work_on_stack(&work.work);
|
||||
|
||||
return 1;
|
||||
return work.error;
|
||||
}
|
||||
#else
|
||||
static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
|
||||
unsigned long entry, struct bio *bio)
|
||||
{
|
||||
WARN_ON(1);
|
||||
return -EIO;
|
||||
}
|
||||
#endif
|
||||
|
||||
static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
|
||||
unsigned long entry, struct bio *parent, bool sync)
|
||||
static int read_from_bdev(struct zram *zram, struct page *page,
|
||||
unsigned long entry, struct bio *parent)
|
||||
{
|
||||
atomic64_inc(&zram->stats.bd_reads);
|
||||
if (sync)
|
||||
return read_from_bdev_sync(zram, bvec, entry, parent);
|
||||
else
|
||||
return read_from_bdev_async(zram, bvec, entry, parent);
|
||||
if (!parent) {
|
||||
if (WARN_ON_ONCE(!IS_ENABLED(ZRAM_PARTIAL_IO)))
|
||||
return -EIO;
|
||||
return read_from_bdev_sync(zram, page, entry);
|
||||
}
|
||||
read_from_bdev_async(zram, page, entry, parent);
|
||||
return 0;
|
||||
}
|
||||
#else
|
||||
static inline void reset_bdev(struct zram *zram) {};
|
||||
static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
|
||||
unsigned long entry, struct bio *parent, bool sync)
|
||||
static int read_from_bdev(struct zram *zram, struct page *page,
|
||||
unsigned long entry, struct bio *parent)
|
||||
{
|
||||
return -EIO;
|
||||
}
|
||||
|
@ -1190,10 +1125,9 @@ static ssize_t io_stat_show(struct device *dev,
|
|||
|
||||
down_read(&zram->init_lock);
|
||||
ret = scnprintf(buf, PAGE_SIZE,
|
||||
"%8llu %8llu %8llu %8llu\n",
|
||||
"%8llu %8llu 0 %8llu\n",
|
||||
(u64)atomic64_read(&zram->stats.failed_reads),
|
||||
(u64)atomic64_read(&zram->stats.failed_writes),
|
||||
(u64)atomic64_read(&zram->stats.invalid_io),
|
||||
(u64)atomic64_read(&zram->stats.notify_free));
|
||||
up_read(&zram->init_lock);
|
||||
|
||||
|
@ -1371,20 +1305,6 @@ out:
|
|||
~(1UL << ZRAM_LOCK | 1UL << ZRAM_UNDER_WB));
|
||||
}
|
||||
|
||||
/*
|
||||
* Reads a page from the writeback devices. Corresponding ZRAM slot
|
||||
* should be unlocked.
|
||||
*/
|
||||
static int zram_bvec_read_from_bdev(struct zram *zram, struct page *page,
|
||||
u32 index, struct bio *bio, bool partial_io)
|
||||
{
|
||||
struct bio_vec bvec;
|
||||
|
||||
bvec_set_page(&bvec, page, PAGE_SIZE, 0);
|
||||
return read_from_bdev(zram, &bvec, zram_get_element(zram, index), bio,
|
||||
partial_io);
|
||||
}
|
||||
|
||||
/*
|
||||
* Reads (decompresses if needed) a page from zspool (zsmalloc).
|
||||
* Corresponding ZRAM slot should be locked.
|
||||
|
@ -1434,8 +1354,8 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
|
|||
return ret;
|
||||
}
|
||||
|
||||
static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
|
||||
struct bio *bio, bool partial_io)
|
||||
static int zram_read_page(struct zram *zram, struct page *page, u32 index,
|
||||
struct bio *parent)
|
||||
{
|
||||
int ret;
|
||||
|
||||
|
@ -1445,11 +1365,14 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
|
|||
ret = zram_read_from_zspool(zram, page, index);
|
||||
zram_slot_unlock(zram, index);
|
||||
} else {
|
||||
/* Slot should be unlocked before the function call */
|
||||
/*
|
||||
* The slot should be unlocked before reading from the backing
|
||||
* device.
|
||||
*/
|
||||
zram_slot_unlock(zram, index);
|
||||
|
||||
ret = zram_bvec_read_from_bdev(zram, page, index, bio,
|
||||
partial_io);
|
||||
ret = read_from_bdev(zram, page, zram_get_element(zram, index),
|
||||
parent);
|
||||
}
|
||||
|
||||
/* Should NEVER happen. Return bio error if it does. */
|
||||
|
@ -1459,39 +1382,34 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
|
|||
return ret;
|
||||
}
|
||||
|
||||
static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
|
||||
u32 index, int offset, struct bio *bio)
|
||||
/*
|
||||
* Use a temporary buffer to decompress the page, as the decompressor
|
||||
* always expects a full page for the output.
|
||||
*/
|
||||
static int zram_bvec_read_partial(struct zram *zram, struct bio_vec *bvec,
|
||||
u32 index, int offset)
|
||||
{
|
||||
struct page *page = alloc_page(GFP_NOIO);
|
||||
int ret;
|
||||
struct page *page;
|
||||
|
||||
page = bvec->bv_page;
|
||||
if (is_partial_io(bvec)) {
|
||||
/* Use a temporary buffer to decompress the page */
|
||||
page = alloc_page(GFP_NOIO|__GFP_HIGHMEM);
|
||||
if (!page)
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
ret = __zram_bvec_read(zram, page, index, bio, is_partial_io(bvec));
|
||||
if (unlikely(ret))
|
||||
goto out;
|
||||
|
||||
if (is_partial_io(bvec)) {
|
||||
void *src = kmap_atomic(page);
|
||||
|
||||
memcpy_to_bvec(bvec, src + offset);
|
||||
kunmap_atomic(src);
|
||||
}
|
||||
out:
|
||||
if (is_partial_io(bvec))
|
||||
__free_page(page);
|
||||
|
||||
if (!page)
|
||||
return -ENOMEM;
|
||||
ret = zram_read_page(zram, page, index, NULL);
|
||||
if (likely(!ret))
|
||||
memcpy_to_bvec(bvec, page_address(page) + offset);
|
||||
__free_page(page);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
|
||||
u32 index, struct bio *bio)
|
||||
static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
|
||||
u32 index, int offset, struct bio *bio)
|
||||
{
|
||||
if (is_partial_io(bvec))
|
||||
return zram_bvec_read_partial(zram, bvec, index, offset);
|
||||
return zram_read_page(zram, bvec->bv_page, index, bio);
|
||||
}
|
||||
|
||||
static int zram_write_page(struct zram *zram, struct page *page, u32 index)
|
||||
{
|
||||
int ret = 0;
|
||||
unsigned long alloced_pages;
|
||||
|
@ -1499,7 +1417,6 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
|
|||
unsigned int comp_len = 0;
|
||||
void *src, *dst, *mem;
|
||||
struct zcomp_strm *zstrm;
|
||||
struct page *page = bvec->bv_page;
|
||||
unsigned long element = 0;
|
||||
enum zram_pageflags flags = 0;
|
||||
|
||||
|
@ -1617,42 +1534,35 @@ out:
|
|||
return ret;
|
||||
}
|
||||
|
||||
static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
|
||||
u32 index, int offset, struct bio *bio)
|
||||
/*
|
||||
* This is a partial IO. Read the full page before writing the changes.
|
||||
*/
|
||||
static int zram_bvec_write_partial(struct zram *zram, struct bio_vec *bvec,
|
||||
u32 index, int offset, struct bio *bio)
|
||||
{
|
||||
struct page *page = alloc_page(GFP_NOIO);
|
||||
int ret;
|
||||
struct page *page = NULL;
|
||||
struct bio_vec vec;
|
||||
|
||||
vec = *bvec;
|
||||
if (is_partial_io(bvec)) {
|
||||
void *dst;
|
||||
/*
|
||||
* This is a partial IO. We need to read the full page
|
||||
* before to write the changes.
|
||||
*/
|
||||
page = alloc_page(GFP_NOIO|__GFP_HIGHMEM);
|
||||
if (!page)
|
||||
return -ENOMEM;
|
||||
if (!page)
|
||||
return -ENOMEM;
|
||||
|
||||
ret = __zram_bvec_read(zram, page, index, bio, true);
|
||||
if (ret)
|
||||
goto out;
|
||||
|
||||
dst = kmap_atomic(page);
|
||||
memcpy_from_bvec(dst + offset, bvec);
|
||||
kunmap_atomic(dst);
|
||||
|
||||
bvec_set_page(&vec, page, PAGE_SIZE, 0);
|
||||
ret = zram_read_page(zram, page, index, bio);
|
||||
if (!ret) {
|
||||
memcpy_from_bvec(page_address(page) + offset, bvec);
|
||||
ret = zram_write_page(zram, page, index);
|
||||
}
|
||||
|
||||
ret = __zram_bvec_write(zram, &vec, index, bio);
|
||||
out:
|
||||
if (is_partial_io(bvec))
|
||||
__free_page(page);
|
||||
__free_page(page);
|
||||
return ret;
|
||||
}
|
||||
|
||||
static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
|
||||
u32 index, int offset, struct bio *bio)
|
||||
{
|
||||
if (is_partial_io(bvec))
|
||||
return zram_bvec_write_partial(zram, bvec, index, offset, bio);
|
||||
return zram_write_page(zram, bvec->bv_page, index);
|
||||
}
|
||||
|
||||
#ifdef CONFIG_ZRAM_MULTI_COMP
|
||||
/*
|
||||
* This function will decompress (unless it's ZRAM_HUGE) the page and then
|
||||
|
@ -1761,7 +1671,7 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
|
|||
|
||||
/*
|
||||
* No direct reclaim (slow path) for handle allocation and no
|
||||
* re-compression attempt (unlike in __zram_bvec_write()) since
|
||||
* re-compression attempt (unlike in zram_write_bvec()) since
|
||||
* we already have stored that object in zsmalloc. If we cannot
|
||||
* alloc memory for recompressed object then we bail out and
|
||||
* simply keep the old (existing) object in zsmalloc.
|
||||
|
@ -1921,15 +1831,12 @@ release_init_lock:
|
|||
}
|
||||
#endif
|
||||
|
||||
/*
|
||||
* zram_bio_discard - handler on discard request
|
||||
* @index: physical block index in PAGE_SIZE units
|
||||
* @offset: byte offset within physical block
|
||||
*/
|
||||
static void zram_bio_discard(struct zram *zram, u32 index,
|
||||
int offset, struct bio *bio)
|
||||
static void zram_bio_discard(struct zram *zram, struct bio *bio)
|
||||
{
|
||||
size_t n = bio->bi_iter.bi_size;
|
||||
u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
|
||||
u32 offset = (bio->bi_iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
|
||||
SECTOR_SHIFT;
|
||||
|
||||
/*
|
||||
* zram manages data in physical block size units. Because logical block
|
||||
|
@ -1957,80 +1864,58 @@ static void zram_bio_discard(struct zram *zram, u32 index,
|
|||
index++;
|
||||
n -= PAGE_SIZE;
|
||||
}
|
||||
|
||||
bio_endio(bio);
|
||||
}
|
||||
|
||||
/*
|
||||
* Returns errno if it has some problem. Otherwise return 0 or 1.
|
||||
* Returns 0 if IO request was done synchronously
|
||||
* Returns 1 if IO request was successfully submitted.
|
||||
*/
|
||||
static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index,
|
||||
int offset, enum req_op op, struct bio *bio)
|
||||
static void zram_bio_read(struct zram *zram, struct bio *bio)
|
||||
{
|
||||
int ret;
|
||||
|
||||
if (!op_is_write(op)) {
|
||||
ret = zram_bvec_read(zram, bvec, index, offset, bio);
|
||||
flush_dcache_page(bvec->bv_page);
|
||||
} else {
|
||||
ret = zram_bvec_write(zram, bvec, index, offset, bio);
|
||||
}
|
||||
|
||||
zram_slot_lock(zram, index);
|
||||
zram_accessed(zram, index);
|
||||
zram_slot_unlock(zram, index);
|
||||
|
||||
if (unlikely(ret < 0)) {
|
||||
if (!op_is_write(op))
|
||||
atomic64_inc(&zram->stats.failed_reads);
|
||||
else
|
||||
atomic64_inc(&zram->stats.failed_writes);
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
static void __zram_make_request(struct zram *zram, struct bio *bio)
|
||||
{
|
||||
int offset;
|
||||
u32 index;
|
||||
struct bio_vec bvec;
|
||||
struct bvec_iter iter;
|
||||
struct bio_vec bv;
|
||||
unsigned long start_time;
|
||||
|
||||
index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
|
||||
offset = (bio->bi_iter.bi_sector &
|
||||
(SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
|
||||
start_time = bio_start_io_acct(bio);
|
||||
bio_for_each_segment(bv, bio, iter) {
|
||||
u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
|
||||
u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
|
||||
SECTOR_SHIFT;
|
||||
|
||||
switch (bio_op(bio)) {
|
||||
case REQ_OP_DISCARD:
|
||||
case REQ_OP_WRITE_ZEROES:
|
||||
zram_bio_discard(zram, index, offset, bio);
|
||||
bio_endio(bio);
|
||||
return;
|
||||
default:
|
||||
break;
|
||||
if (zram_bvec_read(zram, &bv, index, offset, bio) < 0) {
|
||||
atomic64_inc(&zram->stats.failed_reads);
|
||||
bio->bi_status = BLK_STS_IOERR;
|
||||
break;
|
||||
}
|
||||
flush_dcache_page(bv.bv_page);
|
||||
|
||||
zram_slot_lock(zram, index);
|
||||
zram_accessed(zram, index);
|
||||
zram_slot_unlock(zram, index);
|
||||
}
|
||||
bio_end_io_acct(bio, start_time);
|
||||
bio_endio(bio);
|
||||
}
|
||||
|
||||
static void zram_bio_write(struct zram *zram, struct bio *bio)
|
||||
{
|
||||
struct bvec_iter iter;
|
||||
struct bio_vec bv;
|
||||
unsigned long start_time;
|
||||
|
||||
start_time = bio_start_io_acct(bio);
|
||||
bio_for_each_segment(bvec, bio, iter) {
|
||||
struct bio_vec bv = bvec;
|
||||
unsigned int unwritten = bvec.bv_len;
|
||||
bio_for_each_segment(bv, bio, iter) {
|
||||
u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
|
||||
u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
|
||||
SECTOR_SHIFT;
|
||||
|
||||
do {
|
||||
bv.bv_len = min_t(unsigned int, PAGE_SIZE - offset,
|
||||
unwritten);
|
||||
if (zram_bvec_rw(zram, &bv, index, offset,
|
||||
bio_op(bio), bio) < 0) {
|
||||
bio->bi_status = BLK_STS_IOERR;
|
||||
break;
|
||||
}
|
||||
if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
|
||||
atomic64_inc(&zram->stats.failed_writes);
|
||||
bio->bi_status = BLK_STS_IOERR;
|
||||
break;
|
||||
}
|
||||
|
||||
bv.bv_offset += bv.bv_len;
|
||||
unwritten -= bv.bv_len;
|
||||
|
||||
update_position(&index, &offset, &bv);
|
||||
} while (unwritten);
|
||||
zram_slot_lock(zram, index);
|
||||
zram_accessed(zram, index);
|
||||
zram_slot_unlock(zram, index);
|
||||
}
|
||||
bio_end_io_acct(bio, start_time);
|
||||
bio_endio(bio);
|
||||
|
@ -2043,14 +1928,21 @@ static void zram_submit_bio(struct bio *bio)
|
|||
{
|
||||
struct zram *zram = bio->bi_bdev->bd_disk->private_data;
|
||||
|
||||
if (!valid_io_request(zram, bio->bi_iter.bi_sector,
|
||||
bio->bi_iter.bi_size)) {
|
||||
atomic64_inc(&zram->stats.invalid_io);
|
||||
bio_io_error(bio);
|
||||
return;
|
||||
switch (bio_op(bio)) {
|
||||
case REQ_OP_READ:
|
||||
zram_bio_read(zram, bio);
|
||||
break;
|
||||
case REQ_OP_WRITE:
|
||||
zram_bio_write(zram, bio);
|
||||
break;
|
||||
case REQ_OP_DISCARD:
|
||||
case REQ_OP_WRITE_ZEROES:
|
||||
zram_bio_discard(zram, bio);
|
||||
break;
|
||||
default:
|
||||
WARN_ON_ONCE(1);
|
||||
bio_endio(bio);
|
||||
}
|
||||
|
||||
__zram_make_request(zram, bio);
|
||||
}
|
||||
|
||||
static void zram_slot_free_notify(struct block_device *bdev,
|
||||
|
|
|
@@ -78,7 +78,6 @@ struct zram_stats {
atomic64_t compr_data_size; /* compressed size of pages stored */
atomic64_t failed_reads; /* can happen when memory is too low */
atomic64_t failed_writes; /* can happen when memory is too low */
atomic64_t invalid_io; /* non-page-aligned I/O requests */
atomic64_t notify_free; /* no. of swap slot free notifications */
atomic64_t same_pages; /* no. of same element filled pages */
atomic64_t huge_pages; /* no. of huge pages */

@@ -892,7 +892,7 @@ static int sev_ioctl_do_get_id2(struct sev_issue_cmd *argp)
/*
* The length of the ID shouldn't be assumed by software since
* it may change in the future. The allocation size is limited
* to 1 << (PAGE_SHIFT + MAX_ORDER - 1) by the page allocator.
* to 1 << (PAGE_SHIFT + MAX_ORDER) by the page allocator.
* If the allocation fails, simply return ENOMEM rather than
* warning in the kernel log.
*/

@@ -70,11 +70,11 @@ struct hisi_acc_sgl_pool *hisi_acc_create_sgl_pool(struct device *dev,
HISI_ACC_SGL_ALIGN_SIZE);

/*
* the pool may allocate a block of memory of size PAGE_SIZE * 2^(MAX_ORDER - 1),
* the pool may allocate a block of memory of size PAGE_SIZE * 2^MAX_ORDER,
* block size may exceed 2^31 on ia64, so the max of block size is 2^31
*/
block_size = 1 << (PAGE_SHIFT + MAX_ORDER <= 32 ?
PAGE_SHIFT + MAX_ORDER - 1 : 31);
block_size = 1 << (PAGE_SHIFT + MAX_ORDER < 32 ?
PAGE_SHIFT + MAX_ORDER : 31);
sgl_num_per_block = block_size / sgl_size;
block_num = count / sgl_num_per_block;
remain_sgl = count % sgl_num_per_block;

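The guard above keeps the block size below 2^31 even for large page sizes. Stand-alone version with example constants:

#include <stdio.h>

#define PAGE_SHIFT 16	/* e.g. 64K pages */
#define MAX_ORDER  13

int main(void)
{
	unsigned int shift = (PAGE_SHIFT + MAX_ORDER < 32) ?
			     (PAGE_SHIFT + MAX_ORDER) : 31;
	unsigned long block_size = 1UL << shift;

	printf("shift=%u block_size=%lu\n", shift, block_size);
	return 0;
}
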
@@ -41,12 +41,11 @@ struct dma_heap_attachment {
bool mapped;
};

#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
#define MID_ORDER_GFP (LOW_ORDER_GFP | __GFP_NOWARN)
#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO)
#define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
| __GFP_NORETRY) & ~__GFP_RECLAIM) \
| __GFP_COMP)
static gfp_t order_flags[] = {HIGH_ORDER_GFP, MID_ORDER_GFP, LOW_ORDER_GFP};
static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
/*
* The selection of the orders used for allocation (1MB, 64K, 4K) is designed
* to match with the sizes often found in IOMMUs. Using order 4 pages instead

@@ -115,7 +115,7 @@ static int get_huge_pages(struct drm_i915_gem_object *obj)
do {
struct page *page;

GEM_BUG_ON(order >= MAX_ORDER);
GEM_BUG_ON(order > MAX_ORDER);
page = alloc_pages(GFP | __GFP_ZERO, order);
if (!page)
goto err;

@@ -261,7 +261,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
* encryption bits. This is because the exact location of the
* data may not be known at mmap() time and may also change
* at arbitrary times while the data is mmap'ed.
* See vmf_insert_mixed_prot() for a discussion.
* See vmf_insert_pfn_prot() for a discussion.
*/
ret = vmf_insert_pfn_prot(vma, address, pfn, prot);

@ -65,11 +65,11 @@ module_param(page_pool_size, ulong, 0644);
|
|||
|
||||
static atomic_long_t allocated_pages;
|
||||
|
||||
static struct ttm_pool_type global_write_combined[MAX_ORDER];
|
||||
static struct ttm_pool_type global_uncached[MAX_ORDER];
|
||||
static struct ttm_pool_type global_write_combined[MAX_ORDER + 1];
|
||||
static struct ttm_pool_type global_uncached[MAX_ORDER + 1];
|
||||
|
||||
static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER];
|
||||
static struct ttm_pool_type global_dma32_uncached[MAX_ORDER];
|
||||
static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER + 1];
|
||||
static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1];
|
||||
|
||||
static spinlock_t shrinker_lock;
|
||||
static struct list_head shrinker_list;
|
||||
|
@ -444,7 +444,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
|
|||
else
|
||||
gfp_flags |= GFP_HIGHUSER;
|
||||
|
||||
for (order = min_t(unsigned int, MAX_ORDER - 1, __fls(num_pages));
|
||||
for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
|
||||
num_pages;
|
||||
order = min_t(unsigned int, order, __fls(num_pages))) {
|
||||
struct ttm_pool_type *pt;
|
||||
|
@ -563,7 +563,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
|
|||
|
||||
if (use_dma_alloc) {
|
||||
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
|
||||
for (j = 0; j < MAX_ORDER; ++j)
|
||||
for (j = 0; j <= MAX_ORDER; ++j)
|
||||
ttm_pool_type_init(&pool->caching[i].orders[j],
|
||||
pool, i, j);
|
||||
}
|
||||
|
@ -583,7 +583,7 @@ void ttm_pool_fini(struct ttm_pool *pool)
|
|||
|
||||
if (pool->use_dma_alloc) {
|
||||
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
|
||||
for (j = 0; j < MAX_ORDER; ++j)
|
||||
for (j = 0; j <= MAX_ORDER; ++j)
|
||||
ttm_pool_type_fini(&pool->caching[i].orders[j]);
|
||||
}
|
||||
|
||||
|
@@ -637,7 +637,7 @@ static void ttm_pool_debugfs_header(struct seq_file *m)
 	unsigned int i;
 
 	seq_puts(m, "\t ");
-	for (i = 0; i < MAX_ORDER; ++i)
+	for (i = 0; i <= MAX_ORDER; ++i)
 		seq_printf(m, " ---%2u---", i);
 	seq_puts(m, "\n");
 }
@@ -648,7 +648,7 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt,
 {
 	unsigned int i;
 
-	for (i = 0; i < MAX_ORDER; ++i)
+	for (i = 0; i <= MAX_ORDER; ++i)
 		seq_printf(m, " %8u", ttm_pool_type_count(&pt[i]));
 	seq_puts(m, "\n");
 }
@@ -757,7 +757,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 	spin_lock_init(&shrinker_lock);
 	INIT_LIST_HEAD(&shrinker_list);
 
-	for (i = 0; i < MAX_ORDER; ++i) {
+	for (i = 0; i <= MAX_ORDER; ++i) {
 		ttm_pool_type_init(&global_write_combined[i], NULL,
 				   ttm_write_combined, i);
 		ttm_pool_type_init(&global_uncached[i], NULL, ttm_uncached, i);
@@ -790,7 +790,7 @@ void ttm_pool_mgr_fini(void)
 {
 	unsigned int i;
 
-	for (i = 0; i < MAX_ORDER; ++i) {
+	for (i = 0; i <= MAX_ORDER; ++i) {
 		ttm_pool_type_fini(&global_write_combined[i]);
 		ttm_pool_type_fini(&global_uncached[i]);
 
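The ttm_pool hunks pair the array-size bump (MAX_ORDER + 1 entries) with inclusive loop bounds, so the pool for MAX_ORDER itself gets initialized, walked and torn down. A hedged sketch of that sizing invariant in plain C; the names and the value MAX_ORDER = 10 are assumptions for illustration, not the driver's:

/* order_table.c - illustrative sketch, not the TTM implementation.
 * An array indexed by allocation order needs MAX_ORDER + 1 slots once
 * MAX_ORDER is the highest *valid* order rather than one past it.
 */
#include <stdio.h>

#define MAX_ORDER 10   /* assumed inclusive maximum order */

static unsigned long pool_count[MAX_ORDER + 1];   /* one slot per order 0..MAX_ORDER */

int main(void)
{
	/* Inclusive loop, mirroring "for (i = 0; i <= MAX_ORDER; ++i)" above. */
	for (unsigned int order = 0; order <= MAX_ORDER; order++)
		pool_count[order] = 0;

	printf("slots: %zu (orders 0..%d)\n",
	       sizeof(pool_count) / sizeof(pool_count[0]), MAX_ORDER);
	return 0;
}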
@@ -182,7 +182,7 @@
 #ifdef CONFIG_CMA_ALIGNMENT
 #define Q_MAX_SZ_SHIFT			(PAGE_SHIFT + CONFIG_CMA_ALIGNMENT)
 #else
-#define Q_MAX_SZ_SHIFT			(PAGE_SHIFT + MAX_ORDER - 1)
+#define Q_MAX_SZ_SHIFT			(PAGE_SHIFT + MAX_ORDER)
 #endif
 
 /*
@@ -736,7 +736,7 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
 	struct page **pages;
 	unsigned int i = 0, nid = dev_to_node(dev);
 
-	order_mask &= (2U << MAX_ORDER) - 1;
+	order_mask &= GENMASK(MAX_ORDER, 0);
 	if (!order_mask)
 		return NULL;
 
@@ -756,7 +756,7 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
 		 * than a necessity, hence using __GFP_NORETRY until
 		 * falling back to minimum-order allocations.
 		 */
-		for (order_mask &= (2U << __fls(count)) - 1;
+		for (order_mask &= GENMASK(__fls(count), 0);
 		     order_mask; order_mask &= ~order_size) {
 			unsigned int order = __fls(order_mask);
 			gfp_t alloc_flags = gfp;
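Both dma-iommu hunks swap the open-coded "(2U << n) - 1" for GENMASK(n, 0); the two forms produce the same mask of bits 0..n. A small userspace check of that equivalence; the GENMASK below is a simplified 32-bit stand-in defined locally for the example, not included from the kernel headers:

/* genmask_equiv.c - demonstrates (2U << n) - 1 == GENMASK(n, 0) for small n. */
#include <assert.h>
#include <stdio.h>

/* Simplified local stand-in for the kernel's GENMASK(h, l), 32-bit only. */
#define GENMASK(h, l) \
	(((~0U) >> (31 - (h))) & ((~0U) << (l)))

int main(void)
{
	for (unsigned int n = 0; n <= 12; n++) {
		unsigned int open_coded = (2U << n) - 1;   /* old spelling */

		assert(open_coded == GENMASK(n, 0));       /* new spelling */
		printf("n=%2u mask=0x%04x\n", n, open_coded);
	}
	return 0;
}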
@@ -2445,8 +2445,8 @@ static bool its_parse_indirect_baser(struct its_node *its,
 	 * feature is not supported by hardware.
 	 */
 	new_order = max_t(u32, get_order(esz << ids), new_order);
-	if (new_order >= MAX_ORDER) {
-		new_order = MAX_ORDER - 1;
+	if (new_order > MAX_ORDER) {
+		new_order = MAX_ORDER;
 		ids = ilog2(PAGE_ORDER_TO_SIZE(new_order) / (int)esz);
 		pr_warn("ITS@%pa: %s Table too large, reduce ids %llu->%u\n",
 			&its->phys_base, its_base_type_string[type],
@@ -1134,7 +1134,7 @@ static void __cache_size_refresh(void)
  * If the allocation may fail we use __get_free_pages. Memory fragmentation
  * won't have a fatal effect here, but it just causes flushes of some other
  * buffers and more I/O will be performed. Don't use __get_free_pages if it
- * always fails (i.e. order >= MAX_ORDER).
+ * always fails (i.e. order > MAX_ORDER).
 *
 * If the allocation shouldn't fail we use __vmalloc. This is only for the
 * initial reserve allocation, so there's no risk of wasting all vmalloc
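The dm-bufio comment describes a simple decision: a may-fail allocation tries the page allocator only if its order can ever succeed, otherwise (or for must-not-fail callers) a vmalloc-style path is used. A rough, illustrative sketch of that decision with userspace stand-ins; the function names, constants and the malloc/calloc substitutes are all assumptions for the example and not dm-bufio's code:

/* alloc_strategy.c - illustrative sketch of the choice the comment above
 * describes; malloc() stands in for __get_free_pages(), calloc() for
 * __vmalloc(). Not kernel code.
 */
#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDER  10   /* assumed inclusive maximum page order */
#define PAGE_SHIFT 12   /* assumed 4 KiB pages */

static unsigned int size_to_order(size_t size)
{
	unsigned int order = 0;

	while (((size_t)1 << (order + PAGE_SHIFT)) < size)
		order++;
	return order;
}

static void *buffer_alloc(size_t size, int may_fail)
{
	/* Only try the page allocator if the order can ever succeed. */
	if (may_fail && size_to_order(size) <= MAX_ORDER)
		return malloc(size);      /* stand-in for __get_free_pages() */
	return calloc(1, size);           /* stand-in for __vmalloc() */
}

int main(void)
{
	void *p = buffer_alloc(1 << 20, 1);

	printf("allocated %p\n", p);
	free(p);
	return 0;
}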
@@ -1828,7 +1828,7 @@ int dm_cache_metadata_abort(struct dm_cache_metadata *cmd)
 	 * Replacement block manager (new_bm) is created and old_bm destroyed outside of
 	 * cmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
 	 * shrinker associated with the block manager's bufio client vs cmd root_lock).
-	 * - must take shrinker_rwsem without holding cmd->root_lock
+	 * - must take shrinker_mutex without holding cmd->root_lock
 	 */
 	new_bm = dm_block_manager_create(cmd->bdev, DM_CACHE_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
 					 CACHE_MAX_CONCURRENT_LOCKS);
@@ -1887,7 +1887,7 @@ int dm_pool_abort_metadata(struct dm_pool_metadata *pmd)
 	 * Replacement block manager (new_bm) is created and old_bm destroyed outside of
 	 * pmd root_lock to avoid ABBA deadlock that would result (due to life-cycle of
 	 * shrinker associated with the block manager's bufio client vs pmd root_lock).
-	 * - must take shrinker_rwsem without holding pmd->root_lock
+	 * - must take shrinker_mutex without holding pmd->root_lock
 	 */
 	new_bm = dm_block_manager_create(pmd->bdev, THIN_METADATA_BLOCK_SIZE << SECTOR_SHIFT,
 					 THIN_MAX_CONCURRENT_LOCKS);
@@ -210,7 +210,7 @@ u32 genwqe_crc32(u8 *buff, size_t len, u32 init)
 void *__genwqe_alloc_consistent(struct genwqe_dev *cd, size_t size,
 			       dma_addr_t *dma_handle)
 {
-	if (get_order(size) >= MAX_ORDER)
+	if (get_order(size) > MAX_ORDER)
 		return NULL;
 
 	return dma_alloc_coherent(&cd->pci_dev->dev, size, dma_handle,
@@ -1040,7 +1040,7 @@ static void hns3_init_tx_spare_buffer(struct hns3_enet_ring *ring)
 		return;
 
 	order = get_order(alloc_size);
-	if (order >= MAX_ORDER) {
+	if (order > MAX_ORDER) {
 		if (net_ratelimit())
 			dev_warn(ring_to_dev(ring), "failed to allocate tx spare buffer, exceed to max order\n");
 		return;
@@ -75,7 +75,7 @@
  * pool for the 4MB. Thus the 16 Rx and Tx queues require 32 * 5 = 160
  * plus 16 for the TSO pools for a total of 176 LTB mappings per VNIC.
  */
-#define IBMVNIC_ONE_LTB_MAX	((u32)((1 << (MAX_ORDER - 1)) * PAGE_SIZE))
+#define IBMVNIC_ONE_LTB_MAX	((u32)((1 << MAX_ORDER) * PAGE_SIZE))
 #define IBMVNIC_ONE_LTB_SIZE	min((u32)(8 << 20), IBMVNIC_ONE_LTB_MAX)
 #define IBMVNIC_LTB_SET_SIZE	(38 << 20)
 
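Under the inclusive definition, "1 << MAX_ORDER" pages is the largest buddy allocation, so IBMVNIC_ONE_LTB_MAX keeps the same value the old "1 << (MAX_ORDER - 1)" spelling had. Worked numbers, assuming 4 KiB pages and the common defaults (11 under the old convention, 10 under the new) rather than anything stated in this diff:

/* ltb_max.c - arithmetic check that the LTB ceiling is unchanged. */
#include <stdio.h>

#define PAGE_SIZE     4096u   /* assumed 4 KiB pages */
#define OLD_MAX_ORDER 11u     /* old exclusive convention */
#define NEW_MAX_ORDER 10u     /* new inclusive convention */

int main(void)
{
	unsigned int old_max = (1u << (OLD_MAX_ORDER - 1)) * PAGE_SIZE;
	unsigned int new_max = (1u << NEW_MAX_ORDER) * PAGE_SIZE;

	/* Both expressions describe the same 4 MiB ceiling. */
	printf("old: %u bytes (%u MiB)\n", old_max, old_max >> 20);
	printf("new: %u bytes (%u MiB)\n", new_max, new_max >> 20);
	return 0;
}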
@@ -946,7 +946,7 @@ static phys_addr_t hvfb_get_phymem(struct hv_device *hdev,
 	if (request_size == 0)
 		return -1;
 
-	if (order < MAX_ORDER) {
+	if (order <= MAX_ORDER) {
 		/* Call alloc_pages if the size is less than 2^MAX_ORDER */
 		page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
 		if (!page)
@@ -977,7 +977,7 @@ static void hvfb_release_phymem(struct hv_device *hdev,
 {
 	unsigned int order = get_order(size);
 
-	if (order < MAX_ORDER)
+	if (order <= MAX_ORDER)
 		__free_pages(pfn_to_page(paddr >> PAGE_SHIFT), order);
 	else
 		dma_free_coherent(&hdev->device,
@@ -197,7 +197,7 @@ static int vmlfb_alloc_vram(struct vml_info *vinfo,
 		va = &vinfo->vram[i];
 		order = 0;
 
-		while (requested > (PAGE_SIZE << order) && order < MAX_ORDER)
+		while (requested > (PAGE_SIZE << order) && order <= MAX_ORDER)
 			order++;
 
 		err = vmlfb_alloc_vram_area(va, order, 0);
@@ -33,7 +33,7 @@
 #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
 					     __GFP_NOMEMALLOC)
 /* The order of free page blocks to report to host */
-#define VIRTIO_BALLOON_HINT_BLOCK_ORDER (MAX_ORDER - 1)
+#define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_ORDER
 /* The size of a free page block in bytes */
 #define VIRTIO_BALLOON_HINT_BLOCK_BYTES \
 	(1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT))
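With the hint block order now MAX_ORDER itself, the block size follows directly from "1 << (order + PAGE_SHIFT)". Worked arithmetic under the usual assumptions (MAX_ORDER = 10, PAGE_SHIFT = 12 for 4 KiB pages), purely for illustration:

/* hint_block.c - size of a free-page hint block under assumed defaults. */
#include <stdio.h>

#define PAGE_SHIFT 12   /* assumed 4 KiB pages */
#define MAX_ORDER  10   /* assumed inclusive maximum order */

#define HINT_BLOCK_ORDER MAX_ORDER
#define HINT_BLOCK_BYTES (1u << (HINT_BLOCK_ORDER + PAGE_SHIFT))

int main(void)
{
	/* 1 << (10 + 12) = 4 MiB per reported block with these assumptions. */
	printf("hint block: order %d = %u bytes (%u MiB)\n",
	       HINT_BLOCK_ORDER, HINT_BLOCK_BYTES, HINT_BLOCK_BYTES >> 20);
	return 0;
}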
@@ -1120,13 +1120,13 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
  */
 static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
 {
-	unsigned long order = MAX_ORDER - 1;
+	unsigned long order = MAX_ORDER;
 	unsigned long i;
 
 	/*
 	 * We might get called for ranges that don't cover properly aligned
-	 * MAX_ORDER - 1 pages; however, we can only online properly aligned
-	 * pages with an order of MAX_ORDER - 1 at maximum.
+	 * MAX_ORDER pages; however, we can only online properly aligned
+	 * pages with an order of MAX_ORDER at maximum.
 	 */
 	while (!IS_ALIGNED(pfn | nr_pages, 1 << order))
 		order--;
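The loop at the end of this hunk lowers the starting order until both the start pfn and the length are aligned to it. The standalone snippet below reproduces just that alignment walk; the input pfn/length values are hypothetical and the IS_ALIGNED macro is a local copy for the example:

/* align_order.c - the "lower the order until aligned" walk, standalone. */
#include <stdio.h>

#define MAX_ORDER 10u   /* assumed inclusive maximum order */

/* a must be a power of two */
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

int main(void)
{
	unsigned long pfn = 0x12340, nr_pages = 0x180;   /* hypothetical range */
	unsigned long order = MAX_ORDER;

	/* Mirrors: while (!IS_ALIGNED(pfn | nr_pages, 1 << order)) order--; */
	while (!IS_ALIGNED(pfn | nr_pages, 1ul << order))
		order--;

	printf("largest usable order for pfn=0x%lx nr=0x%lx is %lu\n",
	       pfn, nr_pages, order);
	return 0;
}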
@@ -1237,9 +1237,9 @@ static void virtio_mem_online_page(struct virtio_mem *vm,
 	bool do_online;
 
 	/*
-	 * We can get called with any order up to MAX_ORDER - 1. If our
-	 * subblock size is smaller than that and we have a mixture of plugged
-	 * and unplugged subblocks within such a page, we have to process in
+	 * We can get called with any order up to MAX_ORDER. If our subblock
+	 * size is smaller than that and we have a mixture of plugged and
+	 * unplugged subblocks within such a page, we have to process in
 	 * smaller granularity. In that case we'll adjust the order exactly once
 	 * within the loop.
 	 */

Some files were not shown because too many files have changed in this diff.