OpenCloudOS-Kernel/mm
David Hildenbrand 395f6081ba drivers/base/memory: determine and store zone for single-zone memory blocks
test_pages_in_a_zone() is just another nasty PFN walker that can easily
stumble over ZONE_DEVICE memory ranges falling into the same memory block
as ordinary system RAM: the memmap of parts of these ranges might possibly
be uninitialized.  In fact, we observed (on an older kernel) with UBSAN:

  UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
  index 7 is out of range for type 'zone [5]'
  CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
  Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
  Call Trace:
   dump_stack+0x9a/0xf0
   ubsan_epilogue+0x9/0x7a
   __ubsan_handle_out_of_bounds+0x13a/0x181
   test_pages_in_a_zone+0x3c4/0x500
   show_valid_zones+0x1fa/0x380
   dev_attr_show+0x43/0xb0
   sysfs_kf_seq_show+0x1c5/0x440
   seq_read+0x49d/0x1190
   vfs_read+0xff/0x300
   ksys_read+0xb8/0x170
   do_syscall_64+0xa5/0x4b0
   entry_SYSCALL_64_after_hwframe+0x6a/0xdf
  RIP: 0033:0x7f01f4439b52

We seem to stumble over a memmap that contains a garbage zone id.  While
we could try inserting pfn_to_online_page() calls, it will just make
memory offlining slower, because we use test_pages_in_a_zone() to make
sure we're offlining pages that all belong to the same zone.

Let's just get rid of this PFN walker and determine the single zone of a
memory block -- if any -- for early memory blocks during boot.  For memory
onlining, we know the single zone already.  Let's avoid any additional
memmap scanning and just rely on the zone information available during
boot.

For memory hot(un)plug, we only really care about memory blocks that:
* span a single zone (and, thereby, a single node)
* are completely System RAM (IOW, no holes, no ZONE_DEVICE)
If one of these conditions is not met, we reject memory offlining.
Hotplugged memory blocks (starting out offline), always meet both
conditions.

There are three scenarios to handle:

(1) Memory hot(un)plug

A memory block with zone == NULL cannot be offlined, corresponding to
our previous test_pages_in_a_zone() check.

After successful memory onlining/offlining, we simply set the zone
accordingly.
* Memory onlining: set the zone we just used for onlining
* Memory offlining: set zone = NULL

So a hotplugged memory block starts with zone = NULL. Once memory
onlining is done, we set the proper zone.

(2) Boot memory with !CONFIG_NUMA

We know that there is just a single pgdat, so we simply scan all zones
of that pgdat for an intersection with our memory block PFN range when
adding the memory block. If more than one zone intersects (e.g., DMA and
DMA32 on x86 for the first memory block) we set zone = NULL and
consequently mimic what test_pages_in_a_zone() used to do.

(3) Boot memory with CONFIG_NUMA

At the point in time we create the memory block devices during boot, we
don't know yet which nodes *actually* span a memory block. While we could
scan all zones of all nodes for intersections, overlapping nodes complicate
the situation and scanning all nodes is possibly expensive. But that
problem has already been solved by the code that sets the node of a memory
block and creates the link in the sysfs --
do_register_memory_block_under_node().

So, we hook into the code that sets the node id for a memory block. If
we already have a different node id set for the memory block, we know
that multiple nodes *actually* have PFNs falling into our memory block:
we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
to do. If there is no node id set, we do the same as (2) for the given
node.

Note that the call order in driver_init() is:
-> memory_dev_init(): create memory block devices
-> node_dev_init(): link memory block devices to the node and set the
		    node id

So in summary, we detect if there is a single zone responsible for this
memory block and we consequently store the zone in that case in the
memory block, updating it during memory onlining/offlining.

Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Rafael Parra <rparrazo@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rafael Parra <rparrazo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:10 -07:00
..
damon mm/damon: hide kernel pointer from tracepoint event 2022-01-15 16:30:33 +02:00
kasan lib/stackdepot: always do filter_irq_stacks() in stack_depot_save() 2022-01-22 08:33:38 +02:00
kfence kfence: make test case compatible with run time set sample interval 2022-02-11 17:55:00 -08:00
Kconfig mm/hugetlb: generalize ARCH_WANT_GENERAL_HUGETLB 2022-03-22 15:57:08 -07:00
Kconfig.debug mm: page table check 2022-01-15 16:30:28 +02:00
Makefile mm: remove cleancache 2022-01-22 08:33:38 +02:00
backing-dev.c remove congestion tracking framework 2022-03-22 15:57:01 -07:00
balloon_compaction.c mm: fix typos in comments 2021-05-07 00:26:35 -07:00
bootmem_info.c bootmem: Use page->index instead of page->freelist 2022-01-06 12:27:03 +01:00
cma.c mm/cma: provide option to opt out from exposing pages on activation failure 2022-03-22 15:57:09 -07:00
cma.h mm/cma: provide option to opt out from exposing pages on activation failure 2022-03-22 15:57:09 -07:00
cma_debug.c mm/cma: change cma mutex to irq safe spinlock 2021-05-05 11:27:21 -07:00
cma_sysfs.c mm: cma: support sysfs 2021-05-05 11:27:24 -07:00
compaction.c mm: compaction: cleanup the compaction trace events 2022-03-22 15:57:09 -07:00
debug.c mm,fs: split dump_mapping() out from dump_page() 2022-01-15 16:30:26 +02:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: remove pte entry from the page table 2022-02-04 09:25:04 -08:00
dmapool.c mm/dmapool.c: revert "make dma pool to use kmalloc_node" 2022-01-15 16:30:28 +02:00
early_ioremap.c mm/early_ioremap.c: remove redundant early_ioremap_shutdown() 2021-09-08 11:50:24 -07:00
fadvise.c remove inode_congested() 2022-03-22 15:57:01 -07:00
failslab.c
filemap.c tmpfs: do not allocate pages on read 2022-03-22 15:57:02 -07:00
folio-compat.c filemap: Add filemap_release_folio() 2022-01-04 13:15:34 -05:00
frontswap.c frontswap: remove support for multiple ops 2022-01-22 08:33:38 +02:00
gup.c mm/gup: remove unused get_user_pages_locked() 2022-03-22 15:57:01 -07:00
gup_test.c selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages 2021-05-05 11:27:26 -07:00
gup_test.h selftests/vm: gup_test: fix test flag 2021-05-05 11:27:26 -07:00
highmem.c Fixes for 5.16 folios: 2021-11-25 10:13:56 -08:00
hmm.c mm/hmm.c: allow VM_MIXEDMAP to work with hmm_range_fault 2022-01-15 16:30:31 +02:00
huge_memory.c memory tiering: skip to scan fast memory 2022-03-22 15:57:09 -07:00
hugetlb.c userfaultfd: provide unmasked address on page-fault 2022-03-22 15:57:08 -07:00
hugetlb_cgroup.c hugetlb: add hugetlb.*.numa_stat file 2022-01-15 16:30:29 +02:00
hugetlb_vmemmap.c mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key 2022-03-22 15:57:08 -07:00
hugetlb_vmemmap.h mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate 2021-06-30 20:47:25 -07:00
hwpoison-inject.c mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler 2022-03-22 15:57:07 -07:00
init-mm.c mm: add setup_initial_init_mm() helper 2021-07-08 11:48:21 -07:00
internal.h mm: handle uninitialized numa nodes gracefully 2022-03-22 15:57:10 -07:00
interval_tree.c mm/interval_tree: add comments to improve code readability 2021-04-30 11:20:38 -07:00
io-mapping.c mm: add a io_mapping_map_user helper 2021-04-30 11:20:39 -07:00
ioremap.c mm: move ioremap_page_range to vmalloc.c 2021-09-08 11:50:24 -07:00
khugepaged.c mm/page_table_check: check entries at pmd levels 2022-02-04 09:25:04 -08:00
kmemleak.c mm/kmemleak: avoid scanning potential huge holes 2022-02-04 09:25:05 -08:00
ksm.c mm/ksm: use helper macro __ATTR_RW 2022-03-22 15:57:09 -07:00
list_lru.c mm/list_lru: optimize memcg_reparent_list_lru_node() 2022-03-22 15:57:08 -07:00
maccess.c ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault 2021-08-20 11:39:25 +01:00
madvise.c mm: madvise: skip unmapped vma holes passed to process_madvise 2022-03-22 15:57:10 -07:00
mapping_dirty_helpers.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
memblock.c memblock: use kfree() to release kmalloced memblock regions 2022-02-20 08:45:39 +02:00
memcontrol.c memcg: do not tweak node in alloc_mem_cgroup_per_node_info 2022-03-22 15:57:10 -07:00
memfd.c memfd: fix F_SEAL_WRITE after shmem huge page allocated 2022-03-05 11:08:32 -08:00
memory-failure.c mm/memory-failure.c: make non-LRU movable pages unhandlable 2022-03-22 15:57:07 -07:00
memory.c userfaultfd: provide unmasked address on page-fault 2022-03-22 15:57:08 -07:00
memory_hotplug.c drivers/base/memory: determine and store zone for single-zone memory blocks 2022-03-22 15:57:10 -07:00
mempolicy.c mempolicy: mbind_range() set_policy() after vma_merge() 2022-03-22 15:57:09 -07:00
mempool.c mm: remove spurious blkdev.h includes 2021-10-18 06:17:01 -06:00
memremap.c mm/memremap: avoid calling kasan_remove_zero_shadow() for device private memory 2022-03-22 15:57:01 -07:00
memtest.c
migrate.c NUMA balancing: optimize page placement for memory tiering system 2022-03-22 15:57:09 -07:00
mincore.c inode: make init and permission helpers idmapped mount aware 2021-01-24 14:27:16 +01:00
mlock.c mm/mlock: fix potential imbalanced rlimit ucounts adjustment 2022-03-22 15:57:07 -07:00
mm_init.c include/linux/page-flags-layout.h: cleanups 2021-04-30 11:20:42 -07:00
mmap.c mm/mmap: remove obsolete comment in ksys_mmap_pgoff 2022-03-22 15:57:05 -07:00
mmap_lock.c mm: mmap_lock: fix disabling preemption directly 2021-07-23 17:43:28 -07:00
mmu_gather.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
mmu_notifier.c mm/mmu_notifiers: ensure range_end() is paired with range_start() 2021-03-25 09:22:55 -07:00
mmzone.c mm/mmzone.c: use try_cmpxchg() in page_cpupid_xchg_last() 2022-03-22 15:57:05 -07:00
mprotect.c memory tiering: skip to scan fast memory 2022-03-22 15:57:09 -07:00
mremap.c mm/mremap:: use vma_lookup() instead of find_vma() 2022-03-22 15:57:05 -07:00
msync.c mm/msync: exit early when the flags is an MS_ASYNC and start < vm_start 2021-04-30 11:20:37 -07:00
nommu.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
oom_kill.c mm/oom_kill: remove unneeded is_memcg_oom check 2022-03-22 15:57:09 -07:00
page-writeback.c mm/writeback: minor clean up for highmem_dirtyable_memory 2022-03-22 15:57:01 -07:00
page_alloc.c mm: make free_area_init_node aware of memory less nodes 2022-03-22 15:57:10 -07:00
page_counter.c mm/page_counter: remove an incorrect call to propagate_protected_usage() 2022-01-15 16:30:27 +02:00
page_ext.c mm: make some vars and functions static or __init 2022-01-15 16:30:31 +02:00
page_idle.c mm/idle_page_tracking: make PG_idle reusable 2021-09-08 11:50:24 -07:00
page_io.c mm: page_io: fix psi memory pressure error on cold swapins 2022-03-22 15:57:09 -07:00
page_isolation.c Revert "mm/page_isolation: unset migratetype directly for non Buddy page" 2022-02-04 09:25:04 -08:00
page_owner.c lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() 2022-01-22 08:33:37 +02:00
page_poison.c mm: page_poison: print page info when corruption is caught 2021-04-30 11:20:36 -07:00
page_reporting.c mm/page_reporting: allow driver to specify reporting order 2021-06-29 10:53:47 -07:00
page_reporting.h mm/page_reporting: export reporting order as module parameter 2021-06-29 10:53:47 -07:00
page_table_check.c mm/page_table_check: check entries at pmd levels 2022-02-04 09:25:04 -08:00
page_vma_mapped.c mm: device exclusive memory access 2021-07-01 11:06:03 -07:00
pagewalk.c mm: pagewalk: fix walk for hugepage tables 2021-06-29 10:53:49 -07:00
percpu-internal.h mm: memcg/percpu: account extra objcg space to memory cgroups 2022-01-15 16:30:31 +02:00
percpu-km.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu-stats.c percpu: rework memcg accounting 2021-06-05 20:43:15 +00:00
percpu-vm.c percpu: flush tlb in pcpu_reclaim_populated() 2021-07-04 18:30:17 +00:00
percpu.c bitmap patches for 5.17-rc1 2022-01-23 06:20:44 +02:00
pgalloc-track.h mm: fix typos in comments 2021-05-07 00:26:35 -07:00
pgtable-generic.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
process_vm_access.c mm/process_vm_access.c: remove duplicate include 2021-05-05 11:27:27 -07:00
ptdump.c mm: sparsemem: use page table lock to protect kernel pmd operations 2022-03-22 15:57:08 -07:00
readahead.c remove inode_congested() 2022-03-22 15:57:01 -07:00
rmap.c mm/hwpoison: check the subpage, not the head page 2022-03-22 15:57:09 -07:00
rodata_test.c
secretmem.c mm/secretmem: avoid letting secretmem_users drop to zero 2021-10-28 17:18:55 -07:00
shmem.c mm: shmem: fix missing cache flush in shmem_mfill_atomic_pte() 2022-03-22 15:57:04 -07:00
shuffle.c mm: eliminate "expecting prototype" kernel-doc warnings 2021-04-16 16:10:36 -07:00
shuffle.h mm/shuffle: fix section mismatch warning 2021-05-22 15:09:07 -10:00
slab.c mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
slab.h mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
slab_common.c Merge branch 'akpm' (patches from Andrew) 2022-01-15 20:37:06 +02:00
slob.c mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
slub.c mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
sparse-vmemmap.c mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP 2022-03-22 15:57:08 -07:00
sparse.c mm/sparse: make mminit_validate_memmodel_limits() static 2022-03-22 15:57:05 -07:00
swap.c mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu 2022-03-22 15:57:08 -07:00
swap_cgroup.c
swap_slots.c treewide: Add missing includes masked by cgroup -> bpf dependency 2021-12-03 10:58:13 -08:00
swap_state.c mm: swap: get rid of livelock in swapin readahead 2022-03-17 11:02:13 -07:00
swapfile.c userfaultfd: provide unmasked address on page-fault 2022-03-22 15:57:08 -07:00
truncate.c mm: remove cleancache 2022-01-22 08:33:38 +02:00
usercopy.c mm: Convert check_heap_object() to use struct slab 2022-01-06 12:25:51 +01:00
userfaultfd.c mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic() 2022-03-22 15:57:04 -07:00
util.c mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls 2022-03-04 10:00:37 -08:00
vmacache.c
vmalloc.c mm/vmalloc.c: fix "unused function" warning 2022-03-22 15:57:05 -07:00
vmpressure.c mm/vmpressure: fix data-race with memcg->socket_pressure 2021-11-06 13:30:40 -07:00
vmscan.c NUMA balancing: optimize page placement for memory tiering system 2022-03-22 15:57:09 -07:00
vmstat.c mm/vmstat: add event for ksm swapping in copy 2022-03-22 15:57:09 -07:00
workingset.c mm: workingset: replace IRQ-off check with a lockdep assert. 2022-03-22 15:57:08 -07:00
z3fold.c mm/z3fold: add kerneldoc fields for z3fold_pool 2021-07-01 11:06:03 -07:00
zbud.c mm/zbud: add kerneldoc fields for zbud_pool 2021-07-01 11:06:03 -07:00
zpool.c zpool: remove the list of pools_head 2022-01-15 16:30:31 +02:00
zsmalloc.c zsmalloc: replace get_cpu_var with local_lock 2022-01-22 08:33:37 +02:00
zswap.c frontswap: remove support for multiple ops 2022-01-22 08:33:38 +02:00