OpenCloudOS-Kernel

Zach O'Keefe 34488399fa mm/madvise: add file and shmem support to MADV_COLLAPSE

Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).

On success, the backing memory will be a hugepage. For the memory range and process provided, the page tables will synchronously have a huge pmd installed, mapping the THP. Other mappings of the file extent mapped by the memory range may be added to a set of entries that khugepaged will later process and attempt to update their page tables to map the THP by a pmd.

This functionality unlocks two important uses:

(1) Immediately back executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system, which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevent page sharing and demand paging, both of which increase steady state memory footprint. Now, we can have the best of both worlds: peak upfront performance and lower RAM footprints.

(2) userfaultfd-based live migration of virtual machines satisfies UFFD faults by fetching native-sized pages over the network (to avoid the latency of transferring an entire hugepage). However, after guest memory has been fully copied to the new host, MADV_COLLAPSE can be used to immediately increase guest performance.

Since khugepaged is single threaded, this change now introduces the possibility of collapse contexts racing in the file collapse path. There are a few important places to consider:

(1) hpage_collapse_scan_file(), when we xas_pause() and drop RCU. We could have the memory collapsed out from under us, but the next xas_for_each() iteration will correctly pick up the hugepage.
    The hugepage might not be up to date (insofar as copying of small page contents might not have completed - the page still may be locked), but regardless of what small page index we were iterating over, we'll find the hugepage and identify it as a suitably aligned compound page of order HPAGE_PMD_ORDER.

    In the khugepaged path, we locklessly check the value of the pmd, and only add it to the deferred collapse array if we find a pmd mapping a pte table. This is fine, since other values that could have raced in right afterwards denote failure, or that the memory was successfully collapsed, so we don't need further processing.

    In the madvise path, we'll take mmap_lock() in write to serialize against page table updates and will know what to do based on the true value of the pmd: recheck all ptes if we point to a pte table, directly install the pmd if the pmd has been cleared but memory not yet faulted, or do nothing at all if we find a huge pmd.

    It's worth putting emphasis here on how we treat the none pmd. If khugepaged has processed this mm's page tables already, it will have left the pmd cleared (ready for refault by the process). Depending on the VMA flags and sysfs settings, amount of RAM on the machine, and the current load, this could be a relatively common occurrence - and as such is one we'd like to handle successfully in MADV_COLLAPSE. When we see the none pmd in collapse_pte_mapped_thp(), we've locked mmap_lock in write and checked (a) hugepage_vma_check() to see if the backing memory is appropriate still, along with VMA sizing and appropriate hugepage alignment within the file, and (b) we've found a hugepage head of order HPAGE_PMD_ORDER at the offset in the file mapped by our hugepage-aligned virtual address.
    Even though the common case is likely a race with khugepaged, given these checks (regardless of how we got here - we could be operating on a completely different file than originally checked in hpage_collapse_scan_file() for all we know) it should be safe to directly make the pmd a huge pmd pointing to this hugepage.

(2) collapse_file() is mostly serialized on the same file extent by lock sequence:

    |   lock hugepage
    |     lock mapping->i_pages
    |       lock 1st page
    |     unlock mapping->i_pages
    |         <page checks>
    |     lock mapping->i_pages
    |         page_ref_freeze(3)
    |         xas_store(hugepage)
    |     unlock mapping->i_pages
    |         page_ref_unfreeze(1)
    |       unlock 1st page
    V   unlock hugepage

    Once a context (who already has their fresh hugepage locked) locks mapping->i_pages exclusively, it will hold said lock until it locks the first page, and it will hold that lock until after the hugepage has been added to the page cache (and will unlock the hugepage after page table update, though that isn't important here).

    A racing context that loses the race for mapping->i_pages will then lose the race to locking the first page. Here - depending on how far the other racing context has gotten - we might find the new hugepage (in which case we'll exit cleanly when we check PageTransCompound()), or we'll find the "old" 1st small page (in which case we'll exit cleanly when we discover an unexpected refcount of 2 after isolate_lru_page()). This is assuming we are able to successfully lock the page we find - in the shmem path, we could just fail the trylock and exit cleanly anyways.

    The failure path in collapse_file() is similar: once we hold the lock on the 1st small page, we are serialized against other collapse contexts. Before the 1st small page is unlocked, we add it back to the pagecache and unfreeze the refcount appropriately. Contexts who lost the race to the 1st small page will then find the same 1st small page with the correct refcount and will be able to proceed.
[zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()]
  Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com
[shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove check for multi-add in khugepaged_add_pte_mapped_thp()]
  Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2022-10-03 14:03:33 -07:00
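
From userspace, the feature described in the commit message above is reached through madvise(2). The following is a minimal sketch, not code from this commit: hpage_aligned_subrange() and collapse_range() are hypothetical helper names, the 2 MiB hugepage size is an assumption (x86-64 with 4 KiB base pages), and the fallback value 25 for MADV_COLLAPSE is the one used by the generic Linux uapi headers for libc headers that predate it. The helper trims a range to the hugepage-aligned part, since only whole hugepage-sized, hugepage-aligned regions can be collapsed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Older libc headers may not define MADV_COLLAPSE yet. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: PMD-sized hugepages are 2 MiB (x86-64, 4 KiB base pages). */
#define ASSUMED_HPAGE_SIZE ((size_t)2 << 20)

/*
 * Hypothetical helper: shrink [addr, addr + len) to its largest
 * hugepage-aligned subrange. Stores the subrange start in *start and
 * returns its length in bytes (0 if no whole hugepage fits).
 */
static size_t hpage_aligned_subrange(uintptr_t addr, size_t len,
                                     size_t hpage, uintptr_t *start)
{
    assert(hpage != 0 && (hpage & (hpage - 1)) == 0); /* power of two */

    uintptr_t mask = ~(uintptr_t)(hpage - 1);
    uintptr_t lo = (addr + hpage - 1) & mask; /* round start up */
    uintptr_t hi = (addr + len) & mask;       /* round end down */

    *start = lo;
    return hi > lo ? (size_t)(hi - lo) : 0;
}

/*
 * Hypothetical wrapper: request a synchronous collapse of the aligned
 * part of a mapping. Returns 0 on success, otherwise an errno value.
 * Per the commit message, file-backed mappings additionally need a
 * kernel built with CONFIG_READ_ONLY_THP_FOR_FS=y.
 */
static int collapse_range(void *addr, size_t len)
{
    uintptr_t start;
    size_t n = hpage_aligned_subrange((uintptr_t)addr, len,
                                      ASSUMED_HPAGE_SIZE, &start);

    if (n == 0)
        return EINVAL; /* nothing hugepage-sized to collapse */

    return madvise((void *)start, n, MADV_COLLAPSE) == 0 ? 0 : errno;
}
```

A caller would typically mmap() a file or shmem region, touch or populate it, and then call collapse_range() on the mapping. The collapse is best-effort, so a nonzero return is not necessarily fatal and the caller can simply continue with native-sized pages.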
..
bpf	bpf: kmsan: initialize BPF registers with zeroes	2022-10-03 14:03:25 -07:00
cgroup	mm: multi-gen LRU: kill switch	2022-09-26 19:46:10 -07:00
configs	xen: branch for v6.0-rc1b	2022-08-14 09:28:54 -07:00
debug	mm: remove vmacache	2022-09-26 19:46:18 -07:00
dma	dma: kmsan: unpoison DMA mappings	2022-10-03 14:03:21 -07:00
entry	entry: kmsan: introduce kmsan_unpoison_entry_regs()	2022-10-03 14:03:25 -07:00
events	mm/madvise: add file and shmem support to MADV_COLLAPSE	2022-10-03 14:03:33 -07:00
futex	drm for 5.19-rc1	2022-05-25 16:18:27 -07:00
gcov	gcov: Remove compiler version check	2021-12-02 17:25:21 +09:00
irq	irqchip/genirq updates for 5.20:	2022-07-28 12:36:35 +02:00
kcsan	kcsan: test: Add a .kunitconfig to run KCSAN tests	2022-07-22 09:22:59 -06:00
livepatch	Livepatching changes for 5.19	2022-06-02 08:55:01 -07:00
locking	kmsan: disable instrumentation of unsupported common kernel code	2022-10-03 14:03:20 -07:00
module	module: kunit: Load .kunit_test_suites section when CONFIG_KUNIT=m	2022-08-15 13:51:07 -06:00
power	Char / Misc driver changes for 6.0-rc1	2022-08-04 11:05:48 -07:00
printk	printk: do not wait for consoles when suspended	2022-07-15 10:52:11 +02:00
rcu	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
sched	sched: use maple tree iterator to walk VMAs	2022-09-26 19:46:22 -07:00
time	time: Correct the prototype of ns_to_kernel_old_timeval and ns_to_timespec64	2022-08-09 20:02:13 +02:00
trace	ftrace: Fix build warning for ops_references_rec() not used	2022-08-22 09:41:12 -04:00
.gitignore	…
Kconfig.freezer	…
Kconfig.hz	…
Kconfig.locks	…
Kconfig.preempt	Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"	2022-03-31 10:36:55 +02:00
Makefile	kmsan: disable instrumentation of unsupported common kernel code	2022-10-03 14:03:20 -07:00
acct.c	acct: use VMA iterator instead of linked list	2022-09-26 19:46:22 -07:00
async.c	Revert "module, async: async_synchronize_full() on module init iff async is used"	2022-02-03 11:20:34 -08:00
audit.c	audit: make is_audit_feature_set() static	2022-06-13 14:08:57 -04:00
audit.h	audit: log AUDIT_TIME_* records only from rules	2022-02-22 13:51:40 -05:00
audit_fsnotify.c	audit: fix potential double free on error path from fsnotify_add_inode_mark	2022-08-22 18:50:06 -04:00
audit_tree.c	audit: use fsnotify group lock helpers	2022-04-25 14:37:28 +02:00
audit_watch.c	fsnotify: pass flags argument to fsnotify_alloc_group()	2022-04-25 14:37:12 +02:00
auditfilter.c	audit/stable-5.17 PR 20220110	2022-01-11 13:08:21 -08:00
auditsc.c	audit: move audit_return_fixup before the filters	2022-08-25 17:25:08 -04:00
backtracetest.c	…
bounds.c	mm: multi-gen LRU: minimal implementation	2022-09-26 19:46:09 -07:00
capability.c	xfs: don't generate selinux audit messages for capability testing	2022-03-09 10:32:06 -08:00
cfi.c	context_tracking: Take IRQ eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
compat.c	…
configs.c	…
context_tracking.c	MAINTAINERS: Add Paul as context tracking maintainer	2022-07-05 13:33:00 -07:00
cpu.c	Intel Trust Domain Extensions	2022-05-23 17:51:12 -07:00
cpu_pm.c	context_tracking: Take IRQ eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
crash_core.c	vmcoreinfo: add kallsyms_num_syms symbol	2022-08-28 14:02:44 -07:00
crash_dump.c	…
cred.c	x86: Mark __invalid_creds() __noreturn	2022-03-15 10:32:44 +01:00
delayacct.c	delayacct: support re-entrance detection of thrashing accounting	2022-09-26 19:46:07 -07:00
dma.c	…
exec_domain.c	…
exit.c	kmsan: handle task creation and exiting	2022-10-03 14:03:20 -07:00
extable.c	context_tracking: Take NMI eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
fail_function.c	…
fork.c	kmsan: handle task creation and exiting	2022-10-03 14:03:20 -07:00
freezer.c	…
gen_kheaders.sh	kheaders: Have cpio unconditionally replace files	2022-05-08 03:16:59 +09:00
groups.c	security: Add LSM hook to setgroups() syscall	2022-07-15 18:21:49 +00:00
hung_task.c	kernel/hung_task: fix address space of proc_dohung_task_timeout_secs	2022-07-29 18:12:35 -07:00
iomem.c	…
irq_work.c	irq_work: use kasan_record_aux_stack_noalloc() record callstack	2022-04-15 14:49:55 -07:00
jump_label.c	jump_label: make initial NOP patching the special case	2022-06-24 09:48:55 +02:00
kallsyms.c	Updates to various subsystems which I help look after. lib, ocfs2,	2022-08-07 10:03:24 -07:00
kallsyms_internal.h	kallsyms: move declarations to internal header	2022-07-17 17:31:39 -07:00
kcmp.c	…
kcov.c	kcov: kmsan: unpoison area->list in kcov_remote_area_put()	2022-10-03 14:03:23 -07:00
kexec.c	…
kexec_core.c	kexec: drop weak attribute from functions	2022-07-15 12:21:16 -04:00
kexec_elf.c	…
kexec_file.c	Updates to various subsystems which I help look after. lib, ocfs2,	2022-08-07 10:03:24 -07:00
kexec_internal.h	…
kheaders.c	…
kmod.c	…
kprobes.c	kprobes: don't call disarm_kprobe() for disabled kprobes	2022-08-20 15:17:46 -07:00
ksysfs.c	kernel/ksysfs.c: use helper macro __ATTR_RW	2022-03-23 19:00:33 -07:00
kthread.c	kthread: make it clear that kthread_create_on_node() might be terminated by any fatal signal	2022-06-16 19:11:30 -07:00
latencytop.c	latencytop: move sysctl to its own file	2022-04-21 11:40:59 -07:00
module_signature.c	…
notifier.c	notifier: Add blocking/atomic_notifier_chain_register_unique_prio()	2022-05-19 19:30:30 +02:00
nsproxy.c	fs/exec: allow to unshare a time namespace on vfork+exec	2022-06-15 07:58:04 -07:00
padata.c	padata: replace cpumask_weight with cpumask_empty in padata.c	2022-01-31 11:21:46 +11:00
panic.c	linux-kselftest-kunit-5.20-rc1	2022-08-02 19:34:45 -07:00
params.c	kobject: remove kset from struct kset_uevent_ops callbacks	2021-12-28 11:26:18 +01:00
pid.c	pid: add pidfd_get_task() helper	2021-10-14 13:29:18 +02:00
pid_namespace.c	kernel: pid_namespace: use NULL instead of using plain integer as pointer	2022-04-29 14:38:00 -07:00
profile.c	profile: setup_profiling_timer() is moslty not implemented	2022-07-29 18:12:36 -07:00
ptrace.c	ptrace: fix clearing of JOBCTL_TRACED in ptrace_unfreeze_traced()	2022-07-09 11:06:19 -07:00
range.c	…
reboot.c	Merge branch 'rework/kthreads' into for-linus	2022-06-23 19:11:28 +02:00
regset.c	…
relay.c	relay: remove redundant assignment to pointer buf	2022-05-12 20:38:37 -07:00
resource.c	resource: Introduce alloc_free_mem_region()	2022-07-21 17:19:25 -07:00
resource_kunit.c	…
rseq.c	rseq: Kill process when unknown flags are encountered in ABI structures	2022-08-01 15:21:42 +02:00
scftorture.c	scftorture: Fix distribution of short handler delays	2022-04-11 17:07:29 -07:00
scs.c	kasan, vmalloc: only tag normal vmalloc allocations	2022-03-24 19:06:48 -07:00
seccomp.c	seccomp: Add wait_killable semantic to seccomp user notifier	2022-05-03 14:11:58 -07:00
signal.c	signal handling: don't use BUG_ON() for debugging	2022-07-07 09:53:43 -07:00
smp.c	locking/csd_lock: Change csdlock_debug from early_param to __setup	2022-07-19 11:40:00 -07:00
smpboot.c	cpu/hotplug: Allow the CPU in CPU_UP_PREPARE state to be brought up again.	2022-04-12 14:13:01 +02:00
smpboot.h	…
softirq.c	context_tracking: Take IRQ eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
stackleak.c	stackleak: add on/off stack variants	2022-05-08 01:33:09 -07:00
stacktrace.c	uaccess: remove CONFIG_SET_FS	2022-02-25 09:36:06 +01:00
static_call.c	static_call: Don't make __static_call_return0 static	2022-04-05 09:59:38 +02:00
static_call_inline.c	static_call: Don't make __static_call_return0 static	2022-04-05 09:59:38 +02:00
stop_machine.c	Scheduler changes in this cycle were:	2022-05-24 11:11:13 -07:00
sys.c	arm64/sme: Implement vector length configuration prctl()s	2022-04-22 18:50:54 +01:00
sys_ni.c	kernel/sys_ni: add compat entry for fadvise64_64	2022-08-20 15:17:45 -07:00
sysctl-test.c	…
sysctl.c	memory tiering: rate limit NUMA migration throughput	2022-09-11 20:25:54 -07:00
task_work.c	task_work: allow TWA_SIGNAL without a rescheduling IPI	2022-04-30 08:39:32 -06:00
taskstats.c	kernel: make taskstats available from all net namespaces	2022-04-29 14:38:03 -07:00
torture.c	torture: Wake up kthreads after storing task_struct pointer	2022-02-01 17:24:39 -08:00
tracepoint.c	…
tsacct.c	taskstats: version 12 with thread group and exe info	2022-04-29 14:38:03 -07:00
ucount.c	ucounts: Handle wrapping in is_ucounts_overlimit	2022-02-17 09:11:57 -06:00
uid16.c	…
uid16.h	…
umh.c	kthread: Don't allocate kthread_struct for init and umh	2022-05-06 14:49:44 -05:00
up.c	…
user-return-notifier.c	…
user.c	…
user_namespace.c	ucounts: Fix systemd LimitNPROC with private users regression	2022-02-25 10:40:14 -06:00
usermode_driver.c	blob_to_mnt(): kern_unmount() is needed to undo kern_mount()	2022-05-19 23:25:47 -04:00
utsname.c	…
utsname_sysctl.c	…
watch_queue.c	This was a moderately busy cycle for documentation, but nothing all that	2022-08-02 19:24:24 -07:00
watchdog.c	powerpc updates for 6.0	2022-08-06 16:38:17 -07:00
watchdog_hld.c	Revert "printk: add functions to prefer direct printing"	2022-06-23 18:41:40 +02:00
workqueue.c	drm for 5.20/6.0	2022-08-03 19:52:08 -07:00
workqueue_internal.h	…