FOLL_FORCE is really only for ptrace access. As we unpin the pinned pages
using unpin_user_pages_dirty_lock(true), the assumption is that all these
pages are writable.
FOLL_FORCE in this case seems to be a legacy leftover. Let's just remove
it.
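For illustration, the change boils down to dropping the flag at the pin
site; a rough sketch of the exynos g2d call site (details simplified, not
the literal diff):

  ret = pin_user_pages_fast(start, npages,
                            FOLL_WRITE, /* was: FOLL_FORCE | FOLL_WRITE */
                            g2d_userptr->pages);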
Link: https://lkml.kernel.org/r/20221116102659.70287-18-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
FOLL_FORCE is really only for ptrace access. According to commit
707947247e ("media: videobuf2-vmalloc: get_userptr: buffers are always
writable"), get_vaddr_frames() currently pins all pages writable as a
workaround for issues with read-only buffers.
FOLL_FORCE, however, seems to be a legacy leftover as it predates
commit 707947247e ("media: videobuf2-vmalloc: get_userptr: buffers are
always writable"). Let's just remove it.
Once the read-only buffer issue has been resolved, FOLL_WRITE could
again be set depending on the DMA direction.
Link: https://lkml.kernel.org/r/20221116102659.70287-17-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Acked-by: Tomasz Figa <tfiga@chromium.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
FOLL_FORCE is really only for ptrace access. R/O pinning a page is
supposed to fail if the VMA misses proper access permissions (no VM_READ).
Let's just remove FOLL_FORCE usage here; there would have to be a pretty
good reason to allow arbitrary drivers to R/O pin pages in a PROT_NONE
VMA. Most probably, FOLL_FORCE usage is just some legacy leftover.
Link: https://lkml.kernel.org/r/20221116102659.70287-16-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Andy Walls <awalls@md.metrocast.net>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
GUP now supports reliable R/O long-term pinning in COW mappings, such
that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
far in one corner case (DAXFS file with holes), which can be ignored
because GUP does not support long-term pinning in fsdax (see
check_vma_flags()).
Commit cd5297b085 ("drm/etnaviv: Use FOLL_FORCE for userptr")
documents that FOLL_FORCE | FOLL_WRITE was really only used for reliable
R/O pinning.
Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
using FOLL_FORCE, which is really only for ptrace access.
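For illustration, the resulting etnaviv pinning logic roughly becomes
(simplified sketch, not the literal diff):

  unsigned int flags = FOLL_LONGTERM;

  /* write intent now follows the actual R/O setting, no FOLL_FORCE */
  if (!userptr->ro)
          flags |= FOLL_WRITE;

  ret = pin_user_pages_fast(userptr->ptr, npages, flags, pages);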
Link: https://lkml.kernel.org/r/20221116102659.70287-15-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Russell King <linux+etnaviv@armlinux.org.uk>
Cc: Christian Gmeiner <christian.gmeiner@gmail.com>
Cc: David Airlie <airlied@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
GUP now supports reliable R/O long-term pinning in COW mappings, such
that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
far in one corner case (DAXFS file with holes), which can be ignored
because GUP does not support long-term pinning in fsdax (see
check_vma_flags()).
Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
using FOLL_FORCE, which is really only for ptrace access.
Link: https://lkml.kernel.org/r/20221116102659.70287-14-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
GUP now supports reliable R/O long-term pinning in COW mappings, such
that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
far in one corner case (DAXFS file with holes), which can be ignored
because GUP does not support long-term pinning in fsdax (see
check_vma_flags()).
Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
using FOLL_FORCE, which is really only for ptrace access.
Link: https://lkml.kernel.org/r/20221116102659.70287-13-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Bernard Metzler <bmt@zurich.ibm.com>
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
GUP now supports reliable R/O long-term pinning in COW mappings, such
that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
far in one corner case (DAXFS file with holes), which can be ignored
because GUP does not support long-term pinning in fsdax (see
check_vma_flags()).
Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
using FOLL_FORCE, which is really only for ptrace access.
Link: https://lkml.kernel.org/r/20221116102659.70287-12-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christian Benvenuti <benve@cisco.com>
Cc: Nelson Escobar <neescoba@cisco.com>
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
GUP now supports reliable R/O long-term pinning in COW mappings, such
that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
far in one corner case (DAXFS file with holes), which can be ignored
because GUP does not support long-term pinning in fsdax (see
check_vma_flags()).
Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
using FOLL_FORCE, which is really only for ptrace access.
Link: https://lkml.kernel.org/r/20221116102659.70287-11-david@redhat.com
Tested-by: Leon Romanovsky <leonro@nvidia.com> [over mlx4 and mlx5]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We already support reliable R/O pinning of anonymous memory. However,
assume we end up pinning (R/O long-term) a pagecache page or the shared
zeropage inside a writable private ("COW") mapping. The next write access
will trigger a write-fault and replace the pinned page by an exclusive
anonymous page in the process page tables to break COW: the pinned page no
longer corresponds to the page mapped into the process' page table.
Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a
COW mapping, let's properly break COW first before R/O long-term
pinning something that's not an exclusive anon page inside a COW
mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page
instead that can get pinned safely.
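The GUP-side decision for non-anonymous pages roughly looks like this
(simplified from the gup_must_unshare() logic; the anonymous-page handling
is condensed):

  static inline bool gup_must_unshare(struct vm_area_struct *vma,
                                      unsigned int flags, struct page *page)
  {
          /* Only R/O pins can become inconsistent with the page tables. */
          if ((flags & (FOLL_WRITE | FOLL_PIN)) != FOLL_PIN)
                  return false;

          if (!PageAnon(page)) {
                  /* Short-term R/O pins have snapshot semantics. */
                  if (!(flags & FOLL_LONGTERM))
                          return false;
                  /* GUP-fast has no VMA: fall back to the slow path. */
                  if (!vma)
                          return true;
                  /* Only COW (private writable) mappings must break COW. */
                  return is_cow_mapping(vma->vm_flags);
          }

          /* anonymous pages: unshare unless mapped exclusively (condensed) */
          return !PageAnonExclusive(page);
  }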
With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable
R/O long-term pinning in COW mappings.
With this change, the new R/O long-term pinning tests for non-anonymous
memory succeed:
# [RUN] R/O longterm GUP pin ... with shared zeropage
ok 151 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd
ok 152 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with tmpfile
ok 153 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with huge zeropage
ok 154 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
ok 155 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
ok 156 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with shared zeropage
ok 157 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd
ok 158 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with tmpfile
ok 159 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with huge zeropage
ok 160 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
ok 161 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
ok 162 Longterm R/O pin is reliable
Note 1: We don't care about short-term R/O pins, because they have
snapshot semantics: they are not supposed to observe modifications that
happen after pinning.
As one example, assume we start direct I/O to read from a page and store
page content into a file: modifications to page content after starting
direct I/O are not guaranteed to end up in the file. So even if we'd pin
the shared zeropage, the end result would be as expected -- getting zeroes
stored to the file.
Note 2: For shared mappings we'll now always fall back to the slow path to
look up the VMA when R/O long-term pinning. While that's the necessary
price we have to pay right now, it's actually not that bad in practice:
most FOLL_LONGTERM users already specify FOLL_WRITE, for example, along
with FOLL_FORCE because they tried dealing with COW mappings correctly ...
Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE,
such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd
populate exclusive anon pages that we can pin. There was a concern that
this could affect the memlock limit of existing setups.
For example, a VM running with VFIO could run into the memlock limit and
fail to run. However, we essentially had the same behavior already in
commit 17839856fd ("gup: document and work around "COW can break either
way" issue") which got merged into some enterprise distros, and there were
not any such complaints. So most probably, we're fine.
Link: https://lkml.kernel.org/r/20221116102659.70287-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a
COW (i.e., private writable) mapping and adjust the documentation
accordingly.
FAULT_FLAG_UNSHARE will now also break COW when encountering the shared
zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by
properly replacing the mapped page/pfn by a private copy (an exclusive
anonymous page).
Note that only do_wp_page() needs care: hugetlb_wp() already handles
FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handle it
correctly, for example, by splitting the huge zeropage on FAULT_FLAG_UNSHARE
such that we can handle FAULT_FLAG_UNSHARE on the PTE level.
This change is a requirement for reliable long-term R/O pinning in
COW mappings.
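The resulting do_wp_page() flow is conceptually as follows (heavily
simplified; wp_can_reuse() is a hypothetical stand-in for the actual
anonymous-page reuse checks):

  static vm_fault_t do_wp_page(struct vm_fault *vmf)
  {
          struct vm_area_struct *vma = vmf->vma;

          /* Shared mappings never need unsharing (write-notify etc.). */
          if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE))
                  return wp_page_shared(vmf);

          /*
           * Private mapping: try to reuse an exclusive anonymous page;
           * otherwise replace whatever is mapped (shared zeropage,
           * pagecache page, PFN, ...) by an exclusive anonymous copy.
           */
          if (vmf->page && PageAnon(vmf->page) && wp_can_reuse(vmf))
                  return wp_page_reuse(vmf);
          return wp_page_copy(vmf);
  }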
Link: https://lkml.kernel.org/r/20221116102659.70287-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If we already have a PMD/PUD mapped write-protected in a private mapping
and we want to break COW either due to FAULT_FLAG_WRITE or
FAULT_FLAG_UNSHARE, there is no need to inform the file system, just like
on the PTE path.
Let's just split (->zap) + fallback in that case.
This is a preparation for more generic FAULT_FLAG_UNSHARE support in
COW mappings.
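Approximately, on the PMD level (the PUD variant is analogous; treat this
as a sketch rather than the literal diff):

  static vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
  {
          const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
          vm_fault_t ret;

          if (vma_is_anonymous(vmf->vma)) {
                  if (likely(!unshare) &&
                      userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
                          return handle_userfault(vmf, VM_UFFD_WP);
                  return do_huge_pmd_wp_page(vmf);
          }

          if (vmf->vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
                  if (vmf->vma->vm_ops->huge_fault) {
                          ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
                          if (!(ret & VM_FAULT_FALLBACK))
                                  return ret;
                  }
          }

          /* COW or write-notify is handled on the PTE level: split it. */
          __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
          return VM_FAULT_FALLBACK;
  }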
Link: https://lkml.kernel.org/r/20221116102659.70287-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We want to extend FAULT_FLAG_UNSHARE support to anything mapped into a
COW mapping (pagecache page, zeropage, PFN, ...), not just anonymous pages.
Let's prepare for that by handling shared mappings first such that we can
handle private mappings last.
While at it, use folio-based functions instead of page-based functions
where we touch the code either way.
Link: https://lkml.kernel.org/r/20221116102659.70287-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to
care in all other handlers and don't get "surprises" if we forget to do
so.
Write faults without VM_MAYWRITE don't make any sense, and our
maybe_mkwrite() logic could have hidden such abuse for now.
Write faults without VM_WRITE on something that is not a COW mapping are
similarly broken, and e.g., do_wp_page() could end up placing an
anonymous page into a shared mapping, which would be bad.
This is a preparation for reliable R/O long-term pinning of pages in
private mappings, whereby we want to make sure that we will never break
COW in a read-only private mapping.
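In sanitize_fault_flags(), approximately:

  if (*flags & FAULT_FLAG_WRITE) {
          /* Write faults without VM_MAYWRITE don't make any sense. */
          if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)))
                  return VM_FAULT_SIGSEGV;
          /* Without VM_WRITE, only COW mappings may see write faults. */
          if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) &&
                           !is_cow_mapping(vma->vm_flags)))
                  return VM_FAULT_SIGSEGV;
  }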
Link: https://lkml.kernel.org/r/20221116102659.70287-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For now, FAULT_FLAG_UNSHARE only applies to anonymous pages, which
implies a COW mapping. Let's hide FAULT_FLAG_UNSHARE early if we're not
dealing with a COW mapping, such that we treat it like a read fault as
documented and don't have to worry about the flag throughout all fault
handlers.
While at it, centralize the check for mutual exclusion of
FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE and just drop the check that
either flag is set in the WP handler.
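In sanitize_fault_flags(), approximately:

  if (*flags & FAULT_FLAG_UNSHARE) {
          /* FAULT_FLAG_UNSHARE and FAULT_FLAG_WRITE are incompatible. */
          if (WARN_ON_ONCE(*flags & FAULT_FLAG_WRITE))
                  return VM_FAULT_SIGSEGV;
          /* Outside of COW mappings, unsharing degrades to a read fault. */
          if (!is_cow_mapping(vma->vm_flags))
                  *flags &= ~FAULT_FLAG_UNSHARE;
  }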
Link: https://lkml.kernel.org/r/20221116102659.70287-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's test whether R/O long-term pinning is reliable for non-anonymous
memory: when R/O long-term pinning a page, the expectation is that we
break COW early before pinning, such that actual write access via the
page tables won't break COW later and end up replacing the R/O-pinned
page in the page table.
Consequently, R/O long-term pinning in private mappings would only target
exclusive anonymous pages.
For now, all tests fail:
# [RUN] R/O longterm GUP pin ... with shared zeropage
not ok 151 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd
not ok 152 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with tmpfile
not ok 153 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with huge zeropage
not ok 154 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
not ok 155 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
not ok 156 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with shared zeropage
not ok 157 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd
not ok 158 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with tmpfile
not ok 159 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with huge zeropage
not ok 160 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
not ok 161 Longterm R/O pin is reliable
# [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
not ok 162 Longterm R/O pin is reliable
Link: https://lkml.kernel.org/r/20221116102659.70287-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's add basic tests for COW with non-anonymous pages in private
mappings: write access should properly trigger COW and result in the
private changes not being visible through other page mappings.
Especially, add tests for:
* Zeropage
* Huge zeropage
* Ordinary pagecache pages via memfd and tmpfile()
* Hugetlb pages via memfd
Fortunately, all tests pass.
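The per-backing-type check is conceptually the following (heavily
simplified; the real tests compare whole pages and also exercise fork()):

  /* fd from memfd_create(), tmpfile(), ... depending on the test */
  char *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
  char *ref = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);

  map[0] ^= 0xff;                 /* write triggers COW */
  assert(ref[0] != map[0]);       /* private change must not leak */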
Link: https://lkml.kernel.org/r/20221116102659.70287-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O
long-term pinning)".
So far, we have not supported reliable R/O long-term pinning in COW
mappings. That means that if we trigger R/O long-term pinning in a
MAP_PRIVATE mapping, we can end up pinning the (R/O-mapped) shared
zeropage or a pagecache page.
The next write access would trigger a write fault and replace the pinned
page by an exclusive anonymous page in the process page table; whatever
the process would write to that private page copy would not be visible by
the owner of the previous page pin: for example, RDMA could read stale
data. The end result is essentially an unexpected and hard-to-debug
memory corruption.
Some drivers tried working around that limitation by using
"FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM" for R/O long-term pinning for now.
FOLL_WRITE would trigger a write fault, if required, and break COW before
pinning the page. FOLL_FORCE is required because the VMA might lack write
permissions, and drivers wanted to make that work as well, just like one
would expect (no write access, but still triggering a write fault to
break COW).
However, that is not a practical solution, because
(1) Drivers that don't stick to that undocumented and debatable pattern
would still run into that issue. For example, VFIO only uses
FOLL_LONGTERM for R/O long-term pinning.
(2) Using FOLL_WRITE just to work around a COW mapping + page pinning
limitation is unintuitive. FOLL_WRITE would, for example, mark the
page softdirty or trigger uffd-wp, even though there actually isn't
going to be any write access.
(3) The purpose of FOLL_FORCE is debug access, not working around lacking
VMA permissions in arbitrary drivers.
So instead, make R/O long-term pinning work as expected, by breaking COW
in a COW mapping early, such that we can remove any FOLL_FORCE usage from
drivers and make FOLL_FORCE ptrace-specific (renaming it to FOLL_PTRACE).
More details in patch #8.
This patch (of 19):
Originally, the plan was to have separate tests for testing COW of
non-anonymous (e.g., shared zeropage) pages.
It turns out that we'd need a lot of similar functionality and that there
isn't a really good reason to separate it. So let's prepare for non-anon
tests by renaming to "cow".
Link: https://lkml.kernel.org/r/20221116102659.70287-1-david@redhat.com
Link: https://lkml.kernel.org/r/20221116102659.70287-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Walls <awalls@md.metrocast.net>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bernard Metzler <bmt@zurich.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Benvenuti <benve@cisco.com>
Cc: Christian Gmeiner <christian.gmeiner@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nelson Escobar <neescoba@cisco.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux+etnaviv@armlinux.org.uk>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 6a108a14fa ("kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT")
introduces CONFIG_EXPERT to carry the previous intent of CONFIG_EMBEDDED
and just gives that intent a much better name. That was clearly a good and
long overdue renaming, and it has helped manage the kernel build
configuration in the last decade.
However, rather than bravely and radically just deleting CONFIG_EMBEDDED,
this commit gives CONFIG_EMBEDDED a new intended semantics, but keeps it
open for future contributors to implement that intended semantics:
A new CONFIG_EMBEDDED option is added that automatically selects
CONFIG_EXPERT when enabled and can be used in the future to isolate
options that should only be considered for embedded systems (RISC
architectures, SLOB, etc).
Since then, CONFIG_EMBEDDED has implicitly had two purposes:
- It can make even more options visible beyond what CONFIG_EXPERT makes
visible. In other words, it may introduce another level of enabling the
visibility of configuration options: always visible, visible with
CONFIG_EXPERT and visible with CONFIG_EMBEDDED.
- Set certain default values of some configurations differently,
following the assumption that configuring a kernel build for an
embedded system generally starts with a different set of default values
compared to kernel builds for all other kinds of systems.
Considering the second purpose, note that even arguing that a kernel build
for an embedded system would choose some values differently is tricky: the
set of embedded systems running Linux kernels is quite diverse. Many
embedded systems have powerful CPUs, and it is not clear that all embedded
systems optimize towards one specific aspect, e.g., a smaller kernel image
size. So it is unclear whether starting with "one set of default
configuration" induced by CONFIG_EMBEDDED is a good offer for developers
configuring their kernels.
Also, the differences in needed user-space features between embedded and
non-embedded systems are probably difficult or even impossible to name in
some generic way.
So it is not surprising that in the last decade hardly anyone has
contributed changes to make something default differently in case of
CONFIG_EMBEDDED=y.
Currently, in v6.0-rc4, SECRETMEM is the only config switched off if
CONFIG_EMBEDDED=y.
As long as that is actually the only option that currently is selected or
deselected, it is better to just make SECRETMEM configurable at build time
by experts using menuconfig instead.
Make SECRETMEM configurable when EXPERT is set and otherwise default to
yes. Further, SECRETMEM needs ARCH_HAS_SET_DIRECT_MAP.
This allows us to remove CONFIG_EMBEDDED in the near future.
Link: https://lkml.kernel.org/r/20221116131922.25533-1-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This restriction was created because FOLL_LONGTERM used to scan the vma
list, so it could not tolerate becoming unlocked. That was fixed in
commit 52650c8b46 ("mm/gup: remove the vma allocation from
gup_longterm_locked()") and the restriction on !vma was removed.
However, the locked restriction remained, even though it isn't necessary
anymore.
Adjust __gup_longterm_locked() so it can handle the mmap_read_lock()
becoming unlocked while it is looping for migration. Migration does not
require the mmap_read_lock because it only handles struct pages. If we
had to unlock then ensure the whole thing returns unlocked.
Remove __get_user_pages_remote() and __gup_longterm_unlocked(). These
cases can now just directly call other functions.
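The resulting shape of __gup_longterm_locked() is roughly (simplified
sketch):

  static long __gup_longterm_locked(struct mm_struct *mm, unsigned long start,
                                    unsigned long nr_pages, struct page **pages,
                                    struct vm_area_struct **vmas, int *locked,
                                    unsigned int gup_flags)
  {
          unsigned int flags;
          long rc, nr_pinned_pages;

          if (!(gup_flags & FOLL_LONGTERM))
                  return __get_user_pages_locked(mm, start, nr_pages, pages,
                                                 vmas, locked, gup_flags);

          flags = memalloc_pin_save();
          do {
                  nr_pinned_pages = __get_user_pages_locked(mm, start,
                                                            nr_pages, pages,
                                                            vmas, locked,
                                                            gup_flags);
                  if (nr_pinned_pages <= 0) {
                          rc = nr_pinned_pages;
                          break;
                  }
                  /* may drop and re-take the mmap lock internally now */
                  rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
          } while (rc == -EAGAIN);
          memalloc_pin_restore(flags);
          return rc ? rc : nr_pinned_pages;
  }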
Link: https://lkml.kernel.org/r/0-v1-b9ae39aa8884+14dbb-gup_longterm_locked_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When testing overflow and overread, there is no need to keep unnecessary
compilation warnings; we should simply ignore them.
The motivation for this patch is to eliminate the compilation warning:
maybe one day we will compile the kernel with "-Werror -Wall", at which
point this compilation warning will turn into a compilation error. We
should fix this error in advance.
How to reproduce the problem (with gcc-11.3.1):
$ make -C tools/testing/selftests/
...
warning: `write' reading 4294967295 bytes from a region of size 1
[-Wstringop-overread]
warning: `read' writing 4294967295 bytes into a region of size 25
overflows the destination [-Wstringop-overflow=]
"-Wno-stringop-overread" is supported at least in gcc-11.1.0.
Link: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d14c547abd484d3540b692bb8048c4a6efe92c8b
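One way to silence such an intentional overread locally is a GCC
diagnostic pragma around the offending call (sketch; the call shown is
illustrative, not the literal test code):

  #pragma GCC diagnostic push
  #pragma GCC diagnostic ignored "-Wstringop-overread"
          /* bogus count on purpose: testing the kernel's overread handling */
          ret = write(fd, buf, 0xffffffff);
  #pragma GCC diagnostic pop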
Link: https://lkml.kernel.org/r/tencent_51C4ACA8CB3895C2D7F35178440283602107@qq.com
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The 'ei' pointer does not need a type cast.
Link: https://lkml.kernel.org/r/20221107015659.3221-1-zeming@nfschina.com
Signed-off-by: Li zeming <zeming@nfschina.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, drop_caches reclaims node-by-node, looping on each node until
reclaim cannot make progress. This can however leave quite some slab
entries (such as filesystem inodes) unreclaimed if objects on, say, node 1
keep objects on node 0 pinned. So move the "loop until no progress" loop
over the node-by-node iteration to retry reclaim also on other nodes if
reclaim on some nodes made progress. This fixes a problem where
drop_caches was not reclaiming lots of otherwise perfectly reclaimable
inodes.
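The resulting structure is roughly (simplified):

  void drop_slab(void)
  {
          int nid;
          int shift = 0;
          unsigned long freed;

          do {
                  freed = 0;
                  for_each_online_node(nid)
                          freed += drop_slab_node(nid);
                  /* retry all nodes as long as we make enough progress */
          } while ((freed >> shift++) > 1);
  }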
Link: https://lkml.kernel.org/r/20221115123255.12559-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: You Zhou <you.zhou@intel.com>
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 9a10064f56 ("mm: add a field to store names for private
anonymous memory"), a name for private anonymous memory, but not shared
anonymous memory, can be set. However, naming shared anonymous memory is
just as useful for tracking purposes.
Extend the functionality to be able to set names for shared anon.
There are two ways to create anonymous shared memory, using memfd or
directly via mmap():
1. fd = memfd_create(...)
mem = mmap(..., MAP_SHARED, fd, ...)
2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...)
In both cases the anonymous shared memory is created the same way by
mapping an unlinked file on tmpfs.
The memfd way allows giving a name to anonymous shared memory, but is not
useful when parts of the shared memory require distinct names.
Example use case: The VMM maps VM memory as anonymous shared memory (not
private because VMM is sandboxed and drivers are running in their own
processes). However, the VM tells back to the VMM how parts of the memory
are actually used by the guest, how each of the segments should be backed
(i.e. 4K pages, 2M pages), and some other information about the segments.
The naming allows us to monitor the effective memory footprint for each
of these segments from the host without looking inside the guest.
Sample usage and output:
/* Create shared anonymous segment */
anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
/* Name the segment: "MY-NAME" */
rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
anon_shmem, SIZE, "MY-NAME");
cat /proc/<pid>/maps (and smaps):
7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME]
If the segment is not named, the output is:
7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted)
Link: https://lkml.kernel.org/r/20221115020602.804224-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Colin Cross <ccross@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: xu xin <cgel.zte@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The shrinker.h header depends on a user including other headers before it
for types used by shrinker.h. Fix this by including the appropriate
headers in shrinker.h.
./include/linux/shrinker.h:13:9: error: unknown type name `gfp_t'
13 | gfp_t gfp_mask;
| ^~~~~
./include/linux/shrinker.h:71:26: error: field `list' has incomplete type
71 | struct list_head list;
| ^~~~
./include/linux/shrinker.h:82:9: error: unknown type name `atomic_long_t'
82 | atomic_long_t *nr_deferred;
|
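The fix amounts to making the header self-contained, along the lines of:

  /* include/linux/shrinker.h */
  #include <linux/atomic.h>       /* atomic_long_t */
  #include <linux/types.h>        /* gfp_t, struct list_head */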
Link: https://lkml.kernel.org/r/20221114235949.201749-1-tjmercier@google.com
Fixes: 83aeeada7c ("vmscan: use atomic-long for shrinker batching")
Fixes: b0d40c92ad ("superblock: introduce per-sb cache shrinker infrastructure")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The common hugetlb unmap routine __unmap_hugepage_range performs mmu
notification calls. However, in the case where __unmap_hugepage_range is
called via __unmap_hugepage_range_final, mmu notification calls are
performed earlier in other calling routines.
Remove mmu notification calls from __unmap_hugepage_range. Add
notification calls to the only other caller: unmap_hugepage_range.
unmap_hugepage_range is called for truncation and hole punch, so change
notification type from UNMAP to CLEAR as this is more appropriate.
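unmap_hugepage_range() then performs its own notification; sketched
(argument details simplified):

  void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
                            unsigned long end, struct page *ref_page,
                            zap_flags_t zap_flags)
  {
          struct mmu_notifier_range range;
          struct mmu_gather tlb;

          mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
                                  vma->vm_mm, start, end);
          adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
          mmu_notifier_invalidate_range_start(&range);
          tlb_gather_mmu(&tlb, vma->vm_mm);

          __unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);

          mmu_notifier_invalidate_range_end(&range);
          tlb_finish_mmu(&tlb);
  }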
Link: https://lkml.kernel.org/r/20221114235507.294320-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Cc: Wei Chen <harperchen1110@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's add one CONFIG_DEBUG_VM sanity check on the write bit wherever we
get the chance while walking through the pgtables. It can surface the
error earlier, even before the app notices the data was corrupted on the
snapshot. It also helps us identify that this is a wrong pgtable setup,
which is hopefully great information to have for debugging too.
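On x86, the check roughly looks like this (sketch, assuming the check
lives in the pte_uffd_wp() accessor):

  static inline bool pte_uffd_wp(pte_t pte)
  {
          bool wp = pte_flags(pte) & _PAGE_UFFD_WP;

  #ifdef CONFIG_DEBUG_VM
          /*
           * Having the write bit set on a uffd-wp protected present pte
           * is fatal: the uffd-wp bit would be ignored and writes would
           * just go through.
           */
          WARN_ON_ONCE(wp && pte_write(pte));
  #endif
          return wp;
  }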
Link: https://lkml.kernel.org/r/20221114000447.1681003-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MADV_FREE pages have been moved into the LRU_INACTIVE_FILE list by commit
f7ad2a6cb9 ("mm: move MADV_FREE pages into LRU_INACTIVE_FILE list").
Link: https://lkml.kernel.org/r/20221111034639.3593380-1-wenjian1@xiaomi.com
Signed-off-by: Jian Wen <wenjian1@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
alloc_memory_type() returns error pointers on error instead of NULL. Use
IS_ERR() to check the return value to fix this.
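The general pattern (variable names illustrative):

  mtype = alloc_memory_type(MEMTIER_DEFAULT_DAX_ADISTANCE);
  if (IS_ERR(mtype))              /* not: if (!mtype) */
          return PTR_ERR(mtype);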
Link: https://lkml.kernel.org/r/20221110030751.1627266-1-linmq006@gmail.com
Fixes: 7b88bda376 ("mm/demotion/dax/kmem: set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Quite straightforward, the page functions are converted to the
corresponding folio functions. Same for comments.
THP-specific code is converted to use large folios.
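Typical 1:1 replacements in such a conversion (illustrative):

  /*
   * lock_page(page)      -> folio_lock(folio)
   * PageWriteback(page)  -> folio_test_writeback(folio)
   * get_page(page)       -> folio_get(folio)
   * put_page(page)       -> folio_put(folio)
   * page_mapped(page)    -> folio_mapped(folio)
   */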
Link: https://lkml.kernel.org/r/20221109012348.93849-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "migrate: convert migrate_pages()/unmap_and_move() to use
folios", v2.
The conversion is quite straightforward, just replace the page API with
the corresponding folio API. migrate_pages() and unmap_and_move() mostly
work with folios (head pages) only.
This patch (of 2):
Quite straightforward, the page functions are converted to corresponding
folio functions. Same for comments.
Link: https://lkml.kernel.org/r/20221109012348.93849-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20221109012348.93849-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Revert commit 8315682148 ("mm: migration: fix the FOLL_GET failure on
following huge page"): after commit 1a6baaa0db ("s390/hugetlb: switch to
generic version of follow_huge_pud()") and commit 57a196a584 ("hugetlb:
simplify hugetlb handling in follow_page_mask") were merged, all the huge
page following routines now support the FOLL_GET operation.
Link: https://lkml.kernel.org/r/496786039852aba90ffa68f10d0df3f4236a990b.1667983080.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Haiyue Wang <haiyue.wang@intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit fdf756f712 ("sched: Fix more TASK_state comparisons") makes
hung_task not monitor TASK_IDLE tasks. The special handling to work
around hung_task warnings is not required anymore.
Link: https://lkml.kernel.org/r/1667986006-25420-1-git-send-email-quic_pkondeti@quicinc.com
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new flag to the zram block state that shows if the page is
incompressible: none of the algorithms (including secondary ones) could
compress it.
Link: https://lkml.kernel.org/r/20221109115047.2921851-14-senozhatsky@chromium.org
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Recompression iterates through all the registered secondary compression
algorithms in order of their priorities so that we have higher chances of
finding the algorithm that compresses a particular page. This, however,
may not always be the best approach and sometimes we may want to limit
recompression to only one particular algorithm. For instance, when a
higher priority algorithm uses too much power and the device has a
relatively low battery level we may want to limit recompression to use
only a lower priority algorithm, which uses less power.
Introduce algo= parameter support to the recompression sysfs knob so that
user-space can request recompression with a particular algorithm only:
echo "type=idle algo=zstd" > /sys/block/zramX/recompress
Link: https://lkml.kernel.org/r/20221109115047.2921851-11-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It makes no sense for us to recompress the object if it will land in the
same size class: we don't get any memory gain, but, at the same time, we
get a CPU time overhead when inserting this object into the zspage and
decompressing it afterwards.
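In the recompression loop this is approximately (helper introduced
earlier in the series; treat as a sketch):

  class_index_old = zs_lookup_class_index(zram->mem_pool, comp_len_old);
  class_index_new = zs_lookup_class_index(zram->mem_pool, comp_len_new);

  /* no zsmalloc size class gain: recompression would be pure overhead */
  if (class_index_new >= class_index_old)
          continue;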
[senozhatsky: rebased and fixed conflicts]
Link: https://lkml.kernel.org/r/20221109115047.2921851-9-senozhatsky@chromium.org
Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new flag to the zram block state that shows if the page was
recompressed (using an alternative compression algorithm).
Link: https://lkml.kernel.org/r/20221109115047.2921851-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Allow zram to recompress pages using secondary compression streams.
Re-compression algorithms (we support up to 3 at this stage)
are selected via recomp_algorithm:
echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
Please read documentation for more details.
We support several recompression modes:
1) IDLE pages recompression is activated by `idle` mode
echo "type=idle" > /sys/block/zram0/recompress
2) Since there may be many idle pages, user-space may pass a size
threshold value (in bytes) and we will recompress only pages of equal
or greater size:
echo "threshold=888" > /sys/block/zram0/recompress
3) HUGE pages recompression is activated by `huge` mode
echo "type=huge" > /sys/block/zram0/recompress
4) HUGE_IDLE pages recompression is activated by `huge_idle` mode
echo "type=huge_idle" > /sys/block/zram0/recompress
[senozhatsky@chromium.org: we should always zero out err variable in recompress loop]
Link: https://lkml.kernel.org/r/20221110143423.3250790-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20221109115047.2921851-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce recomp_algorithm sysfs knob that controls secondary algorithm
selection used for recompression.
We will support up to 3 secondary compression algorithms which are sorted
in order of their priority. To select an algorithm, the user has to
provide its name and priority:
echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm
During recompression zram iterates through the list of registered
secondary algorithms in order of their priorities.
We also have a short version for cases when there is only
one secondary compression algorithm:
echo "algo=zstd" > /sys/block/zramX/recomp_algorithm
This will register zstd as the secondary algorithm with priority 1.
Link: https://lkml.kernel.org/r/20221109115047.2921851-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "zram: Support multiple compression streams", v5.
This series adds support for multiple compression streams. The main idea
is that different compression algorithms have different characteristics
and zram may benefit when it uses a combination of algorithms: a default
algorithm that is faster but has a lower compression ratio, and a
secondary algorithm that achieves a higher compression ratio at the price
of slower compression/decompression.
There are several use-cases for this functionality:
- huge pages re-compression: zstd or deflate can successfully compress
huge pages (~50% of huge pages on my synthetic ChromeOS tests), IOW
pages that lzo was not able to compress.
- idle pages re-compression: idle/cold pages sit in the memory and we
may reduce zsmalloc memory usage if we recompress those idle pages.
Userspace has a number of ways to control the behavior and impact of zram
recompression: what type of pages should be recompressed, size watermarks,
etc. Please refer to the documentation patch.
This patch (of 13):
The patch turns the compression stream and compressor algorithm name
members of struct zram into arrays, so that we can support multiple
compression streams (in the next patches).
The patch uses a rather explicit API for compressor selection:
- Get primary (default) compression stream
zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP])
- Get secondary compression stream
zcomp_stream_get(zram->comps[ZRAM_SECONDARY_COMP])
We use similar API for compression streams put().
At this point we always have just one compression stream,
since CONFIG_ZRAM_MULTI_COMP is not yet defined.
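A sketch of the resulting struct zram layout (array size illustrative):

  enum {
          ZRAM_PRIMARY_COMP = 0U,
          ZRAM_SECONDARY_COMP,
          ZRAM_MAX_COMPS,         /* grows with CONFIG_ZRAM_MULTI_COMP */
  };

  struct zram {
          struct zcomp *comps[ZRAM_MAX_COMPS];
          const char *comp_algs[ZRAM_MAX_COMPS];
          /* ... */
  };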
Link: https://lkml.kernel.org/r/20221109115047.2921851-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20221109115047.2921851-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Alexey Romanov <avromanov@sberdevices.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The delayed_rmap flag of 'struct mmu_gather' is rather a private member,
but it is still accessed directly. Instead, let the TLB gather code
access the flag.
Link: https://lkml.kernel.org/r/Y3SWCu6NRaMQ5dbD@li-4a3a4a4c-28e5-11b2-a85c-a8d192c6f089.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.
However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with the
page table lock because the page had already been made invisible to rmap.
And that is a problem, because while the TLB entry exists, we could end up
with the following situation:
(a) one CPU could come in and clean it, never seeing our mapping of the
page
(b) another CPU could continue to use the stale and dirty TLB entry and
continue to write to said page
resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.
End result: possibly lost dirty data.
This extends our current TLB gather infrastructure to optionally track a
"should I do a delayed page_remove_rmap() for this page after flushing the
TLB". It uses the newly introduced 'encoded page pointer' to do that
without having to keep separate data around.
Note, this is complicated by a couple of issues:
- we want to delay the rmap removal, but not past the page table lock,
because that simplifies the memcg accounting
- only SMP configurations want to delay TLB flushing, since on UP
there are obviously no remote TLBs to worry about, and the page
table lock means there are no preemption issues either
- s390 has its own mmu_gather model that doesn't delay TLB flushing,
and as a result also does not want the delayed rmap. As such, we can
treat S390 like the UP case and use a common fallback for the "no
delays" case.
- we can track an enormous number of pages in our mmu_gather structure,
with MAX_GATHER_BATCH_COUNT batches of MAX_GATHER_BATCH pages each,
all set up to be approximately 10k pending pages.
We do not want to have a huge number of batched pages that we then
need to check for delayed rmap handling inside the page table lock.
Particularly that last point results in a noteworthy detail, where the
normal page batch gathering is limited once we have delayed rmaps pending,
in such a way that only the last batch (the so-called "active batch") in
the mmu_gather structure can have any delayed entries.
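Flushing the delayed rmaps under the page table lock then roughly looks
like this (sketch):

  void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
  {
          struct mmu_gather_batch *batch = tlb->active;
          int i;

          if (!tlb->delayed_rmap)
                  return;

          for (i = 0; i < batch->nr; i++) {
                  struct encoded_page *enc = batch->encoded_pages[i];

                  if (encoded_page_flags(enc)) {
                          struct page *page = encoded_page_ptr(enc);

                          page_remove_rmap(page, vma, false);
                  }
          }
          tlb->delayed_rmap = 0;
  }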
NOTE! While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.
So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly even
when user space goes off the rails.
[akpm@linux-foundation.org: fix UP build, per Linus]
Link: https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
Link: https://lkml.kernel.org/r/20221109203051.1835763-4-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is purely a preparatory patch that makes all the data structures
ready for encoding flags with the mmu_gather page pointers.
The code currently always sets the flag to zero and doesn't use it yet,
but now it's tracking the type state along. The next step will be to
actually start using it.
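The encoding packs flag bits into the low bits of the (aligned) page
pointer; roughly:

  #define ENCODE_PAGE_BITS        3ul

  static __always_inline struct encoded_page *encode_page(struct page *page,
                                                          unsigned long flags)
  {
          BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
          return (struct encoded_page *)(flags | (unsigned long)page);
  }

  static inline unsigned long encoded_page_flags(struct encoded_page *page)
  {
          return ENCODE_PAGE_BITS & (unsigned long)page;
  }

  static inline struct page *encoded_page_ptr(struct encoded_page *page)
  {
          return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
  }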
Link: https://lkml.kernel.org/r/20221109203051.1835763-3-torvalds@linux-foundation.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>