Commit Graph

20054 Commits

Author SHA1 Message Date
Jakub Kicinski 681c5b51dc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Adjacent changes:

net/mptcp/protocol.h
  63740448a3 ("mptcp: fix accept vs worker race")
  2a6a870e44 ("mptcp: stops worker on unaccepted sockets at listener close")
  ddb1a072f8 ("mptcp: move first subflow allocation at mpc access time")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 16:29:51 -07:00
Mel Gorman 4d73ba5fa7 mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
taking an excessive amount of time for large amounts of memory.  Further
testing allocating huge pages that the cost is linear i.e.  if allocating
1G pages in batches of 10 then the time to allocate nr_hugepages from
10->20->30->etc increases linearly even though 10 pages are allocated at
each step.  Profiles indicated that much of the time is spent checking the
validity within already existing huge pages and then attempting a
migration that fails after isolating the range, draining pages and a whole
lot of other useless work.

Commit eb14d4eefd ("mm,page_alloc: drop unnecessary checks from
pfn_range_valid_contig") removed two checks, one which ignored huge pages
for contiguous allocations as huge pages can sometimes migrate.  While
there may be value on migrating a 2M page to satisfy a 1G allocation, it's
potentially expensive if the 1G allocation fails and it's pointless to try
moving a 1G page for a new 1G allocation or scan the tail pages for valid
PFNs.

Reintroduce the PageHuge check and assume any contiguous region with
hugetlbfs pages is unsuitable for a new 1G allocation.

The hpagealloc test allocates huge pages in batches and reports the
average latency per page over time.  This test happens just after boot
when fragmentation is not an issue.  Units are in milliseconds.

hpagealloc
                               6.3.0-rc6              6.3.0-rc6              6.3.0-rc6
                                 vanilla   hugeallocrevert-v1r1   hugeallocsimple-v1r2
Min       Latency       26.42 (   0.00%)        5.07 (  80.82%)       18.94 (  28.30%)
1st-qrtle Latency      356.61 (   0.00%)        5.34 (  98.50%)       19.85 (  94.43%)
2nd-qrtle Latency      697.26 (   0.00%)        5.47 (  99.22%)       20.44 (  97.07%)
3rd-qrtle Latency      972.94 (   0.00%)        5.50 (  99.43%)       20.81 (  97.86%)
Max-1     Latency       26.42 (   0.00%)        5.07 (  80.82%)       18.94 (  28.30%)
Max-5     Latency       82.14 (   0.00%)        5.11 (  93.78%)       19.31 (  76.49%)
Max-10    Latency      150.54 (   0.00%)        5.20 (  96.55%)       19.43 (  87.09%)
Max-90    Latency     1164.45 (   0.00%)        5.53 (  99.52%)       20.97 (  98.20%)
Max-95    Latency     1223.06 (   0.00%)        5.55 (  99.55%)       21.06 (  98.28%)
Max-99    Latency     1278.67 (   0.00%)        5.57 (  99.56%)       22.56 (  98.24%)
Max       Latency     1310.90 (   0.00%)        8.06 (  99.39%)       26.62 (  97.97%)
Amean     Latency      678.36 (   0.00%)        5.44 *  99.20%*       20.44 *  96.99%*

                   6.3.0-rc6   6.3.0-rc6   6.3.0-rc6
                     vanilla   revert-v1   hugeallocfix-v2
Duration User           0.28        0.27        0.30
Duration System       808.66       17.77       35.99
Duration Elapsed      830.87       18.08       36.33

The vanilla kernel is poor, taking up to 1.3 second to allocate a huge
page and almost 10 minutes in total to run the test.  Reverting the
problematic commit reduces it to 8ms at worst and the patch takes 26ms. 
This patch fixes the main issue with skipping huge pages but leaves the
page_count() out because a page with an elevated count potentially can
migrate.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=217022
Link: https://lkml.kernel.org/r/20230414141429.pwgieuwluxwez3rj@techsingularity.net
Fixes: eb14d4eefd ("mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Yuanxi Liu <y.liu@naruida.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 14:22:14 -07:00
Liam R. Howlett 58c5d0d6d5 mm/mmap: regression fix for unmapped_area{_topdown}
The maple tree limits the gap returned to a window that specifically fits
what was asked.  This may not be optimal in the case of switching search
directions or a gap that does not satisfy the requested space for other
reasons.  Fix the search by retrying the operation and limiting the search
window in the rare occasion that a conflict occurs.

Link: https://lkml.kernel.org/r/20230414185919.4175572-1-Liam.Howlett@oracle.com
Fixes: 3499a13168 ("mm/mmap: use maple tree for unmapped_area{_topdown}")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 14:22:14 -07:00
Alexander Potapenko fdea03e12a mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
Similarly to kmsan_vmap_pages_range_noflush(), kmsan_ioremap_page_range()
must also properly handle allocation/mapping failures.  In the case of
such, it must clean up the already created metadata mappings and return an
error code, so that the error can be propagated to ioremap_page_range(). 
Without doing so, KMSAN may silently fail to bring the metadata for the
page range into a consistent state, which will result in user-visible
crashes when trying to access them.

Link: https://lkml.kernel.org/r/20230413131223.4135168-2-glider@google.com
Fixes: b073d7f8ae ("mm: kmsan: maintain KMSAN metadata for page operations")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
  Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
Reviewed-by: Marco Elver <elver@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 14:22:13 -07:00
Alexander Potapenko 47ebd0310e mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
As reported by Dipanjan Das, when KMSAN is used together with kernel fault
injection (or, generally, even without the latter), calls to kcalloc() or
__vmap_pages_range_noflush() may fail, leaving the metadata mappings for
the virtual mapping in an inconsistent state.  When these metadata
mappings are accessed later, the kernel crashes.

To address the problem, we return a non-zero error code from
kmsan_vmap_pages_range_noflush() in the case of any allocation/mapping
failure inside it, and make vmap_pages_range_noflush() return an error if
KMSAN fails to allocate the metadata.

This patch also removes KMSAN_WARN_ON() from vmap_pages_range_noflush(),
as these allocation failures are not fatal anymore.

Link: https://lkml.kernel.org/r/20230413131223.4135168-1-glider@google.com
Fixes: b073d7f8ae ("mm: kmsan: maintain KMSAN metadata for page operations")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
  Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
Reviewed-by: Marco Elver <elver@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 14:22:13 -07:00
Tetsuo Handa 1007843a91 mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.

One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.

  CPU0
  ----
  __build_all_zonelists() {
    write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
    // e.g. timer interrupt handler runs at this moment
      some_timer_func() {
        kmalloc(GFP_ATOMIC) {
          __alloc_pages_slowpath() {
            read_seqbegin(&zonelist_update_seq) {
              // spins forever because zonelist_update_seq.seqcount is odd
            }
          }
        }
      }
    // e.g. timer interrupt handler finishes
    write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
  }

This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests.  But Michal Hocko does not know whether we should go with this
approach.

Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.

  CPU0                                   CPU1
  ----                                   ----
  pty_write() {
    tty_insert_flip_string_and_push_buffer() {
                                         __build_all_zonelists() {
                                           write_seqlock(&zonelist_update_seq);
                                           build_zonelists() {
                                             printk() {
                                               vprintk() {
                                                 vprintk_default() {
                                                   vprintk_emit() {
                                                     console_unlock() {
                                                       console_flush_all() {
                                                         console_emit_next_record() {
                                                           con->write() = serial8250_console_write() {
      spin_lock_irqsave(&port->lock, flags);
      tty_insert_flip_string() {
        tty_insert_flip_string_fixed_flag() {
          __tty_buffer_request_room() {
            tty_buffer_alloc() {
              kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                __alloc_pages_slowpath() {
                  zonelist_iter_begin() {
                    read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
                                                             spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
                    }
                  }
                }
              }
            }
          }
        }
      }
      spin_unlock_irqrestore(&port->lock, flags);
                                                             // message is printed to console
                                                             spin_unlock_irqrestore(&port->lock, flags);
                                                           }
                                                         }
                                                       }
                                                     }
                                                   }
                                                 }
                                               }
                                             }
                                           }
                                           write_sequnlock(&zonelist_update_seq);
                                         }
    }
  }

This deadlock scenario can be eliminated by

  preventing interrupt context from calling kmalloc(GFP_ATOMIC)

and

  preventing printk() from calling console_flush_all()

while zonelist_update_seq.seqcount is odd.

Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by

  disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)

and

  disabling synchronous printk() in order to avoid console_flush_all()

.

As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced.  Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC)
inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...

Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp
Fixes: 3d36424b3b ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
  Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Patrick Daly <quic_pdaly@quicinc.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 14:22:12 -07:00
Baokun Li 1ba1199ec5 writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
KASAN report null-ptr-deref:
==================================================================
BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
Write of size 8 at addr 0000000000000000 by task sync/943
CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
Call Trace:
 <TASK>
 dump_stack_lvl+0x7f/0xc0
 print_report+0x2ba/0x340
 kasan_report+0xc4/0x120
 kasan_check_range+0x1b7/0x2e0
 __kasan_check_write+0x24/0x40
 bdi_split_work_to_wbs+0x5c5/0x7b0
 sync_inodes_sb+0x195/0x630
 sync_inodes_one_sb+0x3a/0x50
 iterate_supers+0x106/0x1b0
 ksys_sync+0x98/0x160
[...]
==================================================================

The race that causes the above issue is as follows:

           cpu1                     cpu2
-------------------------|-------------------------
inode_switch_wbs
 INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
 queue_rcu_work(isw_wq, &isw->work)
 // queue_work async
  inode_switch_wbs_work_fn
   wb_put_many(old_wb, nr_switched)
    percpu_ref_put_many
     ref->data->release(ref)
     cgwb_release
      queue_work(cgwb_release_wq, &wb->release_work)
      // queue_work async
       &wb->release_work
       cgwb_release_workfn
                            ksys_sync
                             iterate_supers
                              sync_inodes_one_sb
                               sync_inodes_sb
                                bdi_split_work_to_wbs
                                 kmalloc(sizeof(*work), GFP_ATOMIC)
                                 // alloc memory failed
        percpu_ref_exit
         ref->data = NULL
         kfree(data)
                                 wb_get(wb)
                                  percpu_ref_get(&wb->refcnt)
                                   percpu_ref_get_many(ref, 1)
                                    atomic_long_add(nr, &ref->data->count)
                                     atomic64_add(i, v)
                                     // trigger null-ptr-deref

bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
wbs.  If the allocation of new work fails, the on-stack fallback will be
used and the reference count of the current wb is increased afterwards. 
If cgroup writeback membership switches occur before getting the reference
count and the current wb is released as old_wd, then calling wb_get() or
wb_put() will trigger the null pointer dereference above.

This issue was introduced in v4.3-rc7 (see fix tag1).  Both
sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
bdi_split_work_to_wbs() can trigger this issue.  For scenarios called via
sync_inodes_sb(), originally commit 7fc5854f8c ("writeback: synchronize
sync(2) against cgroup writeback membership switches") reduced the
possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see
fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
and the issue becomes easily reproducible again.

To solve this problem, percpu_ref_exit() is called under RCU protection to
avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). 
Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
and skip the current wb if wb_tryget() fails because the wb has already
been shutdown.

Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
Fixes: b817525a4a ("writeback: bdi_writeback iteration must not skip dying ones")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-16 10:41:26 -07:00
Liam R. Howlett f4e9e0e694 mm/mempolicy: fix use-after-free of VMA iterator
set_mempolicy_home_node() iterates over a list of VMAs and calls
mbind_range() on each VMA, which also iterates over the singular list of
the VMA passed in and potentially splits the VMA.  Since the VMA iterator
is not passed through, set_mempolicy_home_node() may now point to a stale
node in the VMA tree.  This can result in a UAF as reported by syzbot.

Avoid the stale maple tree node by passing the VMA iterator through to the
underlying call to split_vma().

mbind_range() is also overly complicated, since there are two calling
functions and one already handles iterating over the VMAs.  Simplify
mbind_range() to only handle merging and splitting of the VMAs.

Align the new loop in do_mbind() and existing loop in
set_mempolicy_home_node() to use the reduced mbind_range() function.  This
allows for a single location of the range calculation and avoids
constantly looking up the previous VMA (since this is a loop over the
VMAs).

Link: https://lore.kernel.org/linux-mm/000000000000c93feb05f87e24ad@google.com/
Fixes: 66850be55e ("mm/mempolicy: use vma iterator & maple state instead of vma linked list")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: syzbot+a7c1ec5b1d71ceaa5186@syzkaller.appspotmail.com
  Link: https://lkml.kernel.org/r/20230410152205.2294819-1-Liam.Howlett@oracle.com
Tested-by: syzbot+a7c1ec5b1d71ceaa5186@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-16 10:41:25 -07:00
Naoya Horiguchi 4737edbbdd mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO
split_huge_page_to_list() WARNs when called for huge zero pages, which
sounds to me too harsh because it does not imply a kernel bug, but just
notifies the event to admins.  On the other hand, this is considered as
critical by syzkaller and makes its testing less efficient, which seems to
me harmful.

So replace the VM_WARN_ON_ONCE_FOLIO with pr_warn_ratelimited.

Link: https://lkml.kernel.org/r/20230406082004.2185420-1-naoya.horiguchi@linux.dev
Fixes: 478d134e95 ("mm/huge_memory: do not overkill when splitting huge_zero_page")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: syzbot+07a218429c8d19b1fb25@syzkaller.appspotmail.com
  Link: https://lore.kernel.org/lkml/000000000000a6f34a05e6efcd01@google.com/
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Xu Yu <xuyu@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-16 10:41:25 -07:00
Liam R. Howlett 82f951340f mm/mprotect: fix do_mprotect_pkey() return on error
When the loop over the VMA is terminated early due to an error, the return
code could be overwritten with ENOMEM.  Fix the return code by only
setting the error on early loop termination when the error is not set.

User-visible effects include: attempts to run mprotect() against a
special mapping or with a poorly-aligned hugetlb address should return
-EINVAL, but they presently return -ENOMEM.  In other cases an -EACCESS
should be returned.

Link: https://lkml.kernel.org/r/20230406193050.1363476-1-Liam.Howlett@oracle.com
Fixes: 2286a6914c ("mm: change mprotect_fixup to vma iterator")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-16 10:41:24 -07:00
Peter Xu dd47ac428c mm/khugepaged: check again on anon uffd-wp during isolation
Khugepaged collapse an anonymous thp in two rounds of scans.  The 2nd
round done in __collapse_huge_page_isolate() after
hpage_collapse_scan_pmd(), during which all the locks will be released
temporarily.  It means the pgtable can change during this phase before 2nd
round starts.

It's logically possible some ptes got wr-protected during this phase, and
we can errornously collapse a thp without noticing some ptes are
wr-protected by userfault.  e1e267c792 wanted to avoid it but it only
did that for the 1st phase, not the 2nd phase.

Since __collapse_huge_page_isolate() happens after a round of small page
swapins, we don't need to worry on any !present ptes - if it existed
khugepaged will already bail out.  So we only need to check present ptes
with uffd-wp bit set there.

This is something I found only but never had a reproducer, I thought it
was one caused a bug in Muhammad's recent pagemap new ioctl work, but it
turns out it's not the cause of that but an userspace bug.  However this
seems to still be a real bug even with a very small race window, still
worth to have it fixed and copy stable.

Link: https://lkml.kernel.org/r/20230405155120.3608140-1-peterx@redhat.com
Fixes: e1e267c792 ("khugepaged: skip collapse if uffd-wp detected")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-16 10:41:24 -07:00
David Hildenbrand 24bf08c437 mm/userfaultfd: fix uffd-wp handling for THP migration entries
Looks like what we fixed for hugetlb in commit 44f86392bd ("mm/hugetlb:
fix uffd-wp handling for migration entries in
hugetlb_change_protection()") similarly applies to THP.

Setting/clearing uffd-wp on THP migration entries is not implemented
properly.  Further, while removing migration PMDs considers the uffd-wp
bit, inserting migration PMDs does not consider the uffd-wp bit.

We have to set/clear independently of the migration entry type in
change_huge_pmd() and properly copy the uffd-wp bit in
set_pmd_migration_entry().

Verified using a simple reproducer that triggers migration of a THP, that
the set_pmd_migration_entry() no longer loses the uffd-wp bit.

Link: https://lkml.kernel.org/r/20230405160236.587705-2-david@redhat.com
Fixes: f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-16 10:41:24 -07:00
Qi Zheng 998ad18b00 mm: swap: fix performance regression on sparsetruncate-tiny
The ->percpu_pvec_drained was originally introduced by commit d9ed0d08b6
("mm: only drain per-cpu pagevecs once per pagevec usage") to drain
per-cpu pagevecs only once per pagevec usage.  But after converting the
swap code to be more folio-based, the commit c2bc16817a ("mm/swap: add
folio_batch_move_lru()") breaks this logic, which would cause
->percpu_pvec_drained to be reset to false, that means per-cpu pagevecs
will be drained multiple times per pagevec usage.

In theory, there should be no functional changes when converting code to
be more folio-based.  We should call folio_batch_reinit() in
folio_batch_move_lru() instead of folio_batch_init().  And to verify that
we still need ->percpu_pvec_drained, I ran mmtests/sparsetruncate-tiny and
got the following data:

                             baseline                   with
                            baseline/                 patch/
Min       Time      326.00 (   0.00%)      328.00 (  -0.61%)
1st-qrtle Time      334.00 (   0.00%)      336.00 (  -0.60%)
2nd-qrtle Time      338.00 (   0.00%)      341.00 (  -0.89%)
3rd-qrtle Time      343.00 (   0.00%)      347.00 (  -1.17%)
Max-1     Time      326.00 (   0.00%)      328.00 (  -0.61%)
Max-5     Time      327.00 (   0.00%)      330.00 (  -0.92%)
Max-10    Time      328.00 (   0.00%)      331.00 (  -0.91%)
Max-90    Time      350.00 (   0.00%)      357.00 (  -2.00%)
Max-95    Time      395.00 (   0.00%)      390.00 (   1.27%)
Max-99    Time      508.00 (   0.00%)      434.00 (  14.57%)
Max       Time      547.00 (   0.00%)      476.00 (  12.98%)
Amean     Time      344.61 (   0.00%)      345.56 *  -0.28%*
Stddev    Time       30.34 (   0.00%)       19.51 (  35.69%)
CoeffVar  Time        8.81 (   0.00%)        5.65 (  35.87%)
BAmean-99 Time      342.38 (   0.00%)      344.27 (  -0.55%)
BAmean-95 Time      338.58 (   0.00%)      341.87 (  -0.97%)
BAmean-90 Time      336.89 (   0.00%)      340.26 (  -1.00%)
BAmean-75 Time      335.18 (   0.00%)      338.40 (  -0.96%)
BAmean-50 Time      332.54 (   0.00%)      335.42 (  -0.87%)
BAmean-25 Time      329.30 (   0.00%)      332.00 (  -0.82%)

From the above it can be seen that we get similar data to when
->percpu_pvec_drained was introduced, so we still need it.  Let's call
folio_batch_reinit() in folio_batch_move_lru() to restore the original
logic.

Link: https://lkml.kernel.org/r/20230405161854.6931-1-zhengqi.arch@bytedance.com
Fixes: c2bc16817a ("mm/swap: add folio_batch_move_lru()")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-16 10:41:24 -07:00
Jakub Kicinski c2865b1122 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZDhSiwAKCRDbK58LschI
 g8cbAQCH4xrquOeDmYyGXFQGchHZAIj++tKg8ABU4+hYeJtrlwEA6D4W6wjoSZRk
 mLSptZ9qro8yZA86BvyPvlBT1h9ELQA=
 =StAc
 -----END PGP SIGNATURE-----

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-04-13

We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).

The main changes are:

1) Rework BPF verifier log behavior and implement it as a rotating log
   by default with the option to retain old-style fixed log behavior,
   from Andrii Nakryiko.

2) Adds support for using {FOU,GUE} encap with an ipip device operating
   in collect_md mode and add a set of BPF kfuncs for controlling encap
   params, from Christian Ehrig.

3) Allow BPF programs to detect at load time whether a particular kfunc
   exists or not, and also add support for this in light skeleton,
   from Alexei Starovoitov.

4) Optimize hashmap lookups when key size is multiple of 4,
   from Anton Protopopov.

5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
   tasks to be stored in BPF maps, from David Vernet.

6) Add support for stashing local BPF kptr into a map value via
   bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
   for new cgroups, from Dave Marchevsky.

7) Fix BTF handling of is_int_ptr to skip modifiers to work around
   tracing issues where a program cannot be attached, from Feng Zhou.

8) Migrate a big portion of test_verifier unit tests over to
   test_progs -a verifier_* via inline asm to ease {read,debug}ability,
   from Eduard Zingerman.

9) Several updates to the instruction-set.rst documentation
   which is subject to future IETF standardization
   (https://lwn.net/Articles/926882/), from Dave Thaler.

10) Fix BPF verifier in the __reg_bound_offset's 64->32 tnum sub-register
    known bits information propagation, from Daniel Borkmann.

11) Add skb bitfield compaction work related to BPF with the overall goal
    to make more of the sk_buff bits optional, from Jakub Kicinski.

12) BPF selftest cleanups for build id extraction which stand on its own
    from the upcoming integration work of build id into struct file object,
    from Jiri Olsa.

13) Add fixes and optimizations for xsk descriptor validation and several
    selftest improvements for xsk sockets, from Kal Conley.

14) Add BPF links for struct_ops and enable switching implementations
    of BPF TCP cong-ctls under a given name by replacing backing
    struct_ops map, from Kui-Feng Lee.

15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
    offset stack read as earlier Spectre checks cover this,
    from Luis Gerhorst.

16) Fix issues in copy_from_user_nofault() for BPF and other tracers
    to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
    and Alexei Starovoitov.

17) Add --json-summary option to test_progs in order for CI tooling to
    ease parsing of test results, from Manu Bretelle.

18) Batch of improvements and refactoring to prep for upcoming
    bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
    from Martin KaFai Lau.

19) Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations,
    from Quentin Monnet.

20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
    the module name from BTF of the target and searching kallsyms of
    the correct module, from Viktor Malik.

21) Improve BPF verifier handling of '<const> <cond> <non_const>'
    to better detect whether in particular jmp32 branches are taken,
    from Yonghong Song.

22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
    A built-in cc or one from a kernel module is already able to write
    to app_limited, from Yixin Shen.

Conflicts:

Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c65 ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  0f10f647f4 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

include/net/ip_tunnels.h
  bc9d003dc4 ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
  ac931d4cde ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/

net/bpf/test_run.c
  e5995bc7e2 ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
  294635a816 ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================

Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 16:43:38 -07:00
Alexei Starovoitov d319f34456 mm: Fix copy_from_user_nofault().
There are several issues with copy_from_user_nofault():

- access_ok() is designed for user context only and for that reason
it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe
and perf on ppc are calling it from irq.

- it's missing nmi_uaccess_okay() which is a nop on all architectures
except x86 where it's required.
The comment in arch/x86/mm/tlb.c explains the details why it's necessary.
Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe.

- __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling
check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock()
which is not safe to do from bpf, [ke]probe and perf due to potential deadlock.

Fix all three issues. At the end the copy_from_user_nofault() becomes
equivalent to copy_from_user_nmi() from safety point of view with
a difference in the return value.

Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Florian Lehner <dev@der-flo.net>
Tested-by: Hsin-Wei Hung <hsinweih@uci.edu>
Tested-by: Florian Lehner <dev@der-flo.net>
Link: https://lore.kernel.org/r/20230410174345.4376-2-dev@der-flo.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-12 17:36:23 -07:00
Rongwei Wang 6fe7d6b992 mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()
The si->lock must be held when deleting the si from the available list. 
Otherwise, another thread can re-add the si to the available list, which
can lead to memory corruption.  The only place we have found where this
happens is in the swapoff path.  This case can be described as below:

core 0                       core 1
swapoff

del_from_avail_list(si)      waiting

try lock si->lock            acquire swap_avail_lock
                             and re-add si into
                             swap_avail_head

acquire si->lock but missing si already being added again, and continuing
to clear SWP_WRITEOK, etc.

It can be easily found that a massive warning messages can be triggered
inside get_swap_pages() by some special cases, for example, we call
madvise(MADV_PAGEOUT) on blocks of touched memory concurrently, meanwhile,
run much swapon-swapoff operations (e.g.  stress-ng-swap).

However, in the worst case, panic can be caused by the above scene.  In
swapoff(), the memory used by si could be kept in swap_info[] after
turning off a swap.  This means memory corruption will not be caused
immediately until allocated and reset for a new swap in the swapon path. 
A panic message caused: (with CONFIG_PLIST_DEBUG enabled)

------------[ cut here ]------------
top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a
prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d
next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a
WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70
Modules linked in: rfkill(E) crct10dif_ce(E)...
CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+
Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
pc : plist_check_prev_next_node+0x50/0x70
lr : plist_check_prev_next_node+0x50/0x70
sp : ffff0018009d3c30
x29: ffff0018009d3c40 x28: ffff800011b32a98
x27: 0000000000000000 x26: ffff001803908000
x25: ffff8000128ea088 x24: ffff800011b32a48
x23: 0000000000000028 x22: ffff001800875c00
x21: ffff800010f9e520 x20: ffff001800875c00
x19: ffff001800fdc6e0 x18: 0000000000000030
x17: 0000000000000000 x16: 0000000000000000
x15: 0736076307640766 x14: 0730073007380731
x13: 0736076307640766 x12: 0730073007380731
x11: 000000000004058d x10: 0000000085a85b76
x9 : ffff8000101436e4 x8 : ffff800011c8ce08
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0017df9ed338 x4 : 0000000000000001
x3 : ffff8017ce62a000 x2 : ffff0017df9ed340
x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
 plist_check_prev_next_node+0x50/0x70
 plist_check_head+0x80/0xf0
 plist_add+0x28/0x140
 add_to_avail_list+0x9c/0xf0
 _enable_swap_info+0x78/0xb4
 __do_sys_swapon+0x918/0xa10
 __arm64_sys_swapon+0x20/0x30
 el0_svc_common+0x8c/0x220
 do_el0_svc+0x2c/0x90
 el0_svc+0x1c/0x30
 el0_sync_handler+0xa8/0xb0
 el0_sync+0x148/0x180
irq event stamp: 2082270

Now, si->lock locked before calling 'del_from_avail_list()' to make sure
other thread see the si had been deleted and SWP_WRITEOK cleared together,
will not reinsert again.

This problem exists in versions after stable 5.10.y.

Link: https://lkml.kernel.org/r/20230404154716.23058-1-rongwei.wang@linux.alibaba.com
Fixes: a2468cc9bf ("swap: choose swap device according to numa node") 
Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 18:06:24 -07:00
Alistair Popple 7c7b962938 mm: take a page reference when removing device exclusive entries
Device exclusive page table entries are used to prevent CPU access to a
page whilst it is being accessed from a device.  Typically this is used to
implement atomic operations when the underlying bus does not support
atomic access.  When a CPU thread encounters a device exclusive entry it
locks the page and restores the original entry after calling mmu notifiers
to signal drivers that exclusive access is no longer available.

The device exclusive entry holds a reference to the page making it safe to
access the struct page whilst the entry is present.  However the fault
handling code does not hold the PTL when taking the page lock.  This means
if there are multiple threads faulting concurrently on the device
exclusive entry one will remove the entry whilst others will wait on the
page lock without holding a reference.

This can lead to threads locking or waiting on a folio with a zero
refcount.  Whilst mmap_lock prevents the pages getting freed via munmap()
they may still be freed by a migration.  This leads to warnings such as
PAGE_FLAGS_CHECK_AT_FREE due to the page being locked when the refcount
drops to zero.

Fix this by trying to take a reference on the folio before locking it. 
The code already checks the PTE under the PTL and aborts if the entry is
no longer there.  It is also possible the folio has been unmapped, freed
and re-allocated allowing a reference to be taken on an unrelated folio. 
This case is also detected by the PTE check and the folio is unlocked
without further changes.

Link: https://lkml.kernel.org/r/20230330012519.804116-1-apopple@nvidia.com
Fixes: b756a3b5e7 ("mm: device exclusive memory access")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 18:06:24 -07:00
Yafang Shao f349b15e18 mm: vmalloc: avoid warn_alloc noise caused by fatal signal
There're some suspicious warn_alloc on my test serer, for example,

[13366.518837] warn_alloc: 81 callbacks suppressed
[13366.518841] test_verifier: vmalloc error: size 4096, page order 0, failed to allocate pages, mode:0x500dc2(GFP_HIGHUSER|__GFP_ZERO|__GFP_ACCOUNT), nodemask=(null),cpuset=/,mems_allowed=0-1
[13366.522240] CPU: 30 PID: 722463 Comm: test_verifier Kdump: loaded Tainted: G        W  O       6.2.0+ #638
[13366.524216] Call Trace:
[13366.524702]  <TASK>
[13366.525148]  dump_stack_lvl+0x6c/0x80
[13366.525712]  dump_stack+0x10/0x20
[13366.526239]  warn_alloc+0x119/0x190
[13366.526783]  ? alloc_pages_bulk_array_mempolicy+0x9e/0x2a0
[13366.527470]  __vmalloc_area_node+0x546/0x5b0
[13366.528066]  __vmalloc_node_range+0xc2/0x210
[13366.528660]  __vmalloc_node+0x42/0x50
[13366.529186]  ? bpf_prog_realloc+0x53/0xc0
[13366.529743]  __vmalloc+0x1e/0x30
[13366.530235]  bpf_prog_realloc+0x53/0xc0
[13366.530771]  bpf_patch_insn_single+0x80/0x1b0
[13366.531351]  bpf_jit_blind_constants+0xe9/0x1c0
[13366.531932]  ? __free_pages+0xee/0x100
[13366.532457]  ? free_large_kmalloc+0x58/0xb0
[13366.533002]  bpf_int_jit_compile+0x8c/0x5e0
[13366.533546]  bpf_prog_select_runtime+0xb4/0x100
[13366.534108]  bpf_prog_load+0x6b1/0xa50
[13366.534610]  ? perf_event_task_tick+0x96/0xb0
[13366.535151]  ? security_capable+0x3a/0x60
[13366.535663]  __sys_bpf+0xb38/0x2190
[13366.536120]  ? kvm_clock_get_cycles+0x9/0x10
[13366.536643]  __x64_sys_bpf+0x1c/0x30
[13366.537094]  do_syscall_64+0x38/0x90
[13366.537554]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[13366.538107] RIP: 0033:0x7f78310f8e29
[13366.538561] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 e0 2c 00 f7 d8 64 89 01 48
[13366.540286] RSP: 002b:00007ffe2a61fff8 EFLAGS: 00000206 ORIG_RAX: 0000000000000141
[13366.541031] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f78310f8e29
[13366.541749] RDX: 0000000000000080 RSI: 00007ffe2a6200b0 RDI: 0000000000000005
[13366.542470] RBP: 00007ffe2a620010 R08: 00007ffe2a6202a0 R09: 00007ffe2a6200b0
[13366.543183] R10: 00000000000f423e R11: 0000000000000206 R12: 0000000000407800
[13366.543900] R13: 00007ffe2a620540 R14: 0000000000000000 R15: 0000000000000000
[13366.544623]  </TASK>
[13366.545260] Mem-Info:
[13366.546121] active_anon:81319 inactive_anon:20733 isolated_anon:0
 active_file:69450 inactive_file:5624 isolated_file:0
 unevictable:0 dirty:10 writeback:0
 slab_reclaimable:69649 slab_unreclaimable:48930
 mapped:27400 shmem:12868 pagetables:4929
 sec_pagetables:0 bounce:0
 kernel_misc_reclaimable:0
 free:15870308 free_pcp:142935 free_cma:0
[13366.551886] Node 0 active_anon:224836kB inactive_anon:33528kB active_file:175692kB inactive_file:13752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:59248kB dirty:32kB writeback:0kB shmem:18252kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:4616kB pagetables:10664kB sec_pagetables:0kB all_unreclaimable? no
[13366.555184] Node 1 active_anon:100440kB inactive_anon:49404kB active_file:102108kB inactive_file:8744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:50352kB dirty:8kB writeback:0kB shmem:33220kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:3896kB pagetables:9052kB sec_pagetables:0kB all_unreclaimable? no
[13366.558262] Node 0 DMA free:15360kB boost:0kB min:304kB low:380kB high:456kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[13366.560821] lowmem_reserve[]: 0 2735 31873 31873 31873
[13366.561981] Node 0 DMA32 free:2790904kB boost:0kB min:56028kB low:70032kB high:84036kB reserved_highatomic:0KB active_anon:1936kB inactive_anon:20kB active_file:396kB inactive_file:344kB unevictable:0kB writepending:0kB present:3129200kB managed:2801520kB mlocked:0kB bounce:0kB free_pcp:5188kB local_pcp:0kB free_cma:0kB
[13366.565148] lowmem_reserve[]: 0 0 29137 29137 29137
[13366.566168] Node 0 Normal free:28533824kB boost:0kB min:596740kB low:745924kB high:895108kB reserved_highatomic:28672KB active_anon:222900kB inactive_anon:33508kB active_file:175296kB inactive_file:13408kB unevictable:0kB writepending:32kB present:30408704kB managed:29837172kB mlocked:0kB bounce:0kB free_pcp:295724kB local_pcp:0kB free_cma:0kB
[13366.569485] lowmem_reserve[]: 0 0 0 0 0
[13366.570416] Node 1 Normal free:32141144kB boost:0kB min:660504kB low:825628kB high:990752kB reserved_highatomic:69632KB active_anon:100440kB inactive_anon:49404kB active_file:102108kB inactive_file:8744kB unevictable:0kB writepending:8kB present:33554432kB managed:33025372kB mlocked:0kB bounce:0kB free_pcp:270880kB local_pcp:46860kB free_cma:0kB
[13366.573403] lowmem_reserve[]: 0 0 0 0 0
[13366.574015] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB
[13366.575474] Node 0 DMA32: 782*4kB (UME) 756*8kB (UME) 736*16kB (UME) 745*32kB (UME) 694*64kB (UME) 653*128kB (UME) 595*256kB (UME) 552*512kB (UME) 454*1024kB (UME) 347*2048kB (UME) 246*4096kB (UME) = 2790904kB
[13366.577442] Node 0 Normal: 33856*4kB (UMEH) 51815*8kB (UMEH) 42418*16kB (UMEH) 36272*32kB (UMEH) 22195*64kB (UMEH) 10296*128kB (UMEH) 7238*256kB (UMEH) 5638*512kB (UEH) 5337*1024kB (UMEH) 3506*2048kB (UMEH) 1470*4096kB (UME) = 28533784kB
[13366.580460] Node 1 Normal: 15776*4kB (UMEH) 37485*8kB (UMEH) 29509*16kB (UMEH) 21420*32kB (UMEH) 14818*64kB (UMEH) 13051*128kB (UMEH) 9918*256kB (UMEH) 7374*512kB (UMEH) 5397*1024kB (UMEH) 3887*2048kB (UMEH) 2002*4096kB (UME) = 32141240kB
[13366.583027] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[13366.584380] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[13366.585702] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[13366.587042] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[13366.588372] 87386 total pagecache pages
[13366.589266] 0 pages in swap cache
[13366.590327] Free swap  = 0kB
[13366.591227] Total swap = 0kB
[13366.592142] 16777082 pages RAM
[13366.593057] 0 pages HighMem/MovableOnly
[13366.594037] 357226 pages reserved
[13366.594979] 0 pages hwpoisoned

This failure really confuse me as there're still lots of available pages. 
Finally I figured out it was caused by a fatal signal.  When a process is
allocating memory via vm_area_alloc_pages(), it will break directly even
if it hasn't allocated the requested pages when it receives a fatal
signal.  In that case, we shouldn't show this warn_alloc, as it is
useless.  We only need to show this warning when there're really no enough
pages.

Link: https://lkml.kernel.org/r/20230330162625.13604-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 18:06:24 -07:00
Peter Xu 60d5b473d6 mm/hugetlb: fix uffd wr-protection for CoW optimization path
This patch fixes an issue that a hugetlb uffd-wr-protected mapping can be
writable even with uffd-wp bit set.  It only happens with hugetlb private
mappings, when someone firstly wr-protects a missing pte (which will
install a pte marker), then a write to the same page without any prior
access to the page.

Userfaultfd-wp trap for hugetlb was implemented in hugetlb_fault() before
reaching hugetlb_wp() to avoid taking more locks that userfault won't
need.  However there's one CoW optimization path that can trigger
hugetlb_wp() inside hugetlb_no_page(), which will bypass the trap.

This patch skips hugetlb_wp() for CoW and retries the fault if uffd-wp bit
is detected.  The new path will only trigger in the CoW optimization path
because generic hugetlb_fault() (e.g.  when a present pte was
wr-protected) will resolve the uffd-wp bit already.  Also make sure
anonymous UNSHARE won't be affected and can still be resolved, IOW only
skip CoW not CoR.

This patch will be needed for v5.19+ hence copy stable.

[peterx@redhat.com: v2]
  Link: https://lkml.kernel.org/r/ZBzOqwF2wrHgBVZb@x1n
[peterx@redhat.com: v3]
  Link: https://lkml.kernel.org/r/20230324142620.2344140-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230321191840.1897940-1-peterx@redhat.com
Fixes: 166f3ecc0d ("mm/hugetlb: hook page faults for uffd write protection")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 18:06:22 -07:00
Liam R. Howlett 3dd4432549 mm: enable maple tree RCU mode by default
Use the maple tree in RCU mode for VMA tracking.

The maple tree tracks the stack and is able to update the pivot
(lower/upper boundary) in-place to allow the page fault handler to write
to the tree while holding just the mmap read lock.  This is safe as the
writes to the stack have a guard VMA which ensures there will always be a
NULL in the direction of the growth and thus will only update a pivot.

It is possible, but not recommended, to have VMAs that grow up/down
without guard VMAs.  syzbot has constructed a testcase which sets up a VMA
to grow and consume the empty space.  Overwriting the entire NULL entry
causes the tree to be altered in a way that is not safe for concurrent
readers; the readers may see a node being rewritten or one that does not
match the maple state they are using.

Enabling RCU mode allows the concurrent readers to see a stable node and
will return the expected result.

[Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU[
Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
Fixes: d4af56c5c7 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05 18:06:22 -07:00
Muchun Song 1f2803b266 mm: kfence: fix handling discontiguous page
The struct pages could be discontiguous when the kfence pool is allocated
via alloc_contig_pages() with CONFIG_SPARSEMEM and
!CONFIG_SPARSEMEM_VMEMMAP.

This may result in setting PG_slab and memcg_data to a arbitrary
address (may be not used as a struct page), which in the worst case
might corrupt the kernel.

So the iteration should use nth_page().

Link: https://lkml.kernel.org/r/20230323025003.94447-1-songmuchun@bytedance.com
Fixes: 0ce20dd840 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 15:24:32 -07:00
Muchun Song 3ee2d7471f mm: kfence: fix PG_slab and memcg_data clearing
It does not reset PG_slab and memcg_data when KFENCE fails to initialize
kfence pool at runtime.  It is reporting a "Bad page state" message when
kfence pool is freed to buddy.  The checking of whether it is a compound
head page seems unnecessary since we already guarantee this when
allocating kfence pool.   Remove the check to simplify the code.

Link: https://lkml.kernel.org/r/20230320030059.20189-1-songmuchun@bytedance.com
Fixes: 0ce20dd840 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28 15:24:32 -07:00
Linus Torvalds 65aca32efd 21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder are
for other subsystems.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZB48xAAKCRDdBJ7gKXxA
 js2rAP4zvcMn90vBJhWNElsA7pBgDYD66QCK6JBDHGe3J1qdeQEA8D606pjMBWkL
 ly7NifwCjOtFhfDRgEHOXu8g8g1k1QM=
 =Cswg
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder
  are for other subsystems"

* tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  mm: mmap: remove newline at the end of the trace
  mailmap: add entries for Richard Leitner
  kcsan: avoid passing -g for test
  kfence: avoid passing -g for test
  mm: kfence: fix using kfence_metadata without initialization in show_object()
  lib: dhry: fix unstable smp_processor_id(_) usage
  mailmap: add entry for Enric Balletbo i Serra
  mailmap: map Sai Prakash Ranjan's old address to his current one
  mailmap: map Rajendra Nayak's old address to his current one
  Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"
  mailmap: add entry for Tobias Klauser
  kasan, powerpc: don't rename memintrinsics if compiler adds prefixes
  mm/ksm: fix race with VMA iteration and mm_struct teardown
  kselftest: vm: fix unused variable warning
  mm: fix error handling for map_deny_write_exec
  mm: deduplicate error handling for map_deny_write_exec
  checksyscalls: ignore fstat to silence build warning on LoongArch
  nilfs2: fix kernel-infoleak in nilfs_ioctl_wrap_copy()
  test_maple_tree: add more testing for mas_empty_area()
  maple_tree: fix mas_skip_node() end slot detection
  ...
2023-03-24 18:06:11 -07:00
Linus Torvalds cb7f5b41f8 slab fix for 6.3-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmQduWQACgkQu+CwddJF
 iJrFXQf7BExyS3TCExTNSCXOR5KdHEkZvfbsWH4be9hngrbVbZhfbVLuxjYsfawS
 sWO8eobVy7EHXs2aaA1b9xrfkNIWy4Hy6HHaVNziVDi7inJq/mrAs/QJfMCJn1DM
 VXKG4KpP/mP568H9npehQGNPWa60epmiKlnD+sNLTYJGzYA7s2SJkqGGhP7qY7M9
 ceb8E/xQoM7kh6Z82wIscYP/uuaoWIVgdg/ww7BkvpKUF7i2w813GDencjvjF79y
 U/CdBLGkW32+UBPleH0XQfBhVQZWw7zR6URiBo02YId7aktzEFfUARJAxY/01+M8
 /ovR75329lfqsBjG9ErlD8kr3WbXRA==
 =+x78
 -----END PGP SIGNATURE-----

Merge tag 'slab-fix-for-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fix from Vlastimil Babka:
 "A single build fix for a corner case configuration that is apparently
  possible to achieve on some arches, from Geert"

* tag 'slab-fix-for-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: Fix undefined init_cache_node_node() for NUMA and !SMP
2023-03-24 10:12:14 -07:00
Marco Elver 2e08ca1802 kfence: avoid passing -g for test
Nathan reported that when building with GNU as and a version of clang that
defaults to DWARF5:

  $ make -skj"$(nproc)" ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- \
			LLVM=1 LLVM_IAS=0 O=build \
			mrproper allmodconfig mm/kfence/kfence_test.o
  /tmp/kfence_test-08a0a0.s: Assembler messages:
  /tmp/kfence_test-08a0a0.s:14627: Error: non-constant .uleb128 is not supported
  /tmp/kfence_test-08a0a0.s:14628: Error: non-constant .uleb128 is not supported
  /tmp/kfence_test-08a0a0.s:14632: Error: non-constant .uleb128 is not supported
  /tmp/kfence_test-08a0a0.s:14633: Error: non-constant .uleb128 is not supported
  /tmp/kfence_test-08a0a0.s:14639: Error: non-constant .uleb128 is not supported
  ...

This is because `-g` defaults to the compiler debug info default.  If the
assembler does not support some of the directives used, the above errors
occur.  To fix, remove the explicit passing of `-g`.

All the test wants is that stack traces print valid function names, and
debug info is not required for that.  (I currently cannot recall why I
added the explicit `-g`.)

Link: https://lkml.kernel.org/r/20230316224705.709984-1-elver@google.com
Fixes: bc8fbc5f30 ("kfence: add test suite")
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23 17:18:35 -07:00
Muchun Song 1c86a188e0 mm: kfence: fix using kfence_metadata without initialization in show_object()
The variable kfence_metadata is initialized in kfence_init_pool(), then,
it is not initialized if kfence is disabled after booting.  In this case,
kfence_metadata will be used (e.g.  ->lock and ->state fields) without
initialization when reading /sys/kernel/debug/kfence/objects.  There will
be a warning if you enable CONFIG_DEBUG_SPINLOCK.  Fix it by creating
debugfs files when necessary.

Link: https://lkml.kernel.org/r/20230315034441.44321-1-songmuchun@bytedance.com
Fixes: 0ce20dd840 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: Marco Elver <elver@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23 17:18:35 -07:00
Peter Collingbourne f446883d12 Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"
This reverts commit 487a32ec24.

should_skip_kasan_poison() reads the PG_skip_kasan_poison flag from
page->flags.  However, this line of code in free_pages_prepare():

	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;

clears most of page->flags, including PG_skip_kasan_poison, before calling
should_skip_kasan_poison(), which meant that it would never return true as
a result of the page flag being set.  Therefore, fix the code to call
should_skip_kasan_poison() before clearing the flags, as we were doing
before the reverted patch.

This fixes a measurable performance regression introduced in the reverted
commit, where munmap() takes longer than intended if HW tags KASAN is
supported and enabled at runtime.  Without this patch, we see a
single-digit percentage performance regression in a particular
mmap()-heavy benchmark when enabling HW tags KASAN, and with the patch,
there is no statistically significant performance impact when enabling HW
tags KASAN.

Link: https://lkml.kernel.org/r/20230310042914.3805818-2-pcc@google.com
Fixes: 487a32ec24 ("kasan: drop skip_kasan_poison variable in free_pages_prepare")
  Link: https://linux-review.googlesource.com/id/Ic4f13affeebd20548758438bb9ed9ca40e312b79
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>	[6.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23 17:18:34 -07:00
Liam R. Howlett 6db504ce55 mm/ksm: fix race with VMA iteration and mm_struct teardown
exit_mmap() will tear down the VMAs and maple tree with the mmap_lock held
in write mode.  Ensure that the maple tree is still valid by checking
ksm_test_exit() after taking the mmap_lock in read mode, but before the
for_each_vma() iterator dereferences a destroyed maple tree.

Since the maple tree is destroyed, the flags telling lockdep to check an
external lock has been cleared.  Skip the for_each_vma() iterator to avoid
dereferencing a maple tree without the external lock flag, which would
create a lockdep warning.

Link: https://lkml.kernel.org/r/20230308220310.3119196-1-Liam.Howlett@oracle.com
Fixes: a5f18ba072 ("mm/ksm: use vma iterators instead of vma linked list")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
  Link: https://lore.kernel.org/lkml/ZAdUUhSbaa6fHS36@xpf.sh.intel.com/
Reported-by: syzbot+2ee18845e89ae76342c5@syzkaller.appspotmail.com
  Link: https://syzkaller.appspot.com/bug?id=64a3e95957cd3deab99df7cd7b5a9475af92c93e
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <heng.su@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23 17:18:33 -07:00
Joey Gouly 3d27a95b1d mm: fix error handling for map_deny_write_exec
Commit 4a18419f71 ("mm/mprotect: use mmu_gather") changed 'goto out;' to
'break' in the loop.

This wasn't noticed while rebasing the MDWE patches, so fix it now.

Link: https://lkml.kernel.org/r/20230308190423.46491-3-joey.gouly@arm.com
Fixes: b507808ebc ("mm: implement memory-deny-write-execute as a prctl")
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Reported-by: Alexey Izbyshev <izbyshev@ispras.ru>
  Link: https://lore.kernel.org/linux-arm-kernel/8408d8901e9d7ee6b78db4c6cba04b78@ispras.ru/
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: nd <nd@arm.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23 17:18:33 -07:00
Joey Gouly 6bbf109067 mm: deduplicate error handling for map_deny_write_exec
Patch series "Fixes for MDWE prctl"

These are four small fixes for the recent memory-write-deny-execute prctl
patches [1].  Two reported by Alexey about error handling and two tooling
fixes by Peter.


This patch (of 4):

Commit cc8d1b097d ("mmap: clean up mmap_region() unrolling")
deduplicated the error handling, do the same for the return value of
`map_deny_write_exec`.

Link: https://lkml.kernel.org/r/20230308190423.46491-1-joey.gouly@arm.com
Link: https://lkml.kernel.org/r/20230308190423.46491-2-joey.gouly@arm.com
Link: https://lore.kernel.org/linux-arm-kernel/20230119160344.54358-1-joey.gouly@arm.com/ [1]
Fixes: b507808ebc ("mm: implement memory-deny-write-execute as a prctl")
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Reported-by: Alexey Izbyshev <izbyshev@ispras.ru>
  Link: https://lore.kernel.org/linux-arm-kernel/8408d8901e9d7ee6b78db4c6cba04b78@ispras.ru/
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: nd <nd@arm.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23 17:18:32 -07:00
Michal Hocko e9c3cda4d8 mm, vmalloc: fix high order __GFP_NOFAIL allocations
Gao Xiang has reported that the page allocator complains about high order
__GFP_NOFAIL request coming from the vmalloc core:

 __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5549
 alloc_pages+0x1aa/0x270 mm/mempolicy.c:2286
 vm_area_alloc_pages mm/vmalloc.c:2989 [inline]
 __vmalloc_area_node mm/vmalloc.c:3057 [inline]
 __vmalloc_node_range+0x978/0x13c0 mm/vmalloc.c:3227
 kvmalloc_node+0x156/0x1a0 mm/util.c:606
 kvmalloc include/linux/slab.h:737 [inline]
 kvmalloc_array include/linux/slab.h:755 [inline]
 kvcalloc include/linux/slab.h:760 [inline]

it seems that I have completely missed high order allocation backing
vmalloc areas case when implementing __GFP_NOFAIL support.  This means
that [k]vmalloc at al.  can allocate higher order allocations with
__GFP_NOFAIL which can trigger OOM killer for non-costly orders easily or
cause a lot of reclaim/compaction activity if those requests cannot be
satisfied.

Fix the issue by falling back to zero order allocations for __GFP_NOFAIL
requests if the high order request fails.

Link: https://lkml.kernel.org/r/ZAXynvdNqcI0f6Us@dhcp22.suse.cz
Fixes: 9376130c39 ("mm/vmalloc: add support for __GFP_NOFAIL")
Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
  Link: https://lkml.kernel.org/r/20230305053035.1911-1-hsiangkao@linux.alibaba.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-23 17:18:31 -07:00
Geert Uytterhoeven 66a1c22b70 mm/slab: Fix undefined init_cache_node_node() for NUMA and !SMP
sh/migor_defconfig:

    mm/slab.c: In function ‘slab_memory_callback’:
    mm/slab.c:1127:23: error: implicit declaration of function ‘init_cache_node_node’; did you mean ‘drain_cache_node_node’? [-Werror=implicit-function-declaration]
     1127 |                 ret = init_cache_node_node(nid);
	  |                       ^~~~~~~~~~~~~~~~~~~~
	  |                       drain_cache_node_node

The #ifdef condition protecting the definition of init_cache_node_node()
no longer matches the conditions protecting the (multiple) users.

Fix this by syncing the conditions.

Fixes: 76af6a054d ("mm/migrate: add CPU hotplug to demotion #ifdef")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/b5bdea22-ed2f-3187-6efe-0c72330270a4@infradead.org
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-03-22 12:11:43 +01:00
SeongJae Park dd52a61da0 mm/damon/paddr: fix folio_nr_pages() after folio_put() in damon_pa_mark_accessed_or_deactivate()
damon_pa_mark_accessed_or_deactivate() is accessing a folio via
folio_nr_pages() after folio_put() for the folio has invoked.  Fix it.

Link: https://lkml.kernel.org/r/20230304193949.296391-3-sj@kernel.org
Fixes: f70da5ee8f ("mm/damon: convert damon_pa_mark_accessed_or_deactivate() to use folios")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07 17:04:55 -08:00
SeongJae Park 751688b8be mm/damon/paddr: fix folio_size() call after folio_put() in damon_pa_young()
Patch series "mm/damon/paddr: Fix folio-use-after-put bugs".

There are two folio accesses after folio_put() in mm/damon/paddr.c file. 
Fix those.


This patch (of 2):

damon_pa_young() is accessing a folio via folio_size() after folio_put()
for the folio has invoked.  Fix it.

Link: https://lkml.kernel.org/r/20230304193949.296391-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20230304193949.296391-2-sj@kernel.org
Fixes: 397b0c3a58 ("mm/damon/paddr: remove folio_sz field from damon_pa_access_chk_result")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: <stable@vger.kernel.org>	[6.2.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07 17:04:55 -08:00
Huang Ying 2ef7dbb269 migrate_pages: try migrate in batch asynchronously firstly
When we have locked more than one folios, we cannot wait the lock or bit
(e.g., page lock, buffer head lock, writeback bit) synchronously. 
Otherwise deadlock may be triggered.  This make it hard to batch the
synchronous migration directly.

This patch re-enables batching synchronous migration via trying to migrate
in batch asynchronously firstly.  And any folios that are failed to be
migrated asynchronously will be migrated synchronously one by one.

Test shows that this can restore the TLB flushing batching performance for
synchronous migration effectively.

Link: https://lkml.kernel.org/r/20230303030155.160983-4-ying.huang@intel.com
Fixes: 5dfab109d5 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07 17:04:54 -08:00
Huang Ying a21d213321 migrate_pages: move split folios processing out of migrate_pages_batch()
To simplify the code logic and reduce the line number.

Link: https://lkml.kernel.org/r/20230303030155.160983-3-ying.huang@intel.com
Fixes: 5dfab109d5 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07 17:04:54 -08:00
Huang Ying fb3592c41a migrate_pages: fix deadlock in batched migration
Patch series "migrate_pages: fix deadlock in batched synchronous
migration", v2.

Two deadlock bugs were reported for the migrate_pages() batching series. 
Thanks Hugh and Pengfei.  Analysis shows that if we have locked some other
folios except the one we are migrating, it's not safe in general to wait
synchronously, for example, to wait the writeback to complete or wait to
lock the buffer head.

So 1/3 fixes the deadlock in a simple way, where the batching support for
the synchronous migration is disabled.  The change is straightforward and
easy to be understood.  While 3/3 re-introduce the batching for
synchronous migration via trying to migrate asynchronously in batch
optimistically, then fall back to migrate synchronously one by one for
fail-to-migrate folios.  Test shows that this can restore the TLB flushing
batching performance for synchronous migration effectively.


This patch (of 3):

Two deadlock bugs were reported for the migrate_pages() batching series. 
Thanks Hugh and Pengfei!  For example, in the following deadlock trace
snippet,

 INFO: task kworker/u4:0:9 blocked for more than 147 seconds.
       Not tainted 6.2.0-rc4-kvm+ #1314
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u4:0    state:D stack:0     pid:9     ppid:2      flags:0x00004000
 Workqueue: loop4 loop_rootcg_workfn
 Call Trace:
  <TASK>
  __schedule+0x43b/0xd00
  schedule+0x6a/0xf0
  io_schedule+0x4a/0x80
  folio_wait_bit_common+0x1b5/0x4e0
  ? __pfx_wake_page_function+0x10/0x10
  __filemap_get_folio+0x73d/0x770
  shmem_get_folio_gfp+0x1fd/0xc80
  shmem_write_begin+0x91/0x220
  generic_perform_write+0x10e/0x2e0
  __generic_file_write_iter+0x17e/0x290
  ? generic_write_checks+0x12b/0x1a0
  generic_file_write_iter+0x97/0x180
  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
  do_iter_readv_writev+0x13c/0x210
  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
  do_iter_write+0xf6/0x330
  vfs_iter_write+0x46/0x70
  loop_process_work+0x723/0xfe0
  loop_rootcg_workfn+0x28/0x40
  process_one_work+0x3cc/0x8d0
  worker_thread+0x66/0x630
  ? __pfx_worker_thread+0x10/0x10
  kthread+0x153/0x190
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x29/0x50
  </TASK>

 INFO: task repro:1023 blocked for more than 147 seconds.
       Not tainted 6.2.0-rc4-kvm+ #1314
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:repro           state:D stack:0     pid:1023  ppid:360    flags:0x00004004
 Call Trace:
  <TASK>
  __schedule+0x43b/0xd00
  schedule+0x6a/0xf0
  io_schedule+0x4a/0x80
  folio_wait_bit_common+0x1b5/0x4e0
  ? compaction_alloc+0x77/0x1150
  ? __pfx_wake_page_function+0x10/0x10
  folio_wait_bit+0x30/0x40
  folio_wait_writeback+0x2e/0x1e0
  migrate_pages_batch+0x555/0x1ac0
  ? __pfx_compaction_alloc+0x10/0x10
  ? __pfx_compaction_free+0x10/0x10
  ? __this_cpu_preempt_check+0x17/0x20
  ? lock_is_held_type+0xe6/0x140
  migrate_pages+0x100e/0x1180
  ? __pfx_compaction_free+0x10/0x10
  ? __pfx_compaction_alloc+0x10/0x10
  compact_zone+0xe10/0x1b50
  ? lock_is_held_type+0xe6/0x140
  ? check_preemption_disabled+0x80/0xf0
  compact_node+0xa3/0x100
  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
  ? _find_first_bit+0x7b/0x90
  sysctl_compaction_handler+0x5d/0xb0
  proc_sys_call_handler+0x29d/0x420
  proc_sys_write+0x2b/0x40
  vfs_write+0x3a3/0x780
  ksys_write+0xb7/0x180
  __x64_sys_write+0x26/0x30
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x72/0xdc
 RIP: 0033:0x7f3a2471f59d
 RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
 RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010
 R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0
 R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000
  </TASK>

The page migration task has held the lock of the shmem folio A, and is
waiting the writeback of the folio B of the file system on the loop block
device to complete.  While the loop worker task which writes back the
folio B is waiting to lock the shmem folio A, because the folio A backs
the folio B in the loop device.  Thus deadlock is triggered.

In general, if we have locked some other folios except the one we are
migrating, it's not safe to wait synchronously, for example, to wait the
writeback to complete or wait to lock the buffer head.

To fix the deadlock, in this patch, we avoid to batch the page migration
except for MIGRATE_ASYNC mode.  In MIGRATE_ASYNC mode, synchronous waiting
is avoided.

The fix can be improved further.  We will do that as soon as possible.

Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com
Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/
Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com
Fixes: 5dfab109d5 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07 17:04:54 -08:00
David Hildenbrand 42b2af2c9b mm/userfaultfd: propagate uffd-wp bit when PTE-mapping the huge zeropage
Currently, we'd lose the userfaultfd-wp marker when PTE-mapping a huge
zeropage, resulting in the next write faults in the PMD range not
triggering uffd-wp events.

Various actions (partial MADV_DONTNEED, partial mremap, partial munmap,
partial mprotect) could trigger this.  However, most importantly,
un-protecting a single sub-page from the userfaultfd-wp handler when
processing a uffd-wp event will PTE-map the shared huge zeropage and lose
the uffd-wp bit for the remainder of the PMD.

Let's properly propagate the uffd-wp bit to the PMDs.

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <stdbool.h>
 #include <inttypes.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <errno.h>
 #include <poll.h>
 #include <pthread.h>
 #include <sys/mman.h>
 #include <sys/syscall.h>
 #include <sys/ioctl.h>
 #include <linux/userfaultfd.h>

 static size_t pagesize;
 static int uffd;
 static volatile bool uffd_triggered;

 #define barrier() __asm__ __volatile__("": : :"memory")

 static void uffd_wp_range(char *start, size_t size, bool wp)
 {
 	struct uffdio_writeprotect uffd_writeprotect;

 	uffd_writeprotect.range.start = (unsigned long) start;
 	uffd_writeprotect.range.len = size;
 	if (wp) {
 		uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
 	} else {
 		uffd_writeprotect.mode = 0;
 	}
 	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
 		fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
 		exit(1);
 	}
 }

 static void *uffd_thread_fn(void *arg)
 {
 	static struct uffd_msg msg;
 	ssize_t nread;

 	while (1) {
 		struct pollfd pollfd;
 		int nready;

 		pollfd.fd = uffd;
 		pollfd.events = POLLIN;
 		nready = poll(&pollfd, 1, -1);
 		if (nready == -1) {
 			fprintf(stderr, "poll() failed: %d\n", errno);
 			exit(1);
 		}

 		nread = read(uffd, &msg, sizeof(msg));
 		if (nread <= 0)
 			continue;

 		if (msg.event != UFFD_EVENT_PAGEFAULT ||
 		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
 			printf("FAIL: wrong uffd-wp event fired\n");
 			exit(1);
 		}

 		/* un-protect the single page. */
 		uffd_triggered = true;
 		uffd_wp_range((char *)(uintptr_t)msg.arg.pagefault.address,
 			      pagesize, false);
 	}
 	return arg;
 }

 static int setup_uffd(char *map, size_t size)
 {
 	struct uffdio_api uffdio_api;
 	struct uffdio_register uffdio_register;
 	pthread_t thread;

 	uffd = syscall(__NR_userfaultfd,
 		       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
 	if (uffd < 0) {
 		fprintf(stderr, "syscall() failed: %d\n", errno);
 		return -errno;
 	}

 	uffdio_api.api = UFFD_API;
 	uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
 	if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
 		fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
 		return -errno;
 	}

 	if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
 		fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
 		return -ENOSYS;
 	}

 	uffdio_register.range.start = (unsigned long) map;
 	uffdio_register.range.len = size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
 		fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
 		return -errno;
 	}

 	pthread_create(&thread, NULL, uffd_thread_fn, NULL);

 	return 0;
 }

 int main(void)
 {
 	const size_t size = 4 * 1024 * 1024ull;
 	char *map, *cur;

 	pagesize = getpagesize();

 	map = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
 	if (map == MAP_FAILED) {
 		fprintf(stderr, "mmap() failed\n");
 		return -errno;
 	}

 	if (madvise(map, size, MADV_HUGEPAGE)) {
 		fprintf(stderr, "MADV_HUGEPAGE failed\n");
 		return -errno;
 	}

 	if (setup_uffd(map, size))
 		return 1;

 	/* Read the whole range, populating zeropages. */
 	madvise(map, size, MADV_POPULATE_READ);

 	/* Write-protect the whole range. */
 	uffd_wp_range(map, size, true);

 	/* Make sure uffd-wp triggers on each page. */
 	for (cur = map; cur < map + size; cur += pagesize) {
 		uffd_triggered = false;

 		barrier();
 		/* Trigger a write fault. */
 		*cur = 1;
 		barrier();

 		if (!uffd_triggered) {
 			printf("FAIL: uffd-wp did not trigger\n");
 			return 1;
 		}
 	}

 	printf("PASS: uffd-wp triggered\n");
 	return 0;
 }

Link: https://lkml.kernel.org/r/20230302175423.589164-1-david@redhat.com
Fixes: e06f1e1dd4 ("userfaultfd: wp: enabled write protection in userfaultfd API")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07 17:04:53 -08:00
James Houghton 63cf584203 mm: teach mincore_hugetlb about pte markers
By checking huge_pte_none(), we incorrectly classify PTE markers as
"present".  Instead, check huge_pte_none_mostly(), classifying PTE markers
the same as if the PTE were completely blank.

PTE markers, unlike other kinds of swap entries, don't reference any
physical page and don't indicate that a physical page was mapped
previously.  As such, treat them as non-present for the sake of mincore().

Link: https://lkml.kernel.org/r/20230302222404.175303-1-jthoughton@google.com
Fixes: 5c041f5d1f ("mm: teach core mm about pte markers")
Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07 17:04:53 -08:00
Linus Torvalds e77d587a2c mm: avoid gcc complaint about pointer casting
The migration code ends up temporarily stashing information of the wrong
type in unused fields of the newly allocated destination folio.  That
all works fine, but gcc does complain about the pointer type mis-use:

    mm/migrate.c: In function ‘__migrate_folio_extract’:
    mm/migrate.c:1050:20: note: randstruct: casting between randomized structure pointer types (ssa): ‘struct anon_vma’ and ‘struct address_space’

     1050 |         *anon_vmap = (void *)dst->mapping;
          |         ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~

and gcc is actually right to complain since it really doesn't understand
that this is a very temporary special case where this is ok.

This could be fixed in different ways by just obfuscating the assignment
sufficiently that gcc doesn't see what is going on, but the truly
"proper C" way to do this is by explicitly using a union.

Using unions for type conversions like this is normally hugely ugly and
syntactically nasty, but this really is one of the few cases where we
want to make it clear that we're not doing type conversion, we're really
re-using the value bit-for-bit just using another type.

IOW, this should not become a common pattern, but in this one case using
that odd union is probably the best way to document to the compiler what
is conceptually going on here.

[ Side note: there are valid cases where we convert pointers to other
  pointer types, notably the whole "folio vs page" situation, where the
  types actually have fundamental commonalities.

  The fact that the gcc note is limited to just randomized structures
  means that we don't see equivalent warnings for those cases, but it
  migth also mean that we miss other cases where we do play these kinds
  of dodgy games, and this kind of explicit conversion might be a good
  idea. ]

I verified that at least for an allmodconfig build on x86-64, this
generates the exact same code, apart from line numbers and assembler
comment changes.

Fixes: 64c8902ed4 ("migrate_pages: split unmap_and_move() to _unmap() and _move()")
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-04 14:03:27 -08:00
Linus Torvalds 20fdfd55ab 17 hotfixes. Eight are for MM and seven are for other parts of the
kernel.  Seven are cc:stable and eight address post-6.3 issues or were
 judged unsuitable for -stable backporting.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZAO0bAAKCRDdBJ7gKXxA
 jo73AP0Sbgd+E0u5Hs+aACHW28FpxleVRdyexc5chXD5QsyLKgEAwjntE7jfHHYK
 GkUKsoWQJblgjm3ksRxdLbVkDSQ8sQE=
 =CQ0B
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-03-04-13-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "17 hotfixes.

  Eight are for MM and seven are for other parts of the kernel. Seven
  are cc:stable and eight address post-6.3 issues or were judged
  unsuitable for -stable backporting"

* tag 'mm-hotfixes-stable-2023-03-04-13-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: map Dikshita Agarwal's old address to his current one
  mailmap: map Vikash Garodia's old address to his current one
  fs/cramfs/inode.c: initialize file_ra_state
  fs: hfsplus: fix UAF issue in hfsplus_put_super
  panic: fix the panic_print NMI backtrace setting
  lib: parser: update documentation for match_NUMBER functions
  kasan, x86: don't rename memintrinsics in uninstrumented files
  kasan: test: fix test for new meminstrinsic instrumentation
  kasan: treat meminstrinsic as builtins in uninstrumented files
  kasan: emit different calls for instrumentable memintrinsics
  ocfs2: fix non-auto defrag path not working issue
  ocfs2: fix defrag path triggering jbd2 ASSERT
  mailmap: map Georgi Djakov's old Linaro address to his current one
  mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON
  lib/zlib: DFLTCC deflate does not write all available bits for Z_NO_FLUSH
  mm/damon/paddr: fix missing folio_put()
  mm/mremap: fix dup_anon_vma() in vma_merge() case 4
2023-03-04 13:32:50 -08:00
Marco Elver 85f195b12d kasan: test: fix test for new meminstrinsic instrumentation
The tests for memset/memmove have been failing since they haven't been
instrumented in 69d4c0d321.

Fix the test to recognize when memintrinsics aren't instrumented, and skip
test cases accordingly.  We also need to conditionally pass -fno-builtin
to the test, otherwise the instrumentation pass won't recognize
memintrinsics and end up not instrumenting them either.

Link: https://lkml.kernel.org/r/20230224085942.1791837-3-elver@google.com
Fixes: 69d4c0d321 ("entry, kasan, x86: Disallow overriding mem*() functions")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-02 21:54:22 -08:00
Marco Elver 36be5cba99 kasan: treat meminstrinsic as builtins in uninstrumented files
Where the compiler instruments meminstrinsics by generating calls to
__asan/__hwasan_ prefixed functions, let the compiler consider
memintrinsics as builtin again.

To do so, never override memset/memmove/memcpy if the compiler does the
correct instrumentation - even on !GENERIC_ENTRY architectures.

[elver@google.com: powerpc: don't rename memintrinsics if compiler adds prefixes]
  Link: https://lore.kernel.org/all/20230224085942.1791837-1-elver@google.com/ [1]
  Link: https://lkml.kernel.org/r/20230227094726.3833247-1-elver@google.com
Link: https://lkml.kernel.org/r/20230224085942.1791837-2-elver@google.com
Fixes: 69d4c0d321 ("entry, kasan, x86: Disallow overriding mem*() functions")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-02 21:54:22 -08:00
Marco Elver 51287dcb00 kasan: emit different calls for instrumentable memintrinsics
Clang 15 provides an option to prefix memcpy/memset/memmove calls with
__asan_/__hwasan_ in instrumented functions:
https://reviews.llvm.org/D122724

GCC will add support in future:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108777

Use it to regain KASAN instrumentation of memcpy/memset/memmove on
architectures that require noinstr to be really free from instrumented
mem*() functions (all GENERIC_ENTRY architectures).

Link: https://lkml.kernel.org/r/20230224085942.1791837-1-elver@google.com
Fixes: 69d4c0d321 ("entry, kasan, x86: Disallow overriding mem*() functions")
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: kasan-dev@googlegroups.com
Cc: Kees Cook <keescook@chromium.org>
Cc: Linux Kernel Functional Testing <lkft@linaro.org>
Cc: Nathan Chancellor <nathan@kernel.org> # build only
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-02 21:54:22 -08:00
Naoya Horiguchi 6da6b1d4a7 mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON
After a memory error happens on a clean folio, a process unexpectedly
receives SIGBUS when it accesses the error page.  This SIGBUS killing is
pointless and simply degrades the level of RAS of the system, because the
clean folio can be dropped without any data lost on memory error handling
as we do for a clean pagecache.

When memory_failure() is called on a clean folio, try_to_unmap() is called
twice (one from split_huge_page() and one from hwpoison_user_mappings()). 
The root cause of the issue is that pte conversion to hwpoisoned entry is
now done in the first call of try_to_unmap() because PageHWPoison is
already set at this point, while it's actually expected to be done in the
second call.  This behavior disturbs the error handling operation like
removing pagecache, which results in the malfunction described above.

So convert TTU_IGNORE_HWPOISON into TTU_HWPOISON and set TTU_HWPOISON only
when we really intend to convert pte to hwpoison entry.  This can prevent
other callers of try_to_unmap() from accidentally converting to hwpoison
entries.

Link: https://lkml.kernel.org/r/20230221085905.1465385-1-naoya.horiguchi@linux.dev
Fixes: a42634a6c0 ("readahead: Use a folio in read_pages()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-27 17:00:14 -08:00
andrew.yang 3f98c9a62c mm/damon/paddr: fix missing folio_put()
damon_get_folio() would always increase folio _refcount and
folio_isolate_lru() would increase folio _refcount if the folio's lru flag
is set.

If an unevictable folio isolated successfully, there will be two more
_refcount.  The one from folio_isolate_lru() will be decreased in
folio_puback_lru(), but the other one from damon_get_folio() will be left
behind.  This causes a pin page.

Whatever the case, the _refcount from damon_get_folio() should be
decreased.

Link: https://lkml.kernel.org/r/20230222064223.6735-1-andrew.yang@mediatek.com
Fixes: 57223ac295 ("mm/damon/paddr: support the pageout scheme")
Signed-off-by: andrew.yang <andrew.yang@mediatek.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>	[5.16.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-27 17:00:14 -08:00
Vlastimil Babka 4c67599678 mm/mremap: fix dup_anon_vma() in vma_merge() case 4
In case 4, we are shrinking 'prev' (PPPP in the comment) and expanding
'mid' (NNNN).  So we need to make sure 'mid' clones the anon_vma from
'prev', if it doesn't have any.  After commit 0503ea8f5b ("mm/mmap:
remove __vma_adjust()") we can fail to do that due to wrong parameters for
dup_anon_vma().  The call is a no-op because res == next, adjust == mid
and mid == next.  Fix it.

Link: https://lkml.kernel.org/r/ad91d62b-37eb-4b73-707a-3c45c9e16256@suse.cz
Fixes: 0503ea8f5b ("mm/mmap: remove __vma_adjust()")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-27 17:00:14 -08:00
Linus Torvalds 5a6d92493b memblock: small optimizations
* fix off-by-one in the check whether memblock_add_range() should
   reallocate memory to accommodate newly inserted range
 * check only for relevant regions in memblock_merge_regions() rather than
   swipe over the entire array
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmP8cSMQHHJwcHRAa2Vy
 bmVsLm9yZwAKCRA5A4Ymyw79kZvWCACv2Zq98ngzyporEXiB7nxvzy2nujdPwJDO
 AivRFIqpq6yD+RRYPu6C2ynvJ1o4+k5E3GTyNrecqhqY64y9zWkg0T+G3LBiVZRR
 IT2qIPMqqNsV6DCTSo2og7oTSw/rD3b1jKZvWXY0ulUzvXvNCsddrYD0uQQZ4cIp
 Mt4zt5SDpeJMDuuKrT31xeeMNAnUOfkxzresUq1iWNcqIS/bKQgPl+pwLAmgX4Th
 gVSxPChMO+ZNSUPp0znSt6jySCQKQygW5xfgRpaJJcYiiNlGWZvZRePMVYXmcwlt
 LrBuKZs8kpcJpSierBKwS7zz1Ctwa0lDEw8JZAoajFd8hmsVx1YV
 =Ky/Y
 -----END PGP SIGNATURE-----

Merge tag 'memblock-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:
 "Small optimizations:

   - fix off-by-one in the check whether memblock_add_range() should
     reallocate memory to accommodate newly inserted range

   - check only for relevant regions in memblock_merge_regions() rather
     than swipe over the entire array"

* tag 'memblock-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock: Avoid useless checks in memblock_merge_regions().
  memblock: Make a boundary tighter in memblock_add_range().
2023-02-27 09:34:53 -08:00
Liam R. Howlett 2fcd07b7cc mm/mprotect: Fix successful vma_merge() of next in do_mprotect_pkey()
If mprotect_fixup() successfully calls vma_merge() and replaces vma and
the next vma, then the tmp variable in the do_mprotect_pkey() is not
updated to point to the new vma end.  This results in the loop detecting
a gap between VMAs that does not exist.

Fix the faulty value of tmp by setting it to the end location of the vma
iterator at the end of the loop.

Link: https://lkml.kernel.org/r/20230224212055.1786100-1-Liam.Howlett@oracle.com
Fixes: 2286a6914c ("mm: change mprotect_fixup to vma iterator")
Link: https://lore.kernel.org/linux-mm/20230223120407.729110a6ecd1416ac59d9cb0@linux-foundation.org/
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Bert Karwatzki <spasswolf@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217061
Tested-by: Bert Karwatzki <spasswolf@web.de>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-mm/CAHk-=wjFmVL7NiuxL54qLkoabni_yD-oF9=dpDgETtdsiCbhUg@mail.gmail.com/
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-25 20:16:48 -08:00
Linus Torvalds 3822a7c409 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X bit.
 
 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.
 
 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes
 
 - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
   does perform some memcg maintenance and cleanup work.
 
 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".  These filters provide users
   with finer-grained control over DAMOS's actions.  SeongJae has also done
   some DAMON cleanup work.
 
 - Kairui Song adds a series ("Clean up and fixes for swap").
 
 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".
 
 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series.  It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.
 
 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".
 
 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".
 
 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".
 
 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".
 
 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series "mm:
   support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
   PTEs".
 
 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".
 
 - Sergey Senozhatsky has improved zsmalloc's memory utilization with his
   series "zsmalloc: make zspage chain size configurable".
 
 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.  The previous BPF-based approach had
   shortcomings.  See "mm: In-kernel support for memory-deny-write-execute
   (MDWE)".
 
 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
 
 - T.J.  Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".
 
 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a per-node
   basis.  See the series "Introduce per NUMA node memory error
   statistics".
 
 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage during
   compaction".
 
 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".
 
 - Christoph Hellwig has removed block_device_operations.rw_page() in ths
   series "remove ->rw_page".
 
 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".
 
 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier functions".
 
 - Some pagemap cleanup and generalization work in Mike Rapoport's series
   "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
   "fixups for generic implementation of pfn_valid()"
 
 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
 
 - Jason Gunthorpe rationalized the GUP system's interface to the rest of
   the kernel in the series "Simplify the external interface for GUP".
 
 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface.  To support this, we'll temporarily be
   printing warnings when people use the debugfs interface.  See the series
   "mm/damon: deprecate DAMON debugfs interface".
 
 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.
 
 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".
 
 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
 jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
 DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
 =MlGs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch serieses from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which does perform some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   ths series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()"

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
2023-02-23 17:09:35 -08:00