- Bruce steps down as NFSD maintainer
- Prepare for dynamic nfsd thread management
- More work on supporting re-exporting NFS mounts
- One fs/locks patch on behalf of Jeff Layton
Notable bug fixes:
- Fix zero-length NFSv3 WRITEs
- Fix directory cinfo on FS's that do not support iversion
- Fix WRITE verifiers for stable writes
- Fix crash on COPY_NOTIFY with a special state ID
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmHcWOMACgkQM2qzM29m
f5dh0Q/+MjEL0IK551FdChx9Es1JqKRggv9KwJkLIoa1bw/PMSwP2pnKz6eL0Yun
mdhE9AZQgyFH1IAGdqjeLZKIYRin6bvAdDrnlqQ9SvTviPLWniSUI6AuyUqK6Zyk
wMcXpyOze0fhpxkYmz8/g7i66w967tmLh5MRvV1dkpOYAe99rYwGhvj+9ZeEWfNI
TgmptntMG6YEb+xY0E73otXZHMr2DL67ZYvOUYWemJA1uxcX4joaWBg8sx74dB6k
DUB4BFuoURk6viDD1QYh3qPU3dz9RCJNMz/cWd8+2t7BdaujTSXRIcaFslrQnKfL
Rm+O7pi5W+XohFDjeuMZ1g0c1ot/aoZSaAz00LoCVhejJ/sK9NiPAN1+LyY91Lja
cUBMVPNfW7ClIpiZcORP/chNmVn2qlaL2nxzSY/Uegnd5pIIeVD0pFVgx4+NlEat
mbrrQBcMpBRM0B+RzHS6AusqHrGdSEcwqWoVXWdxsBigJQT/AxWmii3U88k0Z54i
ooMWLaQ9EBBmygV01JN/OBySW2M/dvbfz3eFROvAVqsIP9JWP3FlUOlRDl8GcjXA
azi9fTysBom7WtL6NPcxDJbJ2t9hYr2YaztTpdo9YCHOuQbSQT6IWR5PAa3zvwMu
Bfz6Y8Hoo/KZHCqmkPGYM+x1ENCyDPv788E+erdnw1PFP5F3Pbo=
=/kX3
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"Bruce has announced he is leaving Red Hat at the end of the month and
is stepping back from his role as NFSD co-maintainer. As a result,
this includes a patch removing him from the MAINTAINERS file.
There is one patch in here that Jeff Layton was carrying in the locks
tree. Since he had only one for this cycle, he asked us to send it to
you via the nfsd tree.
There continues to be 0-day reports from Robert Morris @MIT. This time
we include a fix for a crash in the COPY_NOTIFY operation.
Highlights:
- Bruce steps down as NFSD maintainer
- Prepare for dynamic nfsd thread management
- More work on supporting re-exporting NFS mounts
- One fs/locks patch on behalf of Jeff Layton
Notable bug fixes:
- Fix zero-length NFSv3 WRITEs
- Fix directory cinfo on FS's that do not support iversion
- Fix WRITE verifiers for stable writes
- Fix crash on COPY_NOTIFY with a special state ID"
* tag 'nfsd-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (51 commits)
SUNRPC: Fix sockaddr handling in svcsock_accept_class trace points
SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point
fs/locks: fix fcntl_getlk64/fcntl_setlk64 stub prototypes
nfsd: fix crash on COPY_NOTIFY with special stateid
MAINTAINERS: remove bfields
NFSD: Move fill_pre_wcc() and fill_post_wcc()
Revert "nfsd: skip some unnecessary stats in the v4 case"
NFSD: Trace boot verifier resets
NFSD: Rename boot verifier functions
NFSD: Clean up the nfsd_net::nfssvc_boot field
NFSD: Write verifier might go backwards
nfsd: Add a tracepoint for errors in nfsd4_clone_file_range()
NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id)
NFSD: De-duplicate net_generic(SVC_NET(rqstp), nfsd_net_id)
NFSD: Clean up nfsd_vfs_write()
nfsd: Replace use of rwsem with errseq_t
NFSD: Fix verifier returned in stable WRITEs
nfsd: Retry once in nfsd_open on an -EOPENSTALE return
nfsd: Add errno mapping for EREMOTEIO
nfsd: map EBADF
...
- fix possible uninitialized memory usage for setattr
- fix fscache reading hole in a file just after it's been grown
- split net/9p/trans_fd.c in its own module like other transports
that module defaults to 9P_NET and is autoloaded if required so
users should not be impacted
- add Christian Schoenebeck to 9p reviewers
- some more trivial cleanup
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmHgt18ACgkQq06b7GqY
5nAdUA//ZHFTIcnbTiVkqbw0YztxedTOhdGCWcYsszux0pNQ/ZCjbP5NxESWiFxs
raYWcvE98Xvq4fbs8b+m7YKBJzWF+Km/v1MKfgvZZWFLZR3MfWiL8iojlsaUgG9U
rBmEUyaTWw2COIvFN7EqnwT5mwzNCli5d3AzaAmgffWsHEi/+EVU35YX70ySUkjW
nVf08oX3dBB425oArXOZApOZSVRsUr5YDSuQGiFHBL+hvPTrvPumu/AXbjLbTbmf
NuUxu1Akw8/jhWNATLHzjCdzJzBSfF0zRs+oH9Qt8MKkUrlTfjxc/2MbQYL1p1IN
84XxiG02ebw+Mx05cydP9/Ll7gOo2p6ORGXMHcoPIAMy6zFiCKo3TRdKDgmYDauC
K8c3v+osNwl2GFn1+2XoQIWnqXb7bJ1debkFtWVEGOPd/Yr6CYCZyfiZamoPJXUO
TjkoLAJXJGKlV66ZCuVeAsySdUxdtwbj2WxTliDUWtxvQ+m1jt06S3f/TbA7a9bB
8aBhx7FKzQ/UL8zhm4+3WLPlWkoSUXhFSZZUknvhuaHFV/S1xglXox5OAQJrJMy+
qOzmOjCg14T1TC2WkJlEsedLUH8ACU+XueAPT6uqSdA4rvQEVzk32zTm/GfFu3Q3
0RhcGlW5kAAYBTBLXvmKsjyW2OmPCSycgbWCu3E8A/1gubbRE40=
=o2Lc
-----END PGP SIGNATURE-----
Merge tag '9p-for-5.17-rc1' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Fixes, split 9p_net_fd, and new reviewer:
- fix possible uninitialized memory usage for setattr
- fix fscache reading hole in a file just after it's been grown
- split net/9p/trans_fd.c in its own module like other transports.
The new transport module defaults to 9P_NET and is autoloaded if
required so users should not be impacted
- add Christian Schoenebeck to 9p reviewers
- some more trivial cleanup"
* tag '9p-for-5.17-rc1' of git://github.com/martinetd/linux:
9p: fix enodata when reading growing file
net/9p: show error message if user 'msize' cannot be satisfied
MAINTAINERS: 9p: add Christian Schoenebeck as reviewer
9p: only copy valid iattrs in 9P2000.L setattr implementation
9p: Use BUG_ON instead of if condition followed by BUG.
net/p9: load default transports
9p/xen: autoload when xenbus service is available
9p/trans_fd: split into dedicated module
fs: 9p: remove unneeded variable
9p/trans_virtio: Fix typo in the comment for p9_virtio_create()
Merge misc updates from Andrew Morton:
"146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
mm/damon: hide kernel pointer from tracepoint event
mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
mm/damon/dbgfs: remove an unnecessary variable
mm/damon: move the implementation of damon_insert_region to damon.h
mm/damon: add access checking for hugetlb pages
Docs/admin-guide/mm/damon/usage: update for schemes statistics
mm/damon/dbgfs: support all DAMOS stats
Docs/admin-guide/mm/damon/reclaim: document statistics parameters
mm/damon/reclaim: provide reclamation statistics
mm/damon/schemes: account how many times quota limit has exceeded
mm/damon/schemes: account scheme actions that successfully applied
mm/damon: remove a mistakenly added comment for a future feature
Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
Docs/admin-guide/mm/damon/usage: remove redundant information
Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
mm/damon: convert macro functions to static inline functions
mm/damon: modify damon_rand() macro to static inline function
mm/damon: move damon_rand() definition into damon.h
...
Pass "end - 1" instead of "end" when walking the interval tree in
hugetlb_vmdelete_list() to fix an inclusive vs. exclusive bug. The two
callers that pass a non-zero "end" treat it as exclusive, whereas the
interval tree iterator expects an inclusive "last". E.g. punching a
hole in a file that precisely matches the size of a single hugepage,
with a vma starting right on the boundary, will result in
unmap_hugepage_range() being called twice, with the second call having
start==end.
The off-by-one error doesn't cause functional problems as
__unmap_hugepage_range() turns into a massive nop due to
short-circuiting its for-loop on "address < end". But, the mmu_notifier
invocations to invalid_range_{start,end}() are passed a bogus zero-sized
range, which may be unexpected behavior for secondary MMUs.
The bug was exposed by commit ed922739c9 ("KVM: Use interval tree to
do fast hva lookup in memslots"), currently queued in the KVM tree for
5.17, which added a WARN to detect ranges with start==end.
Link: https://lkml.kernel.org/r/20211228234257.1926057-1-seanjc@google.com
Fixes: 1bfad99ab4 ("hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reported-by: syzbot+4e697fe80a31aa7efe21@syzkaller.appspotmail.com
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:
- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.
It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.
This patch introduces memalloc_retry_wait() with takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
however is appropriate.
For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffie ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops uses.
linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.
Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch to add anonymous vma names causes a build failure in some
configurations:
include/linux/mm_types.h: In function 'is_same_vma_anon_name':
include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
924 | return name && vma_name && !strcmp(name, vma_name);
| ^~~~~~
include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
This should not really be part of linux/mm_types.h in the first place,
as that header is meant to only contain structure defintions and need a
minimum set of indirect includes itself.
While the header clearly includes more than it should at this point,
let's not make it worse by including string.h as well, which would pull
in the expensive (compile-speed wise) fortify-string logic.
Move the new functions into a separate header that only needs to be
included in a couple of locations.
Link: https://lkml.kernel.org/r/20211207125710.2503446-1-arnd@kernel.org
Fixes: "mm: add a field to store names for private anonymous memory"
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Colin Cross <ccross@google.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dump_mapping() is a big chunk of dump_page(), and it'd be handy to be
able to call it when we don't have a struct page. Split it out and move
it to fs/inode.c. Take the opportunity to simplify some of the debug
messages a little.
Link: https://lkml.kernel.org/r/20211121121056.2870061-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__user annotations are used by the checker (e.g sparse) to mark user
pointers. However here __user is applied to a struct directly, without a
pointer being directly involved.
Although the presence of __user does not cause sparse to emit a warning,
__user should be removed for consistency with other uses of offsetof().
Note: No functional changes intended.
Link: https://lkml.kernel.org/r/20211122101256.7875-1-amit.kachhap@arm.com
Signed-off-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com>
Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable 'free_space' is being initialized with a value that is not
read, it is being re-assigned later in the two paths of an if statement.
The early initialization is redundant and can be removed.
Link: https://lkml.kernel.org/r/20220112230411.1090761-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are currently two ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field.
Move the ocfs2 cluster sysfs code to use default_groups field which has
been the preferred way since aa30f47cf6 ("kobject: Add support for
default attribute groups to kobj_type") so that we can soon get rid of
the obsolete default_attrs field.
Link: https://lkml.kernel.org/r/20220106102028.3345634-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The variable 'root_bh' is being initialized with a value that is not
read, it is being re-assigned later on closer to its use. The early
initialization is redundant and can be removed.
Link: https://lkml.kernel.org/r/20211228013719.620923-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are currently two ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field.
Move the ocfs2 code to use default_groups field which has been the
preferred way since aa30f47cf6 ("kobject: Add support for default
attribute groups to kobj_type") so that we can soon get rid of the
obsolete default_attrs field.
Link: https://lkml.kernel.org/r/20211228144517.391660-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_grab_pages_for_write() may return -EAGAIN if write context type is
mmap and it could not lock the target page. In this case, we exit with
no error and no target page. And then trigger the caller page_mkwrite()
to retry.
Since there are other caller types, e.g. buffer and direct io, make the
return value handling more clear.
Link: https://lkml.kernel.org/r/20211206065051.103353-1-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This issue was detected with the help of Coccinelle.
Link: https://lkml.kernel.org/r/20211105014424.75372-1-zhang.mingyu@zte.com.cn
Signed-off-by: Zhang Mingyu <zhang.mingyu@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c1f6925e10 ("mm: put readahead pages in cache earlier") causes
the read performance of squashfs to deteriorate.Through testing, we find
that the performance will be back by closing the readahead of squashfs.
So we want to learn the way of ubifs, provides backing_dev_info and
disable read-ahead
We tested the following data by fio.
squashfs image blocksize=128K
test command:
fio --name basic --bs=? --filename="/mnt/test_file" --rw=? --iodepth=1 --ioengine=psync --runtime=200 --time_based
turn on squashfs readahead in 5.10 kernel
bs(k) read/randread MB/s
4 randread 271
128 randread 231
1024 randread 246
4 read 310
128 read 245
1024 read 247
turn off squashfs readahead in 5.10 kernel
bs(k) read/randread MB/s
4 randread 293
128 randread 330
1024 randread 363
4 read 338
128 read 360
1024 read 365
turn on squashfs readahead and revert the
commit c1f6925e1091("mm: put readahead
pages in cache earlier") in 5.10 kernel
bs(k) read/randread MB/s
4 randread 289
128 randread 306
1024 randread 335
4 read 337
128 read 336
1024 read 338
Link: https://lkml.kernel.org/r/20211116113141.1391026-1-zhengliang6@huawei.com
Signed-off-by: Zheng Liang <zhengliang6@huawei.com>
Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comments for the file should not be in kernel-doc format:
/**
* attrib.c - NTFS attribute operations. Part of the Linux-NTFS
as it causes it to be incorrectly identified for function
ntfs_map_runlist_nolock(), causing some warnings found by running
scripts/kernel-doc.:
fs/ntfs/attrib.c:25: warning: Incorrect use of kernel-doc format: * ntfs_map_runlist_nolock - map (a part of) a runlist of an ntfs inode
fs/ntfs/attrib.c:71: warning: Function parameter or member 'ni' not described in 'ntfs_map_runlist_nolock'
fs/ntfs/attrib.c:71: warning: Function parameter or member 'vcn' not described in 'ntfs_map_runlist_nolock'
fs/ntfs/attrib.c:71: warning: Function parameter or member 'ctx' not described in 'ntfs_map_runlist_nolock'
fs/ntfs/attrib.c:71: warning: expecting prototype for attrib.c - NTFS attribute operations. Part of the Linux(). Prototype was for ntfs_map_runlist_nolock() instead
Link: https://lkml.kernel.org/r/20220106015145.67067-1-yang.lee@linux.alibaba.com
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix a minor locking inconsistency in readdir
- Fix incorrect fs feature bit validation for secondary superblocks
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHfFFYACgkQ+H93GTRK
tOvrQA//RnDMit4zLoPo3WeZTDjpv1CQkOXFgQSba+Owgz1RvGn5f29R4ICKW6Uy
NfnPcq5AeNhysOAi0uZPl4EC6140QgMrD4aJjKY1JAc/BkeBVEN05NOUFfCuBXU/
RMzL8nGD5zsN7fPIWVGS60iitou8IvoX8ky4dKx7XcsbFzbBMtnJIsUpfqVutY9u
i+zub6sNRkstr4uBRk+1S8uAqHUAW+21YwfKqB6pgpCoO5BHu2e4eN01ohWvF8Ru
0ujJ2j4YlfjYmtQypFk3rNgQoI0oXY6mYWZPKr7fYvhDpoKEodUvHRLIJEHqoD+y
fX6Ey5XxpHmDxSJnWRC7Vznl56VEmMndUcWEq2ZqROp/r/zZp7StXyHLO7DR9nEs
mp+55a4tcKhHa/KnjbqexAaN4a1NTpryMqjsPHP7VTNu5Dq8CK4kHtrcEVrKCZFq
ExRFMfoUDxap6iJaxoKAz5CtZyJeuZO8bLCa7jq/2F91EWp+2aclxEU6VY5r6B2X
dFlIY8XnZEfJxE3xnhH/aDs6IKWH4YmvgtwxNb+RIupyMTfJzpcysjV9NRER8fAv
9rLfWNz+nx0efNyXle+h+vrzT/zXgyi/0PSjAQS9/xErTPRFAmXFefLl/0oMQSHC
XFIVivcR0lxKpZhB5CIOfVj1Vu8VM52JvlXjHPU3ObqEbFDxtf4=
=RYJ4
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.17-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"These are the last few obvious fixes that I found while stress testing
online fsck for XFS prior to initiating a design review of the whole
giant machinery.
- Fix a minor locking inconsistency in readdir
- Fix incorrect fs feature bit validation for secondary superblocks"
* tag 'xfs-5.17-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix online fsck handling of v5 feature bits on secondary supers
xfs: take the ILOCK when readdir inspects directory mapping data
- Simplify the dax_operations API
- Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining
and applying a partition offset to all its DAX iomap operations.
- Remove wrappers and device-mapper stacked callbacks for
->copy_from_iter() and ->copy_to_iter() in favor of moving
block_device relative offset responsibility to the
dax_direct_access() caller.
- Remove the need for an @bdev in filesystem-DAX infrastructure
- Remove unused uio helpers copy_from_iter_flushcache() and
copy_mc_to_iter() as only the non-check_copy_size() versions are
used for DAX.
- Prepare XFS for the pending (next merge window) DAX+reflink support
- Remove deprecated DEV_DAX_PMEM_COMPAT support
- Cleanup a straggling misuse of the GUID api
Tags offered after the branch was cut:
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/Ydb/3P+8nvjCjYfO@redhat.com
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYd3dTAAKCRDfioYZHlFs
Z//UAP9zetoTE+O7zJG7CXja4jSopSadbdbh6QKSXaqfKBPvQQD+N4US3wA2bGv8
f/qCY62j2Hj3hUTGHs9RvTyw3JsSYAA=
=QvDs
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax and libnvdimm updates from Dan Williams:
"The bulk of this is a rework of the dax_operations API after
discovering the obstacles it posed to the work-in-progress DAX+reflink
support for XFS and other copy-on-write filesystem mechanics.
Primarily the need to plumb a block_device through the API to handle
partition offsets was a sticking point and Christoph untangled that
dependency in addition to other cleanups to make landing the
DAX+reflink support easier.
The DAX_PMEM_COMPAT option has been around for 4 years and not only
are distributions shipping userspace that understand the current
configuration API, but some are not even bothering to turn this option
on anymore, so it seems a good time to remove it per the deprecation
schedule. Recall that this was added after the device-dax subsystem
moved from /sys/class/dax to /sys/bus/dax for its sysfs organization.
All recent functionality depends on /sys/bus/dax.
Some other miscellaneous cleanups and reflink prep patches are
included as well.
Summary:
- Simplify the dax_operations API:
- Eliminate bdev_dax_pgoff() in favor of the filesystem
maintaining and applying a partition offset to all its DAX iomap
operations.
- Remove wrappers and device-mapper stacked callbacks for
->copy_from_iter() and ->copy_to_iter() in favor of moving
block_device relative offset responsibility to the
dax_direct_access() caller.
- Remove the need for an @bdev in filesystem-DAX infrastructure
- Remove unused uio helpers copy_from_iter_flushcache() and
copy_mc_to_iter() as only the non-check_copy_size() versions are
used for DAX.
- Prepare XFS for the pending (next merge window) DAX+reflink support
- Remove deprecated DEV_DAX_PMEM_COMPAT support
- Cleanup a straggling misuse of the GUID api"
* tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (38 commits)
iomap: Fix error handling in iomap_zero_iter()
ACPI: NFIT: Import GUID before use
dax: remove the copy_from_iter and copy_to_iter methods
dax: remove the DAXDEV_F_SYNC flag
dax: simplify dax_synchronous and set_dax_synchronous
uio: remove copy_from_iter_flushcache() and copy_mc_to_iter()
iomap: turn the byte variable in iomap_zero_iter into a ssize_t
memremap: remove support for external pgmap refcounts
fsdax: don't require CONFIG_BLOCK
iomap: build the block based code conditionally
dax: fix up some of the block device related ifdefs
fsdax: shift partition offset handling into the file systems
dax: return the partition offset from fs_dax_get_by_bdev
iomap: add a IOMAP_DAX flag
xfs: pass the mapping flags to xfs_bmbt_to_iomap
xfs: use xfs_direct_write_iomap_ops for DAX zeroing
xfs: move dax device handling into xfs_{alloc,free}_buftarg
ext4: cleanup the dax handling in ext4_fill_super
ext2: cleanup the dax handling in ext2_fill_super
fsdax: decouple zeroing from the iomap buffered I/O code
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmHeBGsACgkQ+7dXa6fL
C2tyLw/8C2Gs/XvOZvRO7KPetKI9BbQSFoCe7uvGbiPq5CEmgcjWzQxvQGklBiZD
qYa6pMNye1iGpsHOY3Yu210b7vMQiRLnnxvVle0UrjpZR7CcxYS0gGV+6yRdbDGy
W1X6GFiX06qiNsgBH4msYp0SmbhhfkTyAx1BeBZAEtX8iFgaPfOldPY2nLMcTDD6
6FT1nTzRcMHx9IUQZJtpeatzc70Qg8+fOr2UAY2nOIypXh6+vAMBO80xtUjGVU+1
pWD1E+8cXSLfwEEzquFWoWTsTX7hNfsesEN10FmBf1bVCH9ZDFE01MOl6B8+CkFl
+xfkvDNFC3yyUwAMVAV4+A4Be+cVLSqN2R91QIKJnAj9w1OjxASrwZJ1YeZp6KP4
h0XKuPs3sRwwbNPVL/nP0UPNexoJnOUAaHesl4uKkRrExmxz9xGOIqIri2+tUIO+
HkGyNns1huymj1K1ja4AQbDiZZX39GgYVleyg9g3uuy1FS4k+/myJcXo/CqWn3ON
4oeNwxwLvlcqIQnPrESvwev50lFZYB4pfwvez6T2C5dL/Wk/xdeJK9iG81RWgx7y
5XcDeoGDE08gMCGWVPjuhOCXypeiRGHhRNlcxTtq5kLwBZGkcYg/wFFnWn+6hzc4
kyXw2kS5WZq4Q/FPh7BdY0eHp6xv0EpAOZwceneLB9lhNINdxcQ=
=ISJ6
-----END PGP SIGNATURE-----
Merge tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache rewrite from David Howells:
"This is a set of patches that rewrites the fscache driver and the
cachefiles driver, significantly simplifying the code compared to
what's upstream, removing the complex operation scheduling and object
state machine in favour of something much smaller and simpler.
The series is structured such that the first few patches disable
fscache use by the network filesystems using it, remove the cachefiles
driver entirely and as much of the fscache driver as can be got away
with without causing build failures in the network filesystems.
The patches after that recreate fscache and then cachefiles,
attempting to add the pieces in a logical order. Finally, the
filesystems are reenabled and then the very last patch changes the
documentation.
[!] Note: I have dropped the cifs patch for the moment, leaving local
caching in cifs disabled. I've been having trouble getting that
working. I think I have it done, but it needs more testing (there
seem to be some test failures occurring with v5.16 also from
xfstests), so I propose deferring that patch to the end of the
merge window.
WHY REWRITE?
============
Fscache's operation scheduling API was intended to handle sequencing
of cache operations, which were all required (where possible) to run
asynchronously in parallel with the operations being done by the
network filesystem, whilst allowing the cache to be brought online and
offline and to interrupt service for invalidation.
With the advent of the tmpfile capacity in the VFS, however, an
opportunity arises to do invalidation much more simply, without having
to wait for I/O that's actually in progress: Cachefiles can simply
create a tmpfile, cut over the file pointer for the backing object
attached to a cookie and abandon the in-progress I/O, dismissing it
upon completion.
Future work here would involve using Omar Sandoval's vfs_link() with
AT_LINK_REPLACE[1] to allow an extant file to be displaced by a new
hard link from a tmpfile as currently I have to unlink the old file
first.
These patches can also simplify the object state handling as I/O
operations to the cache don't all have to be brought to a stop in
order to invalidate a file. To that end, and with an eye on to writing
a new backing cache model in the future, I've taken the opportunity to
simplify the indexing structure.
I've separated the index cookie concept from the file cookie concept
by C type now. The former is now called a "volume cookie" (struct
fscache_volume) and there is a container of file cookies. There are
then just the two levels. All the index cookie levels are collapsed
into a single volume cookie, and this has a single printable string as
a key. For instance, an AFS volume would have a key of something like
"afs,example.com,1000555", combining the filesystem name, cell name
and volume ID. This is freeform, but must not have '/' chars in it.
I've also eliminated all pointers back from fscache into the network
filesystem. This required the duplication of a little bit of data in
the cookie (cookie key, coherency data and file size), but it's not
actually that much. This gets rid of problems with making sure we keep
netfs data structures around so that the cache can access them.
These patches mean that most of the code that was in the drivers
before is simply gone and those drivers are now almost entirely new
code. That being the case, there doesn't seem any particular reason to
try and maintain bisectability across it. Further, there has to be a
point in the middle where things are cut over as there's a single
point everything has to go through (ie. /dev/cachefiles) and it can't
be in use by two drivers at once.
ISSUES YET OUTSTANDING
======================
There are some issues still outstanding, unaddressed by this patchset,
that will need fixing in future patchsets, but that don't stop this
series from being usable:
(1) The cachefiles driver needs to stop using the backing filesystem's
metadata to store information about what parts of the cache are
populated. This is not reliable with modern extent-based
filesystems.
Fixing this is deferred to a separate patchset as it involves
negotiation with the network filesystem and the VM as to how much
data to download to fulfil a read - which brings me on to (2)...
(2) NFS (and CIFS with the dropped patch) do not take account of how
the cache would like I/O to be structured to meet its granularity
requirements. Previously, the cache used page granularity, which
was fine as the network filesystems also dealt in page
granularity, and the backing filesystem (ext4, xfs or whatever)
did whatever it did out of sight. However, we now have folios to
deal with and the cache will now have to store its own metadata to
track its contents.
The change I'm looking at making for cachefiles is to store
content bitmaps in one or more xattrs and making a bit in the map
correspond to something like a 256KiB block. However, the size of
an xattr and the fact that they have to be read/updated in one go
means that I'm looking at covering 1GiB of data per 512-byte map
and storing each map in an xattr. Cachefiles has the potential to
grow into a fully fledged filesystem of its very own if I'm not
careful.
However, I'm also looking at changing things even more radically
and going to a different model of how the cache is arranged and
managed - one that's more akin to the way, say, openafs does
things - which brings me on to (3)...
(3) The way cachefilesd does culling is very inefficient for large
caches and it would be better to move it into the kernel if I can
as cachefilesd has to keep asking the kernel if it can cull a
file. Changing the way the backend works would allow this to be
addressed.
BITS THAT MAY BE CONTROVERSIAL
==============================
There are some bits I've added that may be controversial:
(1) I've provided a flag, S_KERNEL_FILE, that cachefiles uses to check
if a files is already being used by some other kernel service
(e.g. a duplicate cachefiles cache in the same directory) and
reject it if it is. This isn't entirely necessary, but it helps
prevent accidental data corruption.
I don't want to use S_SWAPFILE as that has other effects, but
quite possibly swapon() should set S_KERNEL_FILE too.
Note that it doesn't prevent userspace from interfering, though
perhaps it should. (I have made it prevent a marked directory from
being rmdir-able).
(2) Cachefiles wants to keep the backing file for a cookie open whilst
we might need to write to it from network filesystem writeback.
The problem is that the network filesystem unuses its cookie when
its file is closed, and so we have nothing pinning the cachefiles
file open and it will get closed automatically after a short time
to avoid EMFILE/ENFILE problems.
Reopening the cache file, however, is a problem if this is being
done due to writeback triggered by exit(). Some filesystems will
oops if we try to open a file in that context because they want to
access current->fs or suchlike.
To get around this, I added the following:
(A) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network
filesystem inode to indicate that we have a usage count on the
cookie caching that inode.
(B) A flag in struct writeback_control, unpinned_fscache_wb, that
is set when __writeback_single_inode() clears the last dirty
page from i_pages - at which point it clears
I_PINNING_FSCACHE_WB and sets this flag.
This has to be done here so that clearing I_PINNING_FSCACHE_WB
can be done atomically with the check of PAGECACHE_TAG_DIRTY
that clears I_DIRTY_PAGES.
(C) A function, fscache_set_page_dirty(), which if it is not set,
sets I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to
pin the cache resources.
(D) A function, fscache_unpin_writeback(), to be called by
->write_inode() to unuse the cookie.
(E) A function, fscache_clear_inode_writeback(), to be called when
the inode is evicted, before clear_inode() is called. This
cleans up any lingering I_PINNING_FSCACHE_WB.
The network filesystem can then use these tools to make sure that
fscache_write_to_cache() can write locally modified data to the
cache as well as to the server.
For the future, I'm working on write helpers for netfs lib that
should allow this facility to be removed by keeping track of the
dirty regions separately - but that's incomplete at the moment and
is also going to be affected by folios, one way or another, since
it deals with pages"
Link: https://lore.kernel.org/all/510611.1641942444@warthog.procyon.org.uk/
Tested-by: Dominique Martinet <asmadeus@codewreck.org> # 9p
Tested-by: kafs-testing@auristor.com # afs
Tested-by: Jeff Layton <jlayton@kernel.org> # ceph
Tested-by: Dave Wysochanski <dwysocha@redhat.com> # nfs
Tested-by: Daire Byrne <daire@dneg.com> # nfs
* tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (67 commits)
9p, afs, ceph, nfs: Use current_is_kswapd() rather than gfpflags_allow_blocking()
fscache: Add a tracepoint for cookie use/unuse
fscache: Rewrite documentation
ceph: add fscache writeback support
ceph: conversion to new fscache API
nfs: Implement cache I/O by accessing the cache directly
nfs: Convert to new fscache volume/cookie API
9p: Copy local writes to the cache when writing to the server
9p: Use fscache indexing rewrite and reenable caching
afs: Skip truncation on the server of data we haven't written yet
afs: Copy local writes to the cache when writing to the server
afs: Convert afs to use the new fscache API
fscache, cachefiles: Display stat of culling events
fscache, cachefiles: Display stats of no-space events
cachefiles: Allow cachefiles to actually function
fscache, cachefiles: Store the volume coherency data
cachefiles: Implement the I/O routines
cachefiles: Implement cookie resize for truncate
cachefiles: Implement begin and end I/O operation
cachefiles: Implement backing file wrangling
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYdw8xgAKCRDh3BK/laaZ
PFy3AQCHSltzy6f234CcsFk3mtJn0im0tDbRoEYFD731JOR1YAD9HQKtJRn/sMCF
r0PnZnOWJ35RWB3o8uEqptsgrZXJkQo=
=HosR
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Fix a regression introduced in 5.15
- Extend the size of the FUSE_INIT request to accommodate for more
flags. There's a slight possibility of a regression for obscure fuse
servers; if this happens, then more complexity will need to be added
to the protocol
- Allow the DAX property to be controlled by the server on a per-inode
basis in virtiofs
- Allow sending security context to the server when creating a file or
directory
* tag 'fuse-update-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
Documentation/filesystem/dax: DAX on virtiofs
fuse: mark inode DONT_CACHE when per inode DAX hint changes
fuse: negotiate per inode DAX in FUSE_INIT
fuse: enable per inode DAX
fuse: support per inode DAX in fuse protocol
fuse: make DAX mount option a tri-state
fuse: add fuse_should_enable_dax() helper
fuse: Pass correct lend value to filemap_write_and_wait_range()
fuse: send security context of inode on file
fuse: extend init flags
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmHek9YACgkQnJ2qBz9k
QNmgUQf+NpBuhoBI83dWI7NvwMZ1srtOiFdtVctq4jT8rS5m3KnwaTbq9HZ2y+WL
6+Tp8B2Qs/C2X6DbX87N1RBaP6mpMQCY4ADfbYNV6KOhUQqGBo/zxyM4nAk0WOBR
wXFWNhd/c3Xr3eb5Ggaus11UdQxXBFtZ72Azm2eUXnIKuSeAH1KWbyElQGzLqMDE
NLS68XFOcxLLScWA0lTExtouV6ZLUHm6EyYRH+TmIjAwb01JDwceXmdqzXNO7uiV
cKalWV8GcUH2OOLdX/KXvAwDVwuJwq4PCGCajGPEI8Ebct5Eh0YZuHVcvmfReOde
dnyZNiMOH8GwYVDHUDCNI8zD/BpyJQ==
=4gnz
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF / reiserfs updates from Jan Kara:
"One UDF fix and one reiserfs cleanup"
* tag 'fs_for_v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix error handling in udf_new_inode()
reiserfs: don't use congestion_wait()
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmHektkACgkQnJ2qBz9k
QNk73Qf9G/8C29GC7lTZvW8ZOqM2onvP16/wpMU98jkw6/xpsdGLMmQlP4JCdB6Y
mTttvPxwMMVrdRmt+MNIgvWPOK2LKHdkbNbj0DRw4JxLEIm+zXddAXAcvX2vLeUn
N24idiVQxIKWS3zJhQ6H7sFjSrZ7ztjkeHWff70LIzURIupYO0+rRmKDgOqo6D5g
cKAdb8x9ZQLr0CTQoHKQxA8IeCSirSblVaDMdes2KrJH+i05s3aqS9jBvY16nrLo
Xvm4Vv7ecQXcmKDswuxVcmBVlVVsWxoYv1w8+v9qRGJnbgpQkjjgCGjCh54M18Do
/OovTVlqGWRfmf2Eo1FCXXz90ppGlQ==
=gvMh
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fanotify updates from Jan Kara:
"Support for new FAN_RENAME fanotify event and support for reporting
child info in directory fanotify events (FAN_REPORT_TARGET_FID)"
* tag 'fsnotify_for_v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: wire up FAN_RENAME event
fanotify: report old and/or new parent+name in FAN_RENAME event
fanotify: record either old name new name or both for FAN_RENAME
fanotify: record old and new parent and name in FAN_RENAME event
fanotify: support secondary dir fh and name in fanotify_info
fanotify: use helpers to parcel fanotify_info buffer
fanotify: use macros to get the offset to fanotify_info buffer
fsnotify: generate FS_RENAME event with rich information
fanotify: introduce group flag FAN_REPORT_TARGET_FID
fsnotify: separate mark iterator type from object type enum
fsnotify: clarify object type argument
This should be all that is needed for XFS to use large folios.
There is no code in this pull request to create large folios, but
no additional changes should be needed to XFS or iomap once they
are created.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmHcpaUACgkQDpNsjXcp
gj4MUAf+ItcKfgFo1QCMT+6Y0mohVqPme/vdyOCNv6yOOfZZqN5ZQc+2hmxXrRz9
XPOPwZKL0TttlHSYEJmrm8mqwN8UXl0kqMu4kQqOXMziiD9qpVlaLXOZ7iLdkQxu
z/xe1iACcGfJUaQCsaMP6BZqp6iETA4qP72dBE4jc6PC4H3OI0pN/900gEbAcLxD
Yn0a5NhrdS/EySU2aHLB6OcwhqnSiHBVjUbFiuXxuvOYyzLaERIh00Kx3jLdj4DR
82K4TF8h2IZpALfIDSt0JG+gHLCc+EfF7Yd/xkeEv0md3ncyi+jWvFCFPNJbyFjm
cYoDTSunfbxwszA2n01R4JM8/KkGwA==
=IeFX
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.17' of git://git.infradead.org/users/willy/linux
Pull iomap updates from Matthew Wilcox:
"Convert xfs/iomap to use folios.
This should be all that is needed for XFS to use large folios. There
is no code in this pull request to create large folios, but no
additional changes should be needed to XFS or iomap once they are
created.
Usually this would have come from Darrick, and we had intended that it
would come that route. Between the holidays and various things which
Darrick needed to work on, he asked if I could send things directly.
There weren't any other iomap patches pending for this release, which
probably also played a role"
* tag 'iomap-5.17' of git://git.infradead.org/users/willy/linux: (26 commits)
iomap: Inline __iomap_zero_iter into its caller
xfs: Support large folios
iomap: Support large folios in invalidatepage
iomap: Convert iomap_migrate_page() to use folios
iomap: Convert iomap_add_to_ioend() to take a folio
iomap: Simplify iomap_do_writepage()
iomap: Simplify iomap_writepage_map()
iomap,xfs: Convert ->discard_page to ->discard_folio
iomap: Convert iomap_write_end_inline to take a folio
iomap: Convert iomap_write_begin() and iomap_write_end() to folios
iomap: Convert __iomap_zero_iter to use a folio
iomap: Allow iomap_write_begin() to be called with the full length
iomap: Convert iomap_page_mkwrite to use a folio
iomap: Convert readahead and readpage to use a folio
iomap: Convert iomap_read_inline_data to take a folio
iomap: Use folio offsets instead of page offsets
iomap: Convert bio completions to use folios
iomap: Pass the iomap_page into iomap_set_range_uptodate
iomap: Add iomap_invalidate_folio
iomap: Convert iomap_releasepage to use a folio
...
This patchset stops just short of actually enabling large folios.
It converts everything that I noticed needs to be converted, but there may
still be places I've overlooked which still have page size assumptions.
The big change here is using large entries in the page cache XArray
instead of many small entries. That only affects shmem for now, but
it's a pretty big change for shmem since it changes where memory needs
to be allocated (at split time instead of insertion).
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmHcraoACgkQDpNsjXcp
gj7C3wgAl0cjtdVzTpkLmbnInsicW1m3thnbkSXYbpqRccFjpu2kEBGj31PT+oGz
dzgXP7SNZ/VkFT+qWtmHSRF/J41B6f9bFojO81B2aQdpRiziU+5QbSbXbfUjwVhE
GJF0WGSJtVqySKynXP/iYTEt2zj6BiVperAwIqzhZpPY7gNoyDgeRD34Xy5bQqdD
ey6/Uwkh7oFHLEDcgxsEnyF0tUR3q+gpe5XZW1fb79p3crWw44xATc3UvKv8qCLC
Rd4oHmKkOj4MvdiUxJEfXI+XxgrkQ8XRO70B+p6ZljhDaoDZYw7ullxA0gvlSpNX
6pnjSQlKA1VQXsi6PMSt+9vf26XxaQ==
=KeYZ
-----END PGP SIGNATURE-----
Merge tag 'folio-5.17' of git://git.infradead.org/users/willy/pagecache
Pull folio conversion updates from Matthew Wilcox:
"Convert much of the page cache to use folios
This stops just short of actually enabling large folios. It converts
everything that I noticed needs to be converted, but there may still
be places I've overlooked which still have page size assumptions.
The big change here is using large entries in the page cache XArray
instead of many small entries. That only affects shmem for now, but
it's a pretty big change for shmem since it changes where memory needs
to be allocated (at split time instead of insertion)"
* tag 'folio-5.17' of git://git.infradead.org/users/willy/pagecache: (49 commits)
mm: Use multi-index entries in the page cache
XArray: Add xas_advance()
truncate,shmem: Handle truncates that split large folios
truncate: Convert invalidate_inode_pages2_range to folios
fs: Convert vfs_dedupe_file_range_compare to folios
mm: Remove pagevec_remove_exceptionals()
mm: Convert find_lock_entries() to use a folio_batch
filemap: Return only folios from find_get_entries()
filemap: Convert filemap_get_read_batch() to use a folio_batch
filemap: Convert filemap_read() to use a folio
truncate: Add invalidate_complete_folio2()
truncate: Convert invalidate_inode_pages2_range() to use a folio
truncate: Skip known-truncated indices
truncate,shmem: Add truncate_inode_folio()
shmem: Convert part of shmem_undo_range() to use a folio
mm: Add unmap_mapping_folio()
truncate: Add truncate_cleanup_folio()
filemap: Add filemap_release_folio()
filemap: Use a folio in filemap_page_mkwrite
filemap: Use a folio in filemap_map_pages
...
Here is the set of changes for the driver core for 5.17-rc1.
Lots of little things here, including:
- kobj_type cleanups
- auxiliary_bus documentation updates
- auxiliary_device conversions for some drivers (relevant
subsystems all have provided acks for these)
- kernfs lock contention reduction for some workloads
- other tiny cleanups and changes.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYd7deA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ym8ngCgw0ANwrRPE5b1dthEmfU2f8Knk5kAn0pHQv6R
VRZJypgNfU/Pt0ykstZD
=CO9J
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of changes for the driver core for 5.17-rc1.
Lots of little things here, including:
- kobj_type cleanups
- auxiliary_bus documentation updates
- auxiliary_device conversions for some drivers (relevant subsystems
all have provided acks for these)
- kernfs lock contention reduction for some workloads
- other tiny cleanups and changes.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (43 commits)
kobject documentation: remove default_attrs information
drivers/firmware: Add missing platform_device_put() in sysfb_create_simplefb
debugfs: lockdown: Allow reading debugfs files that are not world readable
driver core: Make bus notifiers in right order in really_probe()
driver core: Move driver_sysfs_remove() after driver_sysfs_add()
firmware: edd: remove empty default_attrs array
firmware: dmi-sysfs: use default_groups in kobj_type
qemu_fw_cfg: use default_groups in kobj_type
firmware: memmap: use default_groups in kobj_type
sh: sq: use default_groups in kobj_type
headers/uninline: Uninline single-use function: kobject_has_children()
devtmpfs: mount with noexec and nosuid
driver core: Simplify async probe test code by using ktime_ms_delta()
nilfs2: use default_groups in kobj_type
kobject: remove kset from struct kset_uevent_ops callbacks
driver core: make kobj_type constant.
driver core: platform: document registration-failure requirement
vdpa/mlx5: Use auxiliary_device driver data helpers
net/mlx5e: Use auxiliary_device driver data helpers
soundwire: intel: Use auxiliary_device driver data helpers
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHd8BkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpkgED/4vyoNlKfuTqRzuHAZzq+PBYlMpVbn54ebm
z2AJ2StLrZ8rflz/9ERNVYxbSS6ELf/8H2Mdjnb5uCoRza/QkMf4JnGtgT0pBavV
uoruKTiYgQAQJmGuhxAo0TQwyW7F6qadp1eZH8CqxB2Cwj5wgSb3SB/yIlNkj6+2
fSzVGCh0CljhrkgFG10z507VGJYmgBqE9kl2tnfYkAA4blqyowZN4boTMx3l9fm5
2GWt0xg80lgMCzHb5m/sbqdOrMJV77TkuiAuVc+FjIfdGiVWo7XCn3q3e0qqJhRM
A3x9mmUyF/xNrRLMgTyYxHzKLrytBibr+VxvDLr/LmLtV8viVCsgJiP5EIiAWIAt
BCBks31QUxHaB9fyiQ43/nsRZLHmhXErLEK4D0loAXa50p4Xl4ZdaegD6txSH3FI
mGprv4Si1kMNqSxmzrOUfFk8Xdv/vC08RWL4UbRa8xkgwWUAIhpRaYmAtmw913kb
MgwFQuGpE9b7ae7HNdgWzyI424srCwY5UawgpzI25ZwGAVXTXDQ2qw1Lmyyo6swT
bsYVyH/vJLvCS1tjmBtrKfJQ0Mokm1sJpaMeTs/SfSyHmAUXEsEFdXVu6bRnVkYF
9vjHZKOl5jA5nx6/JDpW993GnHV3FGkCDfENXs3wUYY7Hu0DDYZt0CGiqGLjC7Oq
Ow4q3aEJQA==
=syIB
-----END PGP SIGNATURE-----
Merge tag 'for-5.17/io_uring-2022-01-11' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
- Support for prioritized work completions (Hao)
- Simplification of reissue (Pavel)
- Add support for CQE skip (Pavel)
- Memory leak fix going to 5.15-stable (Pavel)
- Re-write of internal poll. This both cleans up that code, and gets us
ready to fix the POLLFREE issue (Pavel)
- Various cleanups (GuoYong, Pavel, Hao)
* tag 'for-5.17/io_uring-2022-01-11' of git://git.kernel.dk/linux-block: (31 commits)
io_uring: fix not released cached task refs
io_uring: remove redundant tab space
io_uring: remove unused function parameter
io_uring: use completion batching for poll rem/upd
io_uring: single shot poll removal optimisation
io_uring: poll rework
io_uring: kill poll linking optimisation
io_uring: move common poll bits
io_uring: refactor poll update
io_uring: remove double poll on poll update
io_uring: code clean for some ctx usage
io_uring: batch completion in prior_task_list
io_uring: split io_req_complete_post() and add a helper
io_uring: add helper for task work execution code
io_uring: add a priority tw list for irq completion work
io-wq: add helper to merge two wq_lists
io_uring: reuse io_req_task_complete for timeouts
io_uring: tweak iopoll CQE_SKIP event counting
io_uring: simplify selected buf handling
io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
...
While I was auditing the code in xfs_repair that adds feature bits to
existing V5 filesystems, I decided to have a look at how online fsck
handles feature bits, and I found a few problems:
1) ATTR2 is added to the primary super when an xattr is set to a file,
but that isn't consistently propagated to secondary supers. This isn't
a corruption, merely a discrepancy that repair will fix if it ever has
to restore the primary from a secondary. Hence, if we find a mismatch
on a secondary, this is a preen condition, not a corruption.
2) There are more compat and ro_compat features now than there used to
be, but we mask off the newer features from testing. This means we
ignore inconsistencies in the INOBTCOUNT and BIGTIME features, which is
wrong. Get rid of the masking and compare directly.
3) NEEDSREPAIR, when set on a secondary, is ignored by everyone. Hence
a mismatch here should also be flagged for preening, and online repair
should clear the flag. Right now we ignore it due to (2).
4) log_incompat features are ephemeral, since we can clear the feature
bit as soon as the log no longer contains live records for a particular
log feature. As such, the only copy we care about is the one in the
primary super. If we find any bits set in the secondary super, we
should flag that for preening, and clear the bits if the user elects to
repair it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
- set_fs removal
- Devicetree support
- Many cleanups from Al
- Various virtio and build related fixes
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmHbPpwWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wXkcD/9UfDRFqvgtZIlacQ/0IvN24xeq
+f4aXoXEVyVsCd02jv9pUk3IAezIQyJf3MGtNJ4D/UFXtYfjEYjK5kJpPDP6umaZ
ZDnpTzn29HW2aGlgOxW9gU7a3Yze629QasIRP6x7Ht+Hk5eXrvRYrgcmKtw1mm04
SA5v5ZqP3P5r623fpsFiw4Dvl7l6MhDyFeyA2tabNnmv93HgB76PHDtV2Z+SWrC+
ubjlfBQc87QGHW+eTvce+0qw9APMoJpNFjNN4H8P/9VcDTvw+KL2JqQ02HSMWh4z
HeHKsv6hbty+GskBhbaWDW7867fPJ3e08TFAAAjeEiBP/CDBwjOTSr3eOw1eHgzU
xdAqC2Bz0e5G3shClmVEzzvcP6R2cgNZjeBze5m3wQ1NKHEddk6N9t5K+4NrOpgp
gbNN5Q4FAVOBKeQsZWG81bJKGcu7SbShgiKjlxcaRpMyp6LwyD4naauGjmCzYsbf
Pd4ilLO1Yocf7nFs2C4vWxE4iAZ6hfQtukerIxCQfb/Y2BaWT3bcWWYHFRFy6Lq+
hTFGnjf+Ro65QCoa1idaLaUdhwAGi6U9sjjL6G/JdQCCE3ftcXLVA9TJz9CNdMb5
98IGznxhczOZc7rHNXOF4km5+OUrU6N+C0WRp3yOoUWcI+Ms4PXHzqIwC5cde/V7
O/o9O1BAoBP6LE1pPg==
=5J6E
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML updates from Richard Weinberger:
- set_fs removal
- Devicetree support
- Many cleanups from Al
- Various virtio and build related fixes
* tag 'for-linus-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (31 commits)
um: virtio_uml: Allow probing from devicetree
um: Add devicetree support
um: Extract load file helper from initrd.c
um: remove set_fs
hostfs: Fix writeback of dirty pages
um: Use swap() to make code cleaner
um: header debriding - sigio.h
um: header debriding - os.h
um: header debriding - net_*.h
um: header debriding - mem_user.h
um: header debriding - activate_ipi()
um: common-offsets.h debriding...
um, x86: bury crypto_tfm_ctx_offset
um: unexport handle_page_fault()
um: remove a dangling extern of syscall_trace()
um: kill unused cpu()
uml/i386: missing include in barrier.h
um: stop polluting the namespace with registers.h contents
logic_io instance of iounmap() needs volatile on argument
um: move amd64 variant of mmap(2) to arch/x86/um/syscalls_64.c
...
JFFS2:
- Fix for a deadlock in jffs2_write_begin()
UBI:
- Fixes in comments
UBIFS:
- Expose error counters in sysfs
- Many bugfixes found by Hulk Robot and others
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmHbQUIWHHJpY2hhcmRA
c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wQRhD/9dMymYBFc0pku9SPgQrbieCXTf
O73rtO/MQTgIZR7y0VueosoUE+6P5672VKMLKt8Dp67tqoqI/LI8YiC0eYGQwklm
U7Ui3ze3ClNoRoPWflFKpm7c9/5UthWpUdVIRQ0Q3U7qvUPHswlPMoEKzTdMZqXa
zaOKGuUGO+4USBod4Jooo5I8KyTUezvjVQRii2xcjIy9LOgKNXImuGIX4QnLkH8Q
GbzgSMtYsHvdo8pqE68sZJ8D7vgEpOHEKTqcVroqO3dO8cmnk41pI5hNMnYherN3
+2yPefRza0ukxowTquM+eJB7D+/Req1ZjpthRMGVtStmmhXxYwp+wX4JLWta6Tr2
A6mgd3j8IxU0BlhP+AAXwIcIPvYyPtGYcfY8tqjorErHHOESvk8Hhifmq49pB8qy
UpIPMGSohZDrIIN4DIvb5tahVI0lYq7ozgPz9MJbvVfTI5Z63lntbfwGydsLCJLy
VLuVl3tZaMg/Pwr0qIGERQbgAqTJ0nzeH68CumSTmQxOFR9J51H2jQZLJMkZOohn
b7B6XkJ4H8xv6U2cFU9JjaaNJCJK6A3DcdCJKyzIYKwoWKPXjEGPmYEg2SgKFIhY
jzAUYUlN+ZBI4UTZH/ua627pdgzjR/ho1Y6tqd1OetlGzZq1a6UnTmXifud8h/mX
B3kRcodjNzbEm9dx2w==
=rpfR
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull JFFS2, UBI and UBIFS updates from Richard Weinberger:
"JFFS2:
- Fix for a deadlock in jffs2_write_begin()
UBI:
- Fixes in comments
UBIFS:
- Expose error counters in sysfs
- Many bugfixes found by Hulk Robot and others"
* tag 'for-linus-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
jffs2: GC deadlock reading a page that is used in jffs2_write_begin()
ubifs: read-only if LEB may always be taken in ubifs_garbage_collect
ubifs: fix double return leb in ubifs_garbage_collect
ubifs: fix slab-out-of-bounds in ubifs_change_lp
ubifs: fix snprintf() length check
ubifs: Document sysfs nodes
ubifs: Export filesystem error counters
ubifs: Error path in ubifs_remount_rw() seems to wrongly free write buffers
ubifs: Make use of the helper macro kthread_run()
ubi: Fix a mistake in comment
ubifs: Fix spelling mistakes
This set includes the normal collection of minor fixes and
cleanups, new kmem caches for network messaging structs,
a start on some basic tracepoints, and some new debugfs
files for inserting test messages.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJh3J8BAAoJEDgbc8f8gGmqC0IP/0xCyLKDLmv1KcR/SbUxiP9Y
ZmiuF1FdxqRLPN5HfU1GU4734EbWEiIZFGXDk36b4kE3UpbeddUMA6dDf9sDOJ73
H5BYgNfSgzybIq1D3wZe4C3nH4YTSrqvZhAI2bD3mcNnUDx20LzUMaGqLkzJcsOa
iFDVDlkNrXee/RPSuZDv/5U+L5YEC7WlKEHJ2u/INAFitakb0i1ChJGAgrJZADCw
giznPhfkmWyesgfOxOQ7JcxcVQSedQ+7vtHYhZ5tZ5v1h2yM9LPYJ0bNpNKfHUiA
0gf0DIpOjt4q9ClYxCsaLeK0t32qWOjiNoZaAm+lrLvZWE75CC+WP6KrDR+2x3p5
JlgBZ24h3v7t/ldYmEo7SSZ/1lfb4+0fyk7UjPzko9ErhRFrAopAfl924bRLN3ES
9iSgDslBT5apivNwrByDKI9flhDpMnfWtSufnYeMzFAbHzJCWJuUS8bbi9M+K0mK
ENjmXQu1OVOmht8WM7HOA91tIUcLhr/YD2fKtYNx054TErKKRjlRb/V+NHffyYJc
SqqyaLsAamhMBcLrAk3a88hj6bp8N4NYwrMQMus0xv3aZ23OaGCvtrj7fXuYMlcu
sjLOTQfFFDPN1B1WvI18TDdh/TIgP/8AC+KbHf4Y+hbntO0BgEzL1/CKqT3WqN8y
hAnK+Avj6CILRvO8ardm
=Cst4
-----END PGP SIGNATURE-----
Merge tag 'dlm-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes the normal collection of minor fixes and cleanups,
new kmem caches for network messaging structs, a start on some basic
tracepoints, and some new debugfs files for inserting test messages"
* tag 'dlm-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: (32 commits)
fs: dlm: print cluster addr if non-cluster node connects
fs: dlm: memory cache for lowcomms hotpath
fs: dlm: memory cache for writequeue_entry
fs: dlm: memory cache for midcomms hotpath
fs: dlm: remove wq_alloc mutex
fs: dlm: use event based wait for pending remove
fs: dlm: check for pending users filling buffers
fs: dlm: use list_empty() to check last iteration
fs: dlm: fix build with CONFIG_IPV6 disabled
fs: dlm: replace use of socket sk_callback_lock with sock_lock
fs: dlm: don't call kernel_getpeername() in error_report()
fs: dlm: fix potential buffer overflow
fs: dlm:Remove unneeded semicolon
fs: dlm: remove double list_first_entry call
fs: dlm: filter user dlm messages for kernel locks
fs: dlm: add lkb waiters debugfs functionality
fs: dlm: add lkb debugfs functionality
fs: dlm: allow create lkb with specific id range
fs: dlm: add debugfs rawmsg send functionality
fs: dlm: let handle callback data as void
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmHdubYUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpngxAAg+N74AZhPeUoPsvbVf4LKJ8eJRTY
4329sUjORwpy2g2wN1I53vOnpSjFQ0uWxaOwsuBTmTzM2lr9+J3w/a7Chx5q6M7D
o8dHh4BAwKGmXoCqMxvSiBtFiD/cht7YjjGGb+Q6+qa8viLky3Oksps9SsaZqMCf
RD+33fflp3eppPZTVgo9n+zuxbMz4wT9GpS/GNZy8GhKhPM+v5cPCw1pV0Wy1b7T
2FXsKBtk+odTzbfj88gQ+b2hoRMMQ2FLqIsqyJKajMuBwiBpuPzrm03J2B9N1OJ0
R0Ir0mXV627SoQ4N7v12MfmSF3+XMUTEMrdxMGbjsOz35TtspfOc5+mmveVF4DBD
T4zsK9ju6N/4OtsetOB3MF1AV5Hj0zOvuFZ1y71Bk9RBAJvLrzbHc17KtMEYbVgQ
XUQYY6IGYiVBr8n8x9t3LaPhOzYXwvgFwAmetZE8vYkLZQEQlgRVYJymFR8iluLo
oIPy1ciZgyMRDCa1/OHAIV604qwTChRLzO6WnDhdwWPwF/Sx4cHnhNVyR/5cp1ub
GG6E4o/jopeblftQHO5j+GIZAAwo8abU4ahJXUJaPQVdyY3VjOPhd+ooTrNbEEug
qijL1fIojcalDyCG2cIsXlFtXSb9Txr3L1fxOkluwmwVdR9LHJSEeDxYvIN20L/+
yqnHctyg9+HNsIc=
=icLi
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
"Various minor gfs2 cleanups and fixes"
* tag 'gfs2-v5.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: dump inode object for iopen glocks
gfs2: Fix gfs2_instantiate description
gfs2: Remove redundant check for GLF_INSTANTIATE_NEEDED
gfs2: remove redundant set of INSTANTIATE_NEEDED
gfs2: Fix __gfs2_holder_init function name in kernel-doc comment
I was poking around in the directory code while diagnosing online fsck
bugs, and noticed that xfs_readdir doesn't actually take the directory
ILOCK when it calls xfs_dir2_isblock. xfs_dir_open most probably loaded
the data fork mappings and the VFS took i_rwsem (aka IOLOCK_SHARED) so
we're protected against writer threads, but we really need to follow the
locking model like we do in other places.
To avoid unnecessarily cycling the ILOCK for fairly small directories,
change the block/leaf _getdents functions to consume the ILOCK hold that
the parent readdir function took to decide on a _getdents implementation.
It is ok to cycle the ILOCK in readdir because the VFS takes the IOLOCK
in the appropriate mode during lookups and writes, and we don't want to
be holding the ILOCK when we copy directory entries to userspace in case
there's a page fault. We really only need it to protect against data
fork lookups, like we do for other files.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls. In addition the usual
large number of clean ups and bug fixes, in particular for the
fast_commit feature.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmHcr8sACgkQ8vlZVpUN
gaOuMwgAgunXEni8yED1QXAp+F1nigxx8K6O6fpjzVjiUFlH/cgc3M5PEZiV7Pgj
rIsNliCLQjDQPA7XxInvMuDXEvGRdleaqGS5MCKBR9898acaCTn0eBCK8ZKfhlIs
tJus706kOjy8IXRgEJkCrxEdrsJ57rgqHk/bstipNVg4lnfYYSUKH8k0ob4aaoiW
irufs1dSo7KmpwL0c/z9yIei3ST4DA2Dc/eBS0hsdgKtr/zMmwSanxqbQJTXz8ra
FTZ23HAu/zxWZ9GM4p3DhuVaMWaulm6eEqJKO+loeU9xOeItn0WmSJW260KR0cmj
0nPAflMmpQVBhtO9gjXcSiORbd6/zg==
=9HNa
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Convert ext4 to use the new mount API, and add support for the
FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls.
In addition the usual large number of clean ups and bug fixes, in
particular for the fast_commit feature"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (48 commits)
ext4: don't use the orphan list when migrating an inode
ext4: use BUG_ON instead of if condition followed by BUG
ext4: fix a copy and paste typo
ext4: set csum seed in tmp inode while migrating to extents
ext4: remove unnecessary 'offset' assignment
ext4: remove redundant o_start statement
ext4: drop an always true check
ext4: remove unused assignments
ext4: remove redundant statement
ext4: remove useless resetting io_end_size in mpage_process_page()
ext4: allow to change s_last_trim_minblks via sysfs
ext4: change s_last_trim_minblks type to unsigned long
ext4: implement support for get/set fs label
ext4: only set EXT4_MOUNT_QUOTA when journalled quota file is specified
ext4: don't use kfree() on rcu protected pointer sbi->s_qf_names
ext4: avoid trim error on fs with small groups
ext4: fix an use-after-free issue about data=journal writeback mode
ext4: fix null-ptr-deref in '__ext4_journal_ensure_credits'
ext4: initialize err_blk before calling __ext4_get_inode_loc
ext4: fix a possible ABBA deadlock due to busy PA
...
- Fix log recovery with da btree buffers when metauuid is in use.
- Fix type coercion problems in xattr buffer size validation.
- Fix a bug in online scrub dir leaf bestcount checking.
- Only run COW recovery when recovering the log.
- Fix symlink target buffer UAF problems and symlink locking problems
by not exposing xfs innards to the VFS.
- Fix incorrect quotaoff lock usage.
- Don't let transactions cancel cleanly if they have deferred work
items attached.
- Fix a UAF when we're deciding if we need to relog an intent item.
- Reduce kvmalloc overhead for log shadow buffers.
- Clean up sysfs attr group usage.
- Fix a bug where scrub's bmap/rmap checking could race with a quota
file block allocation due to insufficient locking.
- Teach scrub to complain about invalid project ids.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHXOL0ACgkQ+H93GTRK
tOulDQ/+PwvFyROQePQo+5Z7fDb92VAKX6UxFYzWZGXj3zhRmTX7fvNSniPxKqHO
K0GMnZ9rJxvuo/MRuSEPILl+2rKh59Efo9IPcs2TR+xO+vI/EKnch2/gPLBN1y7Y
A4peQuK6CvUjYTyMazR2PcK8y76ZW8iXyuQMxnpmYldZBCZ+v7myVoVlPCRrZ68W
EoEQ/cDf4fmsycSG4Eh9WTddbqd9bfl1+jXEz/pzlbhJSvQOaqynR+21ahxhEQ/f
CzjSM4kXQ9AWTfHOPnEXs+Hf02NfcKRMuguzO69gJZhooEXJlgeYAZXUKqXpdmbC
2Tc+x/rJnQv9PtzEw/WoWMem4rhwjBX7cVi2E/SJHB46MLI5rjhQilIdWzecWd8q
nOrjIZlReI0vxYSqe1ifejxiCsV6gv3XdyLNacjxf7ISkU4XF0caO8f+3/ocIScV
E7k+XN13pArUiRZXOt8UL571Hoa6nhtHjyAUg6i62AFI2R9XYUEjEXKMwX7ovRVO
8eGdeQqioHkmOkF8HGyyurB9+itccuW8dtl7YOqEQJMGXeyk7rsvpcPahKKKoiDr
8w3bpZlNiOfZEnRcNjO9B/Egu6LLMudti2ptwlUvfvWsfa8VTO8v0h5pqVhwuwrk
lwQThWJ5THe/MptWybdZ3kSjhMvzqi4GO75T30Hj6XhecC3dFfo=
=RYEb
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.17-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"The big new feature here is that the mount code now only bothers to
try to free stale COW staging extents if the fs unmounted uncleanly.
This should reduce mount times, particularly on filesystems supporting
reflink and containing a large number of allocation groups.
Everything else this cycle are bugfixes, as the iomap folios
conversion should be plenty enough excitement for anyone. That and I
ran out of brain bandwidth after Thanksgiving last year.
Summary:
- Fix log recovery with da btree buffers when metauuid is in use.
- Fix type coercion problems in xattr buffer size validation.
- Fix a bug in online scrub dir leaf bestcount checking.
- Only run COW recovery when recovering the log.
- Fix symlink target buffer UAF problems and symlink locking problems
by not exposing xfs innards to the VFS.
- Fix incorrect quotaoff lock usage.
- Don't let transactions cancel cleanly if they have deferred work
items attached.
- Fix a UAF when we're deciding if we need to relog an intent item.
- Reduce kvmalloc overhead for log shadow buffers.
- Clean up sysfs attr group usage.
- Fix a bug where scrub's bmap/rmap checking could race with a quota
file block allocation due to insufficient locking.
- Teach scrub to complain about invalid project ids"
* tag 'xfs-5.17-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: warn about inodes with project id of -1
xfs: hold quota inode ILOCK_EXCL until the end of dqalloc
xfs: Remove redundant assignment of mp
xfs: reduce kvmalloc overhead for CIL shadow buffers
xfs: sysfs: use default_groups in kobj_type
xfs: prevent UAF in xfs_log_item_in_current_chkpt
xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list()
xfs: Fix comments mentioning xfs_ialloc
xfs: check sb_meta_uuid for dabuf buffer recovery
xfs: fix a bug in the online fsck directory leaf1 bestcount check
xfs: only run COW extent recovery when there are no live extents
xfs: don't expose internal symlink metadata buffers to the vfs
xfs: fix quotaoff mutex usage now that we don't support disabling it
xfs: shut down filesystem if we xfs_trans_cancel with deferred work items
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmHcX64ACgkQxWXV+ddt
WDsIyA//UGMG0waxz40RksQL+4AnTIJ+apmmxq5VGEE8y6aUPAGKm++5Cs9cC+Os
5xuZhNnD8hjB+eWH+tRCx8ll+T4g4skGKDCyD0fWNtZAlW8pnsIbztdO0nNYx9C0
V++vu/hQR6M8E7ORlayEKBWy2/UnBG5p/XVLPG4RJ4vMETJPl2RLWVDpu3dj09kf
YnD3AY0vmKEyCu/b9NtSzfZMO2/lXT2U41ezLJJfmPAXcMJ0EeSAazACVDQyq24p
wnnr6xmdo7ZR0oGFLUmBmfxbKwd3l0JIUsi/XRysXe+8y7raIE0gKCg7NpCS276T
BZrKmefxHhdMCA1HLlH6AkrKmQUgIaceLkXTanTYv4cnVzb6XoeV6R4IO9/JQ0yv
YsdCL7eZ4Vl3ToPlQkYWdwUNP5UjVMg5qMwxchigbwq7jViLJtiu3WKy0O0TitzB
n4o0cCdlv3lHRp8FS6cFmbCrsavFT8/q3vz/aRdkrojOKE0jEWVJiz0bf39j5BVO
IiJ3RF3kbpG7TZz0+eNUbgebME8zwaWfyd6t2L7Ztvx9F4elzSW95iwGc9//TZvl
ciNFI9LQvnZRymUItxg0HNXfbJMXJAE9ImTxA8HLkfQYpaBlwbt4avkQK56LhpN+
nFCd5cxJy5HLFIvkRDfqpF+C247p5FICzd/hs59/cXlCynulglg=
=9lg+
-----END PGP SIGNATURE-----
Merge tag 'for-5.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This end of the year branch is intentionally not that exciting. Most
of the changes are under the hood, but there are some minor user
visible improvements and several performance improvements too.
Features:
- make send work with concurrent block group relocation.
We're not allowed to prevent send failing or silently producing
some bad stream but with more fine grained locking and checks it's
possible. The send vs deduplication exclusion could reuse the same
logic in the future.
- new exclusive operation 'balance paused' to allow adding a device
to filesystem with paused balance
- new sysfs file for fsid stored in the per-device directory to help
distinguish devices when seeding is enabled, the fsid may differ
from the one reported by the filesystem
Performance improvements:
- less metadata needed for directory logging, directory deletion is
20-40% faster
- in zoned mode, cache zone information during mount to speed up
repeated queries (about 50% speedup)
- free space tree entries get indexed and searched by size (latency
-30%, search run time -30%)
- less contention in tree node locking when inserting a key and no
splits are needed (files/sec in fsmark improves by 1-20%)
Fixes:
- fix ENOSPC failure when attempting direct IO write into NOCOW range
- fix deadlock between quota enable and other quota operations
- global reserve minimum calculations fixed to account for free space
tree
- in zoned mode, fix condition for chunk allocation that may not find
the right zone for reuse and could lead to early ENOSPC
Core:
- global reserve stealing got simplified and cleaned up in evict
- remove async transaction commit based on manual transaction refs,
reuse existing kthread and mechanisms to let it commit transaction
before timeout
- preparatory work for extent tree v2, add wrappers for global tree
roots, truncation path cleanups
- remove readahead framework, it's a bit overengineered and used only
for scrub, and yet it does not cover all its needs, there is
another readahead built in the b-tree search that is now used,
performance drop on HDD is about 5% which is acceptable and scrub
is often throttled anyway, on SSDs there's no reported drop but
slight improvement
- self tests report extent tree state when error occurs
- replace assert with debugging information when an uncommitted
transaction is found at unmount time
Other:
- error handling improvements
- other cleanups and refactoring"
* tag 'for-5.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
btrfs: output more debug messages for uncommitted transaction
btrfs: respect the max size in the header when activating swap file
btrfs: fix argument list that the kdoc format and script verified
btrfs: remove unnecessary parameter type from compression_decompress_bio
btrfs: selftests: dump extent io tree if extent-io-tree test failed
btrfs: scrub: cleanup the argument list of scrub_stripe()
btrfs: scrub: cleanup the argument list of scrub_chunk()
btrfs: remove reada infrastructure
btrfs: scrub: use btrfs_path::reada for extent tree readahead
btrfs: scrub: remove the unnecessary path parameter for scrub_raid56_parity()
btrfs: refactor unlock_up
btrfs: skip transaction commit after failure to create subvolume
btrfs: zoned: fix chunk allocation condition for zoned allocator
btrfs: add extent allocator hook to decide to allocate chunk or not
btrfs: zoned: unset dedicated block group on allocation failure
btrfs: zoned: drop redundant check for REQ_OP_ZONE_APPEND and btrfs_is_zoned
btrfs: zoned: sink zone check into btrfs_repair_one_zone
btrfs: zoned: simplify btrfs_check_meta_write_pointer
btrfs: zoned: encapsulate inode locking for zoned relocation
btrfs: sysfs: add devinfo/fsid to retrieve actual fsid from the device
...
- add sysfs interface and a sysfs node to control sync decompression;
- add tail-packing inline support for compressed files;
- get rid of erofs_get_meta_page().
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYduH/BEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBNPdAP9hKomD1hRiFeCWlLA1nDXYkqGbbt6+D3HT
cm4G7DgVBAD+O+RWv6JVYg1zAAFlKmxqEKEfoDLKI65wAjH1V/h/dQE=
=cEam
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, tail-packing data inline for compressed files is now
supported so that tail pcluster can be stored and read together with
inode metadata in order to save data I/O and storage space.
In addition to that, to prepare for the upcoming subpage, folio and
fscache features, we also introduce meta buffers to get rid of
erofs_get_meta_page() since it was too close to the page itself.
In addition, in order to show supported kernel features and control
sync decompression strategy, new sysfs nodes are introduced in this
cycle as well.
Summary:
- add sysfs interface and a sysfs node to control sync decompression
- add tail-packing inline support for compressed files
- get rid of erofs_get_meta_page()"
* tag 'erofs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: use meta buffers for zmap operations
erofs: use meta buffers for xattr operations
erofs: use meta buffers for super operations
erofs: use meta buffers for inode operations
erofs: introduce meta buffer operations
erofs: add on-disk compressed tail-packing inline support
erofs: support inline data decompression
erofs: support unaligned data decompression
erofs: introduce z_erofs_fixup_insize
erofs: tidy up z_erofs_lz4_decompress
erofs: clean up erofs_map_blocks tracepoints
erofs: Replace zero-length array with flexible-array member
erofs: add sysfs node to control sync decompression strategy
erofs: add sysfs interface
erofs: rename lz4_0pading to zero_padding
In 9p, afs ceph, and nfs, gfpflags_allow_blocking() (which wraps a
test for __GFP_DIRECT_RECLAIM being set) is used to determine if
->releasepage() should wait for the completion of a DIO write to fscache
with something like:
if (folio_test_fscache(folio)) {
if (!gfpflags_allow_blocking(gfp) || !(gfp & __GFP_FS))
return false;
folio_wait_fscache(folio);
}
Instead, current_is_kswapd() should be used instead.
Note that this is based on a patch originally by Zhaoyang Huang[1]. In
addition to extending it to the other network filesystems and putting it on
top of my fscache rewrite, it also needs to include linux/swap.h in a bunch
of places. Can current_is_kswapd() be moved to linux/mm.h?
Changes
=======
ver #5:
- Dropping the changes for cifs.
Originally-signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <smfrench@gmail.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: linux-cachefs@redhat.com
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/1638952658-20285-1-git-send-email-huangzhaoyang@gmail.com/ [1]
Link: https://lore.kernel.org/r/164021590773.640689.16777975200823659231.stgit@warthog.procyon.org.uk/ # v4
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYdRCkgAKCRCRxhvAZXjc
olrvAQCdp8LWkT8TauJSl8wmUm3mZhNy+5+fXuCUSwe3PyUtTQEAq4fxm41JpG8u
WCZTrrxVhaXwgUY3aWzzeQnLCZjtEQw=
=woqV
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull fs idmapping updates from Christian Brauner:
"This contains the work to enable the idmapping infrastructure to
support idmapped mounts of filesystems mounted with an idmapping.
In addition this contains various cleanups that avoid repeated
open-coding of the same functionality and simplify the code in quite a
few places.
We also finish the renaming of the mapping helpers we started a few
kernel releases back and move them to a dedicated header to not
continue polluting the fs header needlessly with low-level idmapping
helpers. With this series the fs header only contains idmapping
helpers that interact with fs objects.
Currently we only support idmapped mounts for filesystems mounted
without an idmapping themselves. This was a conscious decision
mentioned in multiple places (cf. [1]).
As explained at length in [3] it is perfectly fine to extend support
for idmapped mounts to filesystem's mounted with an idmapping should
the need arise. The need has been there for some time now (cf. [2]).
Before we can port any filesystem that is mountable with an idmapping
to support idmapped mounts in the coming cycles, we need to first
extend the mapping helpers to account for the filesystem's idmapping.
This again, is explained at length in our documentation at [3] and
also in the individual commit messages so here's an overview.
Currently, the low-level mapping helpers implement the remapping
algorithms described in [3] in a simplified manner as we could rely on
the fact that all filesystems supporting idmapped mounts are mounted
without an idmapping.
In contrast, filesystems mounted with an idmapping are very likely to
not use an identity mapping and will instead use a non-identity
mapping. So the translation step from or into the filesystem's
idmapping in the remapping algorithm cannot be skipped for such
filesystems.
Non-idmapped filesystems and filesystems not supporting idmapped
mounts are unaffected by this change as the remapping algorithms can
take the same shortcut as before. If the low-level helpers detect that
they are dealing with an idmapped mount but the underlying filesystem
is mounted without an idmapping we can rely on the previous shortcut
and can continue to skip the translation step from or into the
filesystem's idmapping. And of course, if the low-level helpers detect
that they are not dealing with an idmapped mount they can simply
return the relevant id unchanged; no remapping needs to be performed
at all.
These checks guarantee that only the minimal amount of work is
performed. As before, if idmapped mounts aren't used the low-level
helpers are idempotent and no work is performed at all"
Link: 2ca4dcc490 ("fs/mount_setattr: tighten permission checks") [1]
Link: https://github.com/containers/podman/issues/10374 [2]
Link: Documentations/filesystems/idmappings.rst [3]
Link: a65e58e791 ("fs: document and rename fsid helpers") [4]
* tag 'fs.idmapped.v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: support mapped mounts of mapped filesystems
fs: add i_user_ns() helper
fs: port higher-level mapping helpers
fs: remove unused low-level mapping helpers
fs: use low-level mapping helpers
docs: update mapping documentation
fs: account for filesystem mappings
fs: tweak fsuidgid_has_mapping()
fs: move mapping helpers
fs: add is_idmapped_mnt() helper
A task can end up indefinitely sleeping in do_select() ->
poll_schedule_timeout() when the following race happens:
TASK1 (thread1) TASK2 TASK1 (thread2)
do_select()
setup poll_wqueues table
with 'fd'
write data to 'fd'
pollwake()
table->triggered = 1
closes 'fd' thread1 is
waiting for
poll_schedule_timeout()
- sees table->triggered
table->triggered = 0
return -EINTR
loop back in do_select()
But at this point when TASK1 loops back, the fdget() in the setup of
poll_wqueues fails. So now so we never find 'fd' is ready for reading
and sleep in poll_schedule_timeout() indefinitely.
Treat an fd that got closed as a fd on which some event happened. This
makes sure cannot block indefinitely in do_select().
Another option would be to return -EBADF in this case but that has a
potential of subtly breaking applications that excercise this behavior
and it happens to work for them. So returning fd as active seems like a
safer choice.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before this patch, glock dumps would not dump the gl_object for iopen
glocks. This information can help us debug problems related to eviction:
when AN iopen glock is blocked we can see the status of its underlying
inode and its flags, etc.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reading from a file that was just extended by a write, but the write had
not yet reached the server would return ENODATA as illustrated by this
command:
$ xfs_io -c 'open -ft test' -c 'w 4096 1000' -c 'r 0 1000'
wrote 1000/1000 bytes at offset 4096
1000.000000 bytes, 1 ops; 0.0001 sec (5.610 MiB/sec and 5882.3529 ops/sec)
pread: No data available
Fix this case by having netfs assume zeroes when reads from server come
short like AFS and CEPH do
Link: https://lkml.kernel.org/r/20220110111444.926753-1-asmadeus@codewreck.org
Cc: stable@vger.kernel.org
Fixes: eb497943fa ("9p: Convert to using the netfs helper lib to do reads and caching")
Co-authored-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Core
----
- Defer freeing TCP skbs to the BH handler, whenever possible,
or at least perform the freeing outside of the socket lock section
to decrease cross-CPU allocator work and improve latency.
- Add netdevice refcount tracking to locate sources of netdevice
and net namespace refcount leaks.
- Make Tx watchdog less intrusive - avoid pausing Tx and restarting
all queues from a single CPU removing latency spikes.
- Various small optimizations throughout the stack from Eric Dumazet.
- Make netdev->dev_addr[] constant, force modifications to go via
appropriate helpers to allow us to keep addresses in ordered data
structures.
- Replace unix_table_lock with per-hash locks, improving performance
of bind() calls.
- Extend skb drop tracepoint with a drop reason.
- Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW.
BPF
---
- New helpers:
- bpf_find_vma(), find and inspect VMAs for profiling use cases
- bpf_loop(), runtime-bounded loop helper trading some execution
time for much faster (if at all converging) verification
- bpf_strncmp(), improve performance, avoid compiler flakiness
- bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt()
for tracing programs, all inlined by the verifier
- Support BPF relocations (CO-RE) in the kernel loader.
- Further the support for BTF_TYPE_TAG annotations.
- Allow access to local storage in sleepable helpers.
- Convert verifier argument types to a composable form with different
attributes which can be shared across types (ro, maybe-null).
- Prepare libbpf for upcoming v1.0 release by cleaning up APIs,
creating new, extensible ones where missing and deprecating those
to be removed.
Protocols
---------
- WiFi (mac80211/cfg80211):
- notify user space about long "come back in N" AP responses,
allow it to react to such temporary rejections
- allow non-standard VHT MCS 10/11 rates
- use coarse time in airtime fairness code to save CPU cycles
- Bluetooth:
- rework of HCI command execution serialization to use a common
queue and work struct, and improve handling errors reported
in the middle of a batch of commands
- rework HCI event handling to use skb_pull_data, avoiding packet
parsing pitfalls
- support AOSP Bluetooth Quality Report
- SMC:
- support net namespaces, following the RDMA model
- improve connection establishment latency by pre-clearing buffers
- introduce TCP ULP for automatic redirection to SMC
- Multi-Path TCP:
- support ioctls: SIOCINQ, OUTQ, and OUTQNSD
- support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT,
IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
- support cmsgs: TCP_INQ
- improvements in the data scheduler (assigning data to subflows)
- support fastclose option (quick shutdown of the full MPTCP
connection, similar to TCP RST in regular TCP)
- MCTP (Management Component Transport) over serial, as defined by
DMTF spec DSP0253 - "MCTP Serial Transport Binding".
Driver API
----------
- Support timestamping on bond interfaces in active/passive mode.
- Introduce generic phylink link mode validation for drivers which
don't have any quirks and where MAC capability bits fully express
what's supported. Allow PCS layer to participate in the validation.
Convert a number of drivers.
- Add support to set/get size of buffers on the Rx rings and size of
the tx copybreak buffer via ethtool.
- Support offloading TC actions as first-class citizens rather than
only as attributes of filters, improve sharing and device resource
utilization.
- WiFi (mac80211/cfg80211):
- support forwarding offload (ndo_fill_forward_path)
- support for background radar detection hardware
- SA Query Procedures offload on the AP side
New hardware / drivers
----------------------
- tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with
real-time requirements for isochronous communication with protocols
like OPC UA Pub/Sub.
- Qualcomm BAM-DMUX WWAN - driver for data channels of modems
integrated into many older Qualcomm SoCs, e.g. MSM8916 or
MSM8974 (qcom_bam_dmux).
- Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch
driver with support for bridging, VLANs and multicast forwarding
(lan966x).
- iwlmei driver for co-operating between Intel's WiFi driver and
Intel's Active Management Technology (AMT) devices.
- mse102x - Vertexcom MSE102x Homeplug GreenPHY chips
- Bluetooth:
- MediaTek MT7921 SDIO devices
- Foxconn MT7922A
- Realtek RTL8852AE
Drivers
-------
- Significantly improve performance in the datapaths of:
lan78xx, ax88179_178a, lantiq_xrx200, bnxt.
- Intel Ethernet NICs:
- igb: support PTP/time PEROUT and EXTTS SDP functions on
82580/i354/i350 adapters
- ixgbevf: new PF -> VF mailbox API which avoids the risk of
mailbox corruption with ESXi
- iavf: support configuration of VLAN features of finer granularity,
stacked tags and filtering
- ice: PTP support for new E822 devices with sub-ns precision
- ice: support firmware activation without reboot
- Mellanox Ethernet NICs (mlx5):
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- support TC forwarding when tunnel encap and decap happen between
two ports of the same NIC
- dynamically size and allow disabling various features to save
resources for running in embedded / SmartNIC scenarios
- Broadcom Ethernet NICs (bnxt):
- use page frag allocator to improve Rx performance
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- Other Ethernet NICs:
- amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support
- Microsoft cloud/virtual NIC (mana):
- add XDP support (PASS, DROP, TX)
- Mellanox Ethernet switches (mlxsw):
- initial support for Spectrum-4 ASICs
- VxLAN with IPv6 underlay
- Marvell Ethernet switches (prestera):
- support flower flow templates
- add basic IP forwarding support
- NXP embedded Ethernet switches (ocelot & felix):
- support Per-Stream Filtering and Policing (PSFP)
- enable cut-through forwarding between ports by default
- support FDMA to improve packet Rx/Tx to CPU
- Other embedded switches:
- hellcreek: improve trapping management (STP and PTP) packets
- qca8k: support link aggregation and port mirroring
- Qualcomm 802.11ax WiFi (ath11k):
- qca6390, wcn6855: enable 802.11 power save mode in station mode
- BSS color change support
- WCN6855 hw2.1 support
- 11d scan offload support
- scan MAC address randomization support
- full monitor mode, only supported on QCN9074
- qca6390/wcn6855: report signal and tx bitrate
- qca6390: rfkill support
- qca6390/wcn6855: regdb.bin support
- Intel WiFi (iwlwifi):
- support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS)
in cooperation with the BIOS
- support for Optimized Connectivity Experience (OCE) scan
- support firmware API version 68
- lots of preparatory work for the upcoming Bz device family
- MediaTek WiFi (mt76):
- Specific Absorption Rate (SAR) support
- mt7921: 160 MHz channel support
- RealTek WiFi (rtw88):
- Specific Absorption Rate (SAR) support
- scan offload
- Other WiFi NICs
- ath10k: support fetching (pre-)calibration data from nvmem
- brcmfmac: configure keep-alive packet on suspend
- wcn36xx: beacon filter support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmHbkZAACgkQMUZtbf5S
IruYkQ//XX7BggcwBfukPK83j0dONolClijqKcKR08g4vB5L8GXvv6OErKIWrh4k
h8JanCH352ZkbCSw3MvFdm825UYQv8vPMd6Qks/LJ4aSKqCuy4MIlAo+yOw4Km3O
i7++lRfma6DqHHI59wvLjWoxZSPu8lL+rI8UsZ5qMOlnNlGAOXsNrzRjaqQ3FddY
AMxZeBUtrPqUCCQZFq3U8apkYzUp7CA/3XR9zRcja3uPbrtOV2G+4whRF90qGNWz
Tm/QvJ9F/Ab292cbhxR4KuaQ3hUhaCQyDjbZk3+FZzZpAVhYTVqcNjny6+yXmbiP
NXRtwemnl1NlWKMnJM8lEeY48u626tRIkxA/Wtd61uoO5uKUSxfGP+UpUi+DfXbF
yIw50VQ7L2bpxXP/HjtmhVgZDaWKYyh22Zw4Hp/muMJz0hgUB0KODY3tf2jUWbjJ
0oEgocWyzhhwMQKqupTDCIaRgIs2ewYr4ZrFDhI3HnHC/vv1VjoPRUPIyxwppD2N
cXvZb3B1sWK8iX5gCbISGzyU4bB7I0rvJSTU42ueti7n6NqRFZ79qHQpYnnY+JdO
z1qOwY/d/yWfBoXVKRtRg2qz6CdEt5BQklwAgVEBgrFpf58gp694EwGMb1htY14J
r/k9bVpmyIFpUnBH2CPMRfBVA3tUTqzyzzFV4AMw40NYLKmhLdo=
=KLm3
-----END PGP SIGNATURE-----
Merge tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core
----
- Defer freeing TCP skbs to the BH handler, whenever possible, or at
least perform the freeing outside of the socket lock section to
decrease cross-CPU allocator work and improve latency.
- Add netdevice refcount tracking to locate sources of netdevice and
net namespace refcount leaks.
- Make Tx watchdog less intrusive - avoid pausing Tx and restarting
all queues from a single CPU removing latency spikes.
- Various small optimizations throughout the stack from Eric Dumazet.
- Make netdev->dev_addr[] constant, force modifications to go via
appropriate helpers to allow us to keep addresses in ordered data
structures.
- Replace unix_table_lock with per-hash locks, improving performance
of bind() calls.
- Extend skb drop tracepoint with a drop reason.
- Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW.
BPF
---
- New helpers:
- bpf_find_vma(), find and inspect VMAs for profiling use cases
- bpf_loop(), runtime-bounded loop helper trading some execution
time for much faster (if at all converging) verification
- bpf_strncmp(), improve performance, avoid compiler flakiness
- bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt()
for tracing programs, all inlined by the verifier
- Support BPF relocations (CO-RE) in the kernel loader.
- Further the support for BTF_TYPE_TAG annotations.
- Allow access to local storage in sleepable helpers.
- Convert verifier argument types to a composable form with different
attributes which can be shared across types (ro, maybe-null).
- Prepare libbpf for upcoming v1.0 release by cleaning up APIs,
creating new, extensible ones where missing and deprecating those
to be removed.
Protocols
---------
- WiFi (mac80211/cfg80211):
- notify user space about long "come back in N" AP responses,
allow it to react to such temporary rejections
- allow non-standard VHT MCS 10/11 rates
- use coarse time in airtime fairness code to save CPU cycles
- Bluetooth:
- rework of HCI command execution serialization to use a common
queue and work struct, and improve handling errors reported in
the middle of a batch of commands
- rework HCI event handling to use skb_pull_data, avoiding packet
parsing pitfalls
- support AOSP Bluetooth Quality Report
- SMC:
- support net namespaces, following the RDMA model
- improve connection establishment latency by pre-clearing buffers
- introduce TCP ULP for automatic redirection to SMC
- Multi-Path TCP:
- support ioctls: SIOCINQ, OUTQ, and OUTQNSD
- support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT,
IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
- support cmsgs: TCP_INQ
- improvements in the data scheduler (assigning data to subflows)
- support fastclose option (quick shutdown of the full MPTCP
connection, similar to TCP RST in regular TCP)
- MCTP (Management Component Transport) over serial, as defined by
DMTF spec DSP0253 - "MCTP Serial Transport Binding".
Driver API
----------
- Support timestamping on bond interfaces in active/passive mode.
- Introduce generic phylink link mode validation for drivers which
don't have any quirks and where MAC capability bits fully express
what's supported. Allow PCS layer to participate in the validation.
Convert a number of drivers.
- Add support to set/get size of buffers on the Rx rings and size of
the tx copybreak buffer via ethtool.
- Support offloading TC actions as first-class citizens rather than
only as attributes of filters, improve sharing and device resource
utilization.
- WiFi (mac80211/cfg80211):
- support forwarding offload (ndo_fill_forward_path)
- support for background radar detection hardware
- SA Query Procedures offload on the AP side
New hardware / drivers
----------------------
- tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with
real-time requirements for isochronous communication with protocols
like OPC UA Pub/Sub.
- Qualcomm BAM-DMUX WWAN - driver for data channels of modems
integrated into many older Qualcomm SoCs, e.g. MSM8916 or MSM8974
(qcom_bam_dmux).
- Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch driver
with support for bridging, VLANs and multicast forwarding
(lan966x).
- iwlmei driver for co-operating between Intel's WiFi driver and
Intel's Active Management Technology (AMT) devices.
- mse102x - Vertexcom MSE102x Homeplug GreenPHY chips
- Bluetooth:
- MediaTek MT7921 SDIO devices
- Foxconn MT7922A
- Realtek RTL8852AE
Drivers
-------
- Significantly improve performance in the datapaths of: lan78xx,
ax88179_178a, lantiq_xrx200, bnxt.
- Intel Ethernet NICs:
- igb: support PTP/time PEROUT and EXTTS SDP functions on
82580/i354/i350 adapters
- ixgbevf: new PF -> VF mailbox API which avoids the risk of
mailbox corruption with ESXi
- iavf: support configuration of VLAN features of finer
granularity, stacked tags and filtering
- ice: PTP support for new E822 devices with sub-ns precision
- ice: support firmware activation without reboot
- Mellanox Ethernet NICs (mlx5):
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- support TC forwarding when tunnel encap and decap happen between
two ports of the same NIC
- dynamically size and allow disabling various features to save
resources for running in embedded / SmartNIC scenarios
- Broadcom Ethernet NICs (bnxt):
- use page frag allocator to improve Rx performance
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- Other Ethernet NICs:
- amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support
- Microsoft cloud/virtual NIC (mana):
- add XDP support (PASS, DROP, TX)
- Mellanox Ethernet switches (mlxsw):
- initial support for Spectrum-4 ASICs
- VxLAN with IPv6 underlay
- Marvell Ethernet switches (prestera):
- support flower flow templates
- add basic IP forwarding support
- NXP embedded Ethernet switches (ocelot & felix):
- support Per-Stream Filtering and Policing (PSFP)
- enable cut-through forwarding between ports by default
- support FDMA to improve packet Rx/Tx to CPU
- Other embedded switches:
- hellcreek: improve trapping management (STP and PTP) packets
- qca8k: support link aggregation and port mirroring
- Qualcomm 802.11ax WiFi (ath11k):
- qca6390, wcn6855: enable 802.11 power save mode in station mode
- BSS color change support
- WCN6855 hw2.1 support
- 11d scan offload support
- scan MAC address randomization support
- full monitor mode, only supported on QCN9074
- qca6390/wcn6855: report signal and tx bitrate
- qca6390: rfkill support
- qca6390/wcn6855: regdb.bin support
- Intel WiFi (iwlwifi):
- support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS)
in cooperation with the BIOS
- support for Optimized Connectivity Experience (OCE) scan
- support firmware API version 68
- lots of preparatory work for the upcoming Bz device family
- MediaTek WiFi (mt76):
- Specific Absorption Rate (SAR) support
- mt7921: 160 MHz channel support
- RealTek WiFi (rtw88):
- Specific Absorption Rate (SAR) support
- scan offload
- Other WiFi NICs
- ath10k: support fetching (pre-)calibration data from nvmem
- brcmfmac: configure keep-alive packet on suspend
- wcn36xx: beacon filter support"
* tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2048 commits)
tcp: tcp_send_challenge_ack delete useless param `skb`
net/qla3xxx: Remove useless DMA-32 fallback configuration
rocker: Remove useless DMA-32 fallback configuration
hinic: Remove useless DMA-32 fallback configuration
lan743x: Remove useless DMA-32 fallback configuration
net: enetc: Remove useless DMA-32 fallback configuration
cxgb4vf: Remove useless DMA-32 fallback configuration
cxgb4: Remove useless DMA-32 fallback configuration
cxgb3: Remove useless DMA-32 fallback configuration
bnx2x: Remove useless DMA-32 fallback configuration
et131x: Remove useless DMA-32 fallback configuration
be2net: Remove useless DMA-32 fallback configuration
vmxnet3: Remove useless DMA-32 fallback configuration
bna: Simplify DMA setting
net: alteon: Simplify DMA setting
myri10ge: Simplify DMA setting
qlcnic: Simplify DMA setting
net: allwinner: Fix print format
page_pool: remove spinlock in page_pool_refill_alloc_cache()
amt: fix wrong return type of amt_send_membership_update()
...
We probably want to remove the indirect block to extents migration
feature after a deprecation window, but until then, let's fix a
potential data loss problem caused by the fact that we put the
tmp_inode on the orphan list. In the unlikely case where we crash and
do a journal recovery, the data blocks belonging to the inode being
migrated are also represented in the tmp_inode on the orphan list ---
and so its data blocks will get marked unallocated, and available for
reuse.
Instead, stop putting the tmp_inode on the oprhan list. So in the
case where we crash while migrating the inode, we'll leak an inode,
which is not a disaster. It will be easily fixed the next time we run
fsck, and it's better than potentially having blocks getting claimed
by two different files, and losing data as a result.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org