Pull gfs2 setattr updates from Al Viro:
"Make it possible for filesystems to use a generic 'may_setattr()' and
switch gfs2 to using it"
* 'work.gfs2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
gfs2: Switch to may_setattr in gfs2_setattr
fs: Move notify_change permission checks into may_setattr
Pull root filesystem type handling updates from Al Viro:
"Teach init/do_mounts.c to handle non-block filesystems, hopefully
preventing even more special-cased kludges (such as root=/dev/nfs,
etc)"
* 'work.init' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: simplify get_filesystem_list / get_all_fs_names
init: allow mounting arbitrary non-blockdevice filesystems as root
init: split get_fs_names
Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
VM_DENYWRITE.
There are some (minor) user-visible changes:
- We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g. dlopen().
- We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often
is).
- We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
file cannot be denied, and write access to the file will remain
denied until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
s390x, ...) and verified via ltp that especially the relevant tests
(i.e., creat07 and execve04) continue working as expected"
* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
fs: update documentation of get_write_access() and friends
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
mm: remove VM_DENYWRITE
binfmt: remove in-tree usage of MAP_DENYWRITE
kernel/fork: always deny write access to current MM exe_file
kernel/fork: factor out replacing the current MM exe_file
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
As VM_DENYWRITE does no longer exists, let's spring-clean the
documentation of get_write_access() and friends.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYTDKKAAKCRDh3BK/laaZ
PG9PAQCUF0fdBlCKudwSEt5PV5xemycL9OCAlYCd7d4XbBIe9wEA6sVJL9J+OwV2
aF0NomiXtJccE+S9+byjVCyqSzQJGQQ=
=6L2Y
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs update from Miklos Szeredi:
- Copy up immutable/append/sync/noatime attributes (Amir Goldstein)
- Improve performance by enabling RCU lookup.
- Misc fixes and improvements
The reason this touches so many files is that the ->get_acl() method now
gets a "bool rcu" argument. The ->get_acl() API was updated based on
comments from Al and Linus:
Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/
* tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: enable RCU'd ->get_acl()
vfs: add rcu argument to ->get_acl() callback
ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
ovl: use kvalloc in xattr copy-up
ovl: update ctime when changing fileattr
ovl: skip checking lower file's i_writecount on truncate
ovl: relax lookup error on mismatch origin ftype
ovl: do not set overlay.opaque for new directories
ovl: add ovl_allow_offline_changes() helper
ovl: disable decoding null uuid with redirect_dir
ovl: consistent behavior for immutable/append-only inodes
ovl: copy up sync/noatime fileattr flags
ovl: pass ovl_fs to ovl_check_setxattr()
fs: add generic helper for filling statx attribute flags
- Support for server-side disconnect injection via debugfs
- Protocol definitions for new RPC_AUTH_TLS authentication flavor
Performance improvements:
- Reduce page allocator traffic in the NFSD splice read actor
- Reduce CPU utilization in svcrdma's Send completion handler
Notable bug fixes:
- Stabilize lockd operation when re-exporting NFS mounts
- Fix the use of %.*s in NFSD tracepoints
- Fix /proc/sys/fs/nfs/nsm_use_hostnames
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmEqq0AACgkQM2qzM29m
f5dYig/5AaPN2BWYf4D1VkrAS3+zGS+3IN23WVgpbA54jgfjPEH+Aa00YhEQQa0j
Y5u/jE5g/tWvenDefq5BmvdRfZMWCVc2JkngctOSflhaREUWK+HgCkH+5DQs6zUM
rbX7qy0v6wJnEMSlwCKJ2AuZbYw7Bsg2nvOgEbb718/ent3umeoXEK09x3HTWLEp
eVcMU5uicB5wRRPpROYG792oWzUScQ8kyiRCKJfQDoR7bINhBeVHObAIFMBo1UaH
x9CMX4RlPYGmoMYUc+AqcOM7hizucHpXqM1r3oVjQ7FyI+pmDLuLL/3OTjtRUX7+
nYLqNW/PijH9PjFe4BPjGHAUQfKiTIXANAe8VdjQj70D40jYkP+jQ9SPdV+pEgi4
U4azfK3S+85/bRYYq/1alcLiP1+6dgcL++rVvnKESTH9NRgNoEw2WZHeKxXiYaxU
p7oOC4XdnYDwcz/3QVWa0sK2kA5IJHzOsCQR7OilD09NAJ+AbJTAp0H3xFXTllzb
AV2CAEBVZlP+pZYOehuVnKpZPa7YAWx92wRK2anbRUMZN3lF1wWBEOTd6KweIpTx
l2GJSf3GWBqL1x9PjSet/cBusxYjTA+S1hE7KMrsNPhzbvpIgAZEtSqOfn9apDCV
uAFIN2DSiHm3Tv0aFSJWo+CMyKkyktuiS8JFKaFdzCp9NtsBM2M=
=TGkK
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"New features:
- Support for server-side disconnect injection via debugfs
- Protocol definitions for new RPC_AUTH_TLS authentication flavor
Performance improvements:
- Reduce page allocator traffic in the NFSD splice read actor
- Reduce CPU utilization in svcrdma's Send completion handler
Notable bug fixes:
- Stabilize lockd operation when re-exporting NFS mounts
- Fix the use of %.*s in NFSD tracepoints
- Fix /proc/sys/fs/nfs/nsm_use_hostnames"
* tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
nfsd: fix crash on LOCKT on reexported NFSv3
nfs: don't allow reexport reclaims
lockd: don't attempt blocking locks on nfs reexports
nfs: don't atempt blocking locks on nfs reexports
Keep read and write fds with each nlm_file
lockd: update nlm_lookup_file reexport comment
nlm: minor refactoring
nlm: minor nlm_lookup_file argument change
lockd: lockd server-side shouldn't set fl_ops
SUNRPC: Add documentation for the fail_sunrpc/ directory
SUNRPC: Server-side disconnect injection
SUNRPC: Move client-side disconnect injection
SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory
svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free()
nfsd4: Fix forced-expiry locking
rpc: fix gss_svc_init cleanup on failure
SUNRPC: Add RPC_AUTH_TLS protocol numbers
lockd: change the proc_handler for nsm_use_hostnames
sysctl: introduce new proc handler proc_dobool
SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEs2NIACgkQxWXV+ddt
WDsJMQ/+PJ/yXfI85mAeAzTJLWQ0zD6YO3iBhf3wOeyychWC4on435pj+zW8zR/U
/bix25ygoWF4MvGF6p0uyv4Z5mnvkZXE5lapUcJu6wXG7se1QRPH0broTh05IBXK
SnT93Eb9RexaiNFk7DVma9XkviqZ/ZISPtkJ9wYrfIba7j/U/wa+PtEFS7wk58hP
rFQXgV64xm/pcP28YYHfOkCjdyUMdJrnBUvfKOlX6d94lmYbP5lyiTL+XJEXExzN
wPakD0UsnXPr4TRvf+YRTPeFHPPUgyORII7otVUOKmGywWtcJrELX8rXFoW+6GwB
dzZIcSYXHUxU5UrtMbZgiztVBJ+bQY5juYMIrj13eYOMYkijxAqPP84iDO15+TSV
zNqyAVjUglHCGUGjhSpAxnAmtp+IJTZfVAWcvIKq3VqvJtb8tssQsk9bqFjH1xlH
qNJLE57CYe3tjw05K9y0keMh2iJWRWkXZYkgI/zjwo5nreemobpN+3fO4yneVLh7
ecdBmSl/JVSzAB1NamLOCZNGZLUqiiuTvZlJtI6ZsekrN1+4A6QzVcU/MGjSYL1v
C7W0hK0LF+e3xIBkxTKVq8noolsgbmlWacxJq8fZq9HwZy5IVJOVm9STDlCuLaIo
gPr0V0itkclcsMU0CHTyCjMsfuHYUwJZXwg93wKfJf5UCzS4OWU=
=ALO9
-----END PGP SIGNATURE-----
Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The highlights of this round are integrations with fs-verity and
idmapped mounts, the rest is usual mix of minor improvements, speedups
and cleanups.
There are some patches outside of btrfs, namely updating some VFS
interfaces, all straightforward and acked.
Features:
- fs-verity support, using standard ioctls, backward compatible with
read-only limitation on inodes with previously enabled fs-verity
- idmapped mount support
- make mount with rescue=ibadroots more tolerant to partially damaged
trees
- allow raid0 on a single device and raid10 on two devices,
degenerate cases but might be useful as an intermediate step during
conversion to other profiles
- zoned mode block group auto reclaim can be disabled via sysfs knob
Performance improvements:
- continue readahead of node siblings even if target node is in
memory, could speed up full send (on sample test +11%)
- batching of delayed items can speed up creating many files
- fsync/tree-log speedups
- avoid unnecessary work (gains +2% throughput, -2% run time on
sample load)
- reduced lock contention on renames (on dbench +4% throughput,
up to -30% latency)
Fixes:
- various zoned mode fixes
- preemptive flushing threshold tuning, avoid excessive work on
almost full filesystems
Core:
- continued subpage support, preparation for implementing remaining
features like compression and defragmentation; with some
limitations, write is now enabled on 64K page systems with 4K
sectors, still considered experimental
- no readahead on compressed reads
- inline extents disabled
- disabled raid56 profile conversion and mount
- improved flushing logic, fixing early ENOSPC on some workloads
- inode flags have been internally split to read-only and read-write
incompat bit parts, used by fs-verity
- new tree items for fs-verity
- descriptor item
- Merkle tree item
- inode operations extended to be namespace-aware
- cleanups and refactoring
Generic code changes:
- fs: new export filemap_fdatawrite_wbc
- fs: removed sync_inode
- block: bio_trim argument type fixups
- vfs: add namespace-aware lookup"
* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reset replace target device to allocation state on close
btrfs: zoned: fix ordered extent boundary calculation
btrfs: do not do preemptive flushing if the majority is global rsv
btrfs: reduce the preemptive flushing threshold to 90%
btrfs: tree-log: check btrfs_lookup_data_extent return value
btrfs: avoid unnecessarily logging directories that had no changes
btrfs: allow idmapped mount
btrfs: handle ACLs on idmapped mounts
btrfs: allow idmapped INO_LOOKUP_USER ioctl
btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
btrfs: allow idmapped SNAP_DESTROY ioctls
btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
btrfs: check whether fsgid/fsuid are mapped during subvolume creation
btrfs: allow idmapped permission inode op
btrfs: allow idmapped setattr inode op
btrfs: allow idmapped tmpfile inode op
btrfs: allow idmapped symlink inode op
btrfs: allow idmapped mkdir inode op
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8fUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpio4D/9cGrHIbbZsuDIHzhaK2JIUrSG7G4GkcaG/
NAqbOp7KvF+1elMY08DWLT0nnFqHM7REHIS4Lv55KCNtktTFfdYmxso4lPrRu67o
MNbMJcEAglgIDw0xP4MfP/vZ0ftXJv8+OXSfL51pD4U40nWIZVpqn8WbWKRqjhGf
nQhiANbl2mO2Ec7I/UgAIqwczQnF5HveCkX5106dAppma8yEH+v2TkvZyZp/TCU3
h0ec26hLi+4QRBFm4O0yrVWj1gMS7yfHuEFSGw+jhp/WNTpH9A5pXFQjn7pIyJNi
uqrwM7knrod9ZH2pE1825w0TrbqkOdcZCo+/NvJHOAy03LUBJ/9qDc+JJUWsEmLZ
cpd8auaCfuAFx6ForHmKd+Pw1bANebWBMsClyQSh38+fsJ9myci3c3tkkzmO+dSW
G+rZZochiG4nFSl+CvlUoFfztuu8rdbOLKI/9usPMHNcDiY4yAAmz80B9uQdtQp7
tRLqegplsDODefLNvl0/Uj7WFJl6w5furchTXPmc+GSPFc+mpW08Olh7ScaCyD8c
a8YXaQi5hwuUR1N7uW65Df/HGMbIDvxOStcurIakP0mOSvRKrojZgQhbJ8zuCG4y
cRCwRUzvreNIoKK2ZxEvhLjhE5POaWgy6AtN/UI9k9BeVGQdboKVBGvub5Mv+ZKE
HpchbANk8Q==
=T7Zv
-----END PGP SIGNATURE-----
Merge tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block
Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe:
"This adds io_uring support for mkdirat, symlinkat, and linkat"
* tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block:
io_uring: add support for IORING_OP_LINKAT
io_uring: add support for IORING_OP_SYMLINKAT
io_uring: add support for IORING_OP_MKDIRAT
namei: update do_*() helpers to return ints
namei: make do_linkat() take struct filename
namei: add getname_uflags()
namei: make do_symlinkat() take struct filename
namei: make do_mknodat() take struct filename
namei: make do_mkdirat() take struct filename
namei: change filename_parentat() calling conventions
namei: ignore ERR/NULL names in putname()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8QQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgAgD/wP9gGxrFE5oxtdozDPkEYTXn5e0QKseDyV
cNxLmSb3wc4WIEPwjCavdQHpy0fnbjaYwGveHf9ygQwDZPj9WBgEL3ipPYXCCzFA
ysoV86kBRxKDI476r2InxI8WaW7hV0IWxPlScUTA1QeeNAzRJDymQvRuwg5KvVRS
Jt6R58khzWpEGYO2CqFTpGsA7x01R0kvZ54xmFgKZ+Pxo+Bk03fkO32YUFC49Wm8
Zy+JMsaiIlLgucDTJ4zAKjQUXiwP2GMEw5Vk/lLUFGBvyw0AN2rO9g18L7QW2ZUu
vnkaJQwBbMUbgveXlI/y6GG/vuKUG2i4AmzNJH17qFCnimO3JY6vgzUOg5dqOiwx
bx7ZzmnBWgQp95/cSAlZ4QwRYf3z0hvVFKPj9U3X9wKGmuxUKHiLResQwp7bzRdd
4L4Jo1WFDDHR/1MOOzzW0uxE3uTm0LKcncsi4hJL20dl+16RXCIbzHWUTAd8yyMV
9QeUAumc4GHOeswa1Ms8jLPAgXyEoAkec7ca7cRIY/NW+DXGLG9tYBgCw1eLe6BN
M7LwMsPNlS2v2dMUbiuw8XxkA+uYso728e2vd/edca2jxXj8+SVnm020aYBnxIzh
nmjbf69+QddBPEnk/EPvRj8tXOhr3k7FklI4R7qlei/+IGTujGPvM4kn3p6fnHrx
d7bsu/jtaQ==
=izfH
-----END PGP SIGNATURE-----
Merge tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block
Pull support for struct bio recycling from Jens Axboe:
"This adds bio recycling support for polled IO, allowing quick reuse of
a bio for high IOPS scenarios via a percpu bio_set list.
It's good for almost a 10% improvement in performance, bumping our
per-core IO limit from ~3.2M IOPS to ~3.5M IOPS"
* tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block:
bio: improve kerneldoc documentation for bio_alloc_kiocb()
block: provide bio_clear_hipri() helper
block: use the percpu bio cache in __blkdev_direct_IO
io_uring: enable use of bio alloc cache
block: clear BIO_PERCPU_CACHE flag if polling isn't supported
bio: add allocation cache abstraction
fs: add kiocb alloc cache flag
bio: optimize initialization of a bio
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmEo3WcTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFbnuD/9Vj3dnXRgZ9LHFuQHp5Vy5yaGfujwP
kIMN7bJiL0pCZ1zHapyn0cidZuTScDK2qMzGVR9Wrj/uYHEI+T08IEfaq/2CU3pS
PdygHgqQi+WtJj4dks1OphVdlF0+46B/zy2YY0LE39zFUDO38VUIgb/gTiQ8JHIr
BaDGtohlXmK8bDXo2XqMunSp7uQoRX92mv+bPjvIpsM60RnhPzhX0HeO/xhXO1PP
OpswQG3nOTX1alZOXuSKdH7e3zfO+pLnC5NYeo5R4rys5xRiShU19OSAJfiPwf2w
06OX7uLT7aF5LPQPjOP4MA9dHPWWeIuYnKzUpQZZ+D+zjuaxhF62saJG1Um0yQns
qqPZgTWhxzkJ16bj94T2UZW0bufFKM5cEfBPFw9ZyzQshX5jIE2V5ecV+F/AdtTj
EA55m/5N8NZMsNgnEOqn6o8TkWoV1seHKJXqja0+ZY9/Xs8GkmjHLhjNBCwGZIbw
8S64Szp+yex3m1mRS+L6T0Gudic5WdFxOKQAzvvJcmT6u+ZGBzU0FjNTNNdnu3tp
f9mANYxT1vNfYplRuNhMCp9oioSRpO2vfDQLCBRtsCy0QlKCum6qb7E/BcWRlwn1
SOAFmjeXpr+VXGX+Rjp/bzCdHlNOVbR4ODWJ+h5D7QZaQwQ6FafkEq1D2yjPR0OR
i722ECoM++vZ4w==
=CfcC
-----END PGP SIGNATURE-----
Merge tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"This starts with a couple of fixes for potential deadlocks in the
fowner/fasync handling.
The next patch removes the old mandatory locking code from the kernel
altogether.
The last patch cleans up rw_verify_area a bit more after the mandatory
locking removal"
* tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: clean up after mandatory file locking support removal
fs: remove mandatory file locking support
fcntl: fix potential deadlock for &fasync_struct.fa_lock
fcntl: fix potential deadlocks for &fown_struct.lock
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmTZcACgkQnJ2qBz9k
QNkkmAgArW6XoF1CePds/ZaC9vfg/nk66/zVo0n+J8xXjMWAPxcKbWFfV0uWVixq
yk4lcLV47a2Mu/B/1oLNd3vrSmhwU+srWqNwOFn1nv+lP/6wJqr8oztRHn/0L9Q3
ZSRrukSejbQ6AvTL/WzTNnCjjCc2ne3Kyko6W41aU6uyJuzhSM32wbx7qlV6t54Z
iint9OrB4gM0avLohNafTUq6I+tEGzBMNwpCG/tqCmkcvDcv3rTDVAnPSCTm0Tx2
hdrYDcY/rLxo93pDBaW1rYA/fohR+mIVye6k2TjkPAL6T1x+rxeT5qnc+YijH5yF
sFPDhlD+ZsfOLi8stWXLOJ+8+gLODg==
=pDBR
-----END PGP SIGNATURE-----
Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fs hole punching vs cache filling race fixes from Jan Kara:
"Fix races leading to possible data corruption or stale data exposure
in multiple filesystems when hole punching races with operations such
as readahead.
This is the series I was sending for the last merge window but with
your objection fixed - now filemap_fault() has been modified to take
invalidate_lock only when we need to create new page in the page cache
and / or bring it uptodate"
* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
filesystems/locking: fix Malformed table warning
cifs: Fix race between hole punch and page fault
ceph: Fix race between hole punch and page fault
fuse: Convert to using invalidate_lock
f2fs: Convert to using invalidate_lock
zonefs: Convert to using invalidate_lock
xfs: Convert double locking of MMAPLOCK to use VFS helpers
xfs: Convert to use invalidate_lock
xfs: Refactor xfs_isilocked()
ext2: Convert to using invalidate_lock
ext4: Convert to use mapping->invalidate_lock
mm: Add functions to lock invalidate_lock for two mappings
mm: Protect operations adding pages to page cache with invalidate_lock
documentation: Sync file_operations members with reality
mm: Fix comments mentioning i_mutex
In the reexport case, nfsd is currently passing along locks with the
reclaim bit set. The client sends a new lock request, which is granted
if there's currently no conflict--even if it's possible a conflicting
lock could have been briefly held in the interim.
We don't currently have any way to safely grant reclaim, so for now
let's just deny them all.
I'm doing this by passing the reclaim bit to nfs and letting it fail the
call, with the idea that eventually the client might be able to do
something more forgiving here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If this kiocb can safely use the polled bio allocation cache, then this
flag must be set. Generally this can be set for polled IO, where we will
not see IRQ completions of the request.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a couple of places where we already open-code the (flags &
AT_EMPTY_PATH) check and io_uring will likely add another one in the
future. Let's just add a simple helper getname_uflags() that handles
this directly and use it.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that all users of sync_inode() have been deleted, remove
sync_inode().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs sometimes needs to flush dirty pages on a bunch of dirty inodes in
order to reclaim metadata reservations. Unfortunately most helpers in
this area are too smart for us:
1) The normal filemap_fdata* helpers only take range and sync modes, and
don't give any indication of how much was written, so we can only
flush full inodes, which isn't what we want in most cases.
2) The normal writeback path requires us to have the s_umount sem held,
but we can't unconditionally take it in this path because we could
deadlock.
3) The normal writeback path also skips inodes with I_SYNC set if we
write with WB_SYNC_NONE. This isn't the behavior we want under heavy
ENOSPC pressure, we want to actually make sure the pages are under
writeback before returning, and if another thread is in the middle of
writing the file we may return before they're under writeback and
miss our ordered extents and not properly wait for completion.
4) sync_inode() uses the normal writeback path and has the same problem
as #3.
What we really want is to call do_writepages() with our wbc. This way
we can make sure that writeback is actually started on the pages, and we
can control how many pages are written as a whole as we write many
inodes using the same wbc. Accomplish this with a new helper that does
just that so we can use it for our ENOSPC flushing infrastructure.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.
I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.
This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Just output the '\0' separate list of supported file systems for block
devices directly rather than going through a pointless round of string
manipulation.
Based on an earlier patch from Al Viro <viro@zeniv.linux.org.uk>.
Vivek:
Modified list_bdev_fs_names() and split_fs_names() to return number of
null terminted strings to caller. Callers now use that information to
loop through all the strings instead of relying on one extra null char
being present at the end.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Overlayfs does not cache ACL's (to avoid double caching). Instead it just
calls the underlying filesystem's i_op->get_acl(), which will return the
cached value, if possible.
In rcu path walk, however, get_cached_acl_rcu() is employed to get the
value from the cache, which will fail on overlayfs resulting in dropping
out of rcu walk mode. This can result in a big performance hit in certain
situations.
Fix by calling ->get_acl() with rcu=true in case of ACL_DONT_CACHE (which
indicates pass-through)
Reported-by: garyhuang <zjh.20052005@163.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The immutable and append-only properties on an inode are published on
the inode's i_flags and enforced by the VFS.
Create a helper to fill the corresponding STATX_ATTR_ flags in the kstat
structure from the inode's i_flags.
Only orange was converted to use this helper.
Other filesystems could use it in the future.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Move the permission checks in notify_change into a separate function to
make them available to filesystems.
When notify_change is called, the vfs performs those checks before
calling into iop->setattr. However, a filesystem like gfs2 can only
lock and revalidate the inode inside ->setattr, and it must then repeat
those checks to err on the safe side.
It would be nice to get rid of the double checking, but moving the
permission check into iop->setattr altogether isn't really an option.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename s_fsnotify_inode_refs to s_fsnotify_connectors and count all
objects with attached connectors, not only inodes with attached
connectors.
This will be used to optimize fsnotify() calls on sb without any
type of marks.
Link: https://lore.kernel.org/r/20210810151220.285179-4-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Matthew Bobrowski <repnop@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Some operations such as reflinking blocks among files will need to lock
invalidate_lock for two mappings. Add helper functions to do that.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, serializing operations such as page fault, read, or readahead
against hole punching is rather difficult. The basic race scheme is
like:
fallocate(FALLOC_FL_PUNCH_HOLE) read / fault / ..
truncate_inode_pages_range()
<create pages in page
cache here>
<update fs block mapping and free blocks>
Now the problem is in this way read / page fault / readahead can
instantiate pages in page cache with potentially stale data (if blocks
get quickly reused). Avoiding this race is not simple - page locks do
not work because we want to make sure there are *no* pages in given
range. inode->i_rwsem does not work because page fault happens under
mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
the performance for mixed read-write workloads suffer.
So create a new rw_semaphore in the address_space - invalidate_lock -
that protects adding of pages to page cache for page faults / reads /
readahead.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Here is the big set of char / misc and other driver subsystem updates
for 5.14-rc1. Included in here are:
- habanna driver updates
- fsl-mc driver updates
- comedi driver updates
- fpga driver updates
- extcon driver updates
- interconnect driver updates
- mei driver updates
- nvmem driver updates
- phy driver updates
- pnp driver updates
- soundwire driver updates
- lots of other tiny driver updates for char and misc drivers
This is looking more and more like the "various driver subsystems mushed
together" tree...
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYOM8jQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymECgCg0yL+8WxDKO5Gg5llM5PshvLB1rQAn0y5pDgg
nw78LV3HQ0U7qaZBtI91
=x+AR
-----END PGP SIGNATURE-----
Merge tag 'char-misc-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / misc driver updates from Greg KH:
"Here is the big set of char / misc and other driver subsystem updates
for 5.14-rc1. Included in here are:
- habanalabs driver updates
- fsl-mc driver updates
- comedi driver updates
- fpga driver updates
- extcon driver updates
- interconnect driver updates
- mei driver updates
- nvmem driver updates
- phy driver updates
- pnp driver updates
- soundwire driver updates
- lots of other tiny driver updates for char and misc drivers
This is looking more and more like the "various driver subsystems
mushed together" tree...
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (292 commits)
mcb: Use DEFINE_RES_MEM() helper macro and fix the end address
PNP: moved EXPORT_SYMBOL so that it immediately followed its function/variable
bus: mhi: pci-generic: Add missing 'pci_disable_pcie_error_reporting()' calls
bus: mhi: Wait for M2 state during system resume
bus: mhi: core: Fix power down latency
intel_th: Wait until port is in reset before programming it
intel_th: msu: Make contiguous buffers uncached
intel_th: Remove an unused exit point from intel_th_remove()
stm class: Spelling fix
nitro_enclaves: Set Bus Master for the NE PCI device
misc: ibmasm: Modify matricies to matrices
misc: vmw_vmci: return the correct errno code
siox: Simplify error handling via dev_err_probe()
fpga: machxo2-spi: Address warning about unused variable
lkdtm/heap: Add init_on_alloc tests
selftests/lkdtm: Enable various testable CONFIGs
lkdtm: Add CONFIG hints in errors where possible
lkdtm: Enable DOUBLE_FAULT on all architectures
lkdtm/heap: Add vmalloc linear overflow test
lkdtm/bugs: XFAIL UNALIGNED_LOAD_STORE_WRITE
...
Pull vfs name lookup updates from Al Viro:
"Small namei.c patch series, mostly to simplify the rules for nameidata
state. It's actually from the previous cycle - but I didn't post it
for review in time...
Changes visible outside of fs/namei.c: file_open_root() calling
conventions change, some freed bits in LOOKUP_... space"
* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
namei: make sure nd->depth is always valid
teach set_nameidata() to handle setting the root as well
take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
switch file_open_root() to struct path
ext4 in 5.14:
- Allow applications to poll on changes to /sys/fs/ext4/*/errors_count
- Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be
checkpointed, truncated and discarded or zero'ed.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmDcjRgACgkQ8vlZVpUN
gaMAMQgAjRYUQ+tdJVZzInFwukudhgLyuCP9AdCx76fisaH22yNCakQ7M2XGz59i
/YbJerLaueYpHZzpA9p5+sSjVhMwILO3scBSJbOwdsbrFAsFLzcgQKQhGGqK2KvX
IAOEArC8/hm1wnVb7sfQYdBHlWyeJpI8hd/8WZPlYtySlRnP1TZCd+X7y7lmNs1H
QU1KECwstI2t8Lug0QeKx2B9PI9AWcCs0lTJ4LfcANZAh3HIJi9aUCk4SFDRkf3/
8AazvMqTHJD9yc+BNyZOro2ykDFCStkNqf0cDYTzvKrr66CHScPUtyI0oAEdspxN
+SNNARPGZgNOuR3ZRbGivtwgEB+GpQ==
=jSd4
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"In addition to bug fixes and cleanups, there are two new features for
ext4 in 5.14:
- Allow applications to poll on changes to
/sys/fs/ext4/*/errors_count
- Add the ioctl EXT4_IOC_CHECKPOINT which allows the journal to be
checkpointed, truncated and discarded or zero'ed"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
jbd2: export jbd2_journal_[un]register_shrinker()
ext4: notify sysfs on errors_count value change
fs: remove bdev_try_to_free_page callback
ext4: remove bdev_try_to_free_page() callback
jbd2: simplify journal_clean_one_cp_list()
jbd2,ext4: add a shrinker to release checkpointed buffers
jbd2: remove redundant buffer io error checks
jbd2: don't abort the journal when freeing buffers
jbd2: ensure abort the journal if detect IO error when writing original buffer back
jbd2: remove the out label in __jbd2_journal_remove_checkpoint()
ext4: no need to verify new add extent block
jbd2: clean up misleading comments for jbd2_fc_release_bufs
ext4: add check to prevent attempting to resize an fs with sparse_super2
ext4: consolidate checks for resize of bigalloc into ext4_resize_begin
ext4: remove duplicate definition of ext4_xattr_ibody_inline_set()
ext4: fsmap: fix the block/inode bitmap comment
ext4: fix comment for s_hash_unsigned
ext4: use local variable ei instead of EXT4_I() macro
ext4: fix avefreec in find_group_orlov
ext4: correct the cache_nr in tracepoint ext4_es_shrink_exit
...
Use __set_page_dirty_no_writeback() instead. This will set the dirty bit
on the page, which will be used to avoid calling set_page_dirty() in the
future. It will have no effect on actually writing the page back, as the
pages are not on any LRU lists.
[akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules]
Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the ramfs aops to libfs and reuse them for kernfs and configfs.
Thosw two did not wire up ->set_page_dirty before and now get
__set_page_dirty_no_writeback, which is the right one for no-writeback
address_space usage.
Drop the now unused exports of the libfs helpers only used for ramfs-style
pagecache usage.
Link: https://lkml.kernel.org/r/20210614061512.3966143-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After remove the unique user of sop->bdev_try_to_free_page() callback,
we could remove the callback and the corresponding blkdev_releasepage()
at all.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210610112440.3438139-9-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Patch series "drivers/char: remove /dev/kmem for good".
Exploring /dev/kmem and /dev/mem in the context of memory hot(un)plug and
memory ballooning, I started questioning the existence of /dev/kmem.
Comparing it with the /proc/kcore implementation, it does not seem to be
able to deal with things like
a) Pages unmapped from the direct mapping (e.g., to be used by secretmem)
-> kern_addr_valid(). virt_addr_valid() is not sufficient.
b) Special cases like gart aperture memory that is not to be touched
-> mem_pfn_is_ram()
Unless I am missing something, it's at least broken in some cases and might
fault/crash the machine.
Looks like its existence has been questioned before in 2005 and 2010 [1],
after ~11 additional years, it might make sense to revive the discussion.
CONFIG_DEVKMEM is only enabled in a single defconfig (on purpose or by
mistake?). All distributions disable it: in Ubuntu it has been disabled
for more than 10 years, in Debian since 2.6.31, in Fedora at least
starting with FC3, in RHEL starting with RHEL4, in SUSE starting from
15sp2, and OpenSUSE has it disabled as well.
1) /dev/kmem was popular for rootkits [2] before it got disabled
basically everywhere. Ubuntu documents [3] "There is no modern user of
/dev/kmem any more beyond attackers using it to load kernel rootkits.".
RHEL documents in a BZ [5] "it served no practical purpose other than to
serve as a potential security problem or to enable binary module drivers
to access structures/functions they shouldn't be touching"
2) /proc/kcore is a decent interface to have a controlled way to read
kernel memory for debugging puposes. (will need some extensions to
deal with memory offlining/unplug, memory ballooning, and poisoned
pages, though)
3) It might be useful for corner case debugging [1]. KDB/KGDB might be a
better fit, especially, to write random memory; harder to shoot
yourself into the foot.
4) "Kernel Memory Editor" [4] hasn't seen any updates since 2000 and seems
to be incompatible with 64bit [1]. For educational purposes,
/proc/kcore might be used to monitor value updates -- or older
kernels can be used.
5) It's broken on arm64, and therefore, completely disabled there.
Looks like it's essentially unused and has been replaced by better
suited interfaces for individual tasks (/proc/kcore, KDB/KGDB). Let's
just remove it.
[1] https://lwn.net/Articles/147901/
[2] https://www.linuxjournal.com/article/10505
[3] https://wiki.ubuntu.com/Security/Features#A.2Fdev.2Fkmem_disabled
[4] https://sourceforge.net/projects/kme/
[5] https://bugzilla.redhat.com/show_bug.cgi?id=154796
Link: https://lkml.kernel.org/r/20210324102351.6932-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210324102351.6932-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Alexander A. Klimov" <grandmaster@al2klimov.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Corentin Labbe <clabbe@baylibre.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Gregory Clement <gregory.clement@bootlin.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hillf Danton <hdanton@sina.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: James Troup <james.troup@canonical.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kairui Song <kasong@redhat.com>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: openrisc@lists.librecores.org
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Pavel Machek (CIP)" <pavel@denx.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: sparclinux@vger.kernel.org
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Theodore Dubois <tblodt@icloud.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: William Cohen <wcohen@redhat.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We no longer track anything in nrexceptional, so remove it, saving 8 bytes
per inode.
Link: https://lkml.kernel.org/r/20201026151849.24232-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3.
An internal workload complained because it was using too much CPU, and
when I took a look, we had a lot of io_uring workers going to town.
For an async buffered read like workload, I am normally expecting _zero_
offloads to a worker thread, but this one had tons of them. I'd drop
caches and things would look good again, but then a minute later we'd
regress back to using workers. Turns out that every minute something
was reading parts of the device, which would add page cache for that
inode. I put patches like these in for our kernel, and the problem was
solved.
Don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache
entries for the given range. This causes unnecessary work from the
callers side, when the IO could have been issued totally fine without
blocking on writeback when there is none.
This patch (of 3):
For O_DIRECT reads/writes, we check if we need to issue a call to
filemap_write_and_wait_range() to issue and/or wait for writeback for any
page in the given range. The existing mechanism just checks for a page in
the range, which is suboptimal for IOCB_NOWAIT as we'll fallback to the
slow path (and needing retry) if there's just a clean page cache page in
the range.
Provide filemap_range_needs_writeback() which tries a little harder to
check if we actually need to issue and/or wait for writeback in the range.
Link: https://lkml.kernel.org/r/20210224164455.1096727-1-axboe@kernel.dk
Link: https://lkml.kernel.org/r/20210224164455.1096727-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCHPZwACgkQ+7dXa6fL
C2uJxw/9FVNssHxtA8iFDvZskE4YHiL6vMgOgKOeVmBfUvxqJcxWQXcF8ycbon5y
jGcDRV1DWTv395ckALHqmD6SlH/5q+OBt4cCOXCebOlzbC63JmjJ6xOjHntZKw3i
9c3GITNca5AsPXHXHGIcoRY4/4FntpLoVpyfYJ4ZZJCY7a7QUbgnEIIy9/Ps8Clw
BahhiKChl2JCgV3KZBk/ypkf0IBduxKgT+IUxA9o7H5UsLzvUgnfd5uMIALLPMI1
NXzUHBJoUtnWcB52nWPufJx9YwkMfSx70mutT0T74CFxbJakwRgAl2tWr5g989qM
/fQrsOhMlU3NaXYaRPelbxkuzvy3hU1xSe3GLiZcxmh4Cb/YAX0TrHRecO62NWff
pu/UWQS8Du5Gy8DrHScuo8baI1KFfyiV2lWQPfBO8kPaEB2ERw+PN6fWSh993Cn9
4UHaR3Oyn4qyVXeirNZg+frado+BEZAbNMZwn0lyi6jnLeyir6qABOdpQk34SB35
D4jfdPOBxeh3OVFkc+EBJ98i3/nal2+yXrNOqkP4OwmF0HqGt0YKKSaLNigXaDdO
3CKmQlBqBZsUdRYHJyJsofrifkKjP78zx2WyUJPms8MGX9z+9kYR3f1erifLesCT
Kb2TrAFx4ZgqS5tFh6UHnX4x0qy2RckgNrKTMpv38K8lNqplvLo=
=tZgy
-----END PGP SIGNATURE-----
Merge tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull network filesystem helper library updates from David Howells:
"Here's a set of patches for 5.13 to begin the process of overhauling
the local caching API for network filesystems. This set consists of
two parts:
(1) Add a helper library to handle the new VM readahead interface.
This is intended to be used unconditionally by the filesystem
(whether or not caching is enabled) and provides a common
framework for doing caching, transparent huge pages and, in the
future, possibly fscrypt and read bandwidth maximisation. It also
allows the netfs and the cache to align, expand and slice up a
read request from the VM in various ways; the netfs need only
provide a function to read a stretch of data to the pagecache and
the helper takes care of the rest.
(2) Add an alternative fscache/cachfiles I/O API that uses the kiocb
facility to do async DIO to transfer data to/from the netfs's
pages, rather than using readpage with wait queue snooping on one
side and vfs_write() on the other. It also uses less memory, since
it doesn't do buffered I/O on the backing file.
Note that this uses SEEK_HOLE/SEEK_DATA to locate the data
available to be read from the cache. Whilst this is an improvement
from the bmap interface, it still has a problem with regard to a
modern extent-based filesystem inserting or removing bridging
blocks of zeros. Fixing that requires a much greater overhaul.
This is a step towards overhauling the fscache API. The change is
opt-in on the part of the network filesystem. A netfs should not try
to mix the old and the new API because of conflicting ways of handling
pages and the PG_fscache page flag and because it would be mixing DIO
with buffered I/O. Further, the helper library can't be used with the
old API.
This does not change any of the fscache cookie handling APIs or the
way invalidation is done at this time.
In the near term, I intend to deprecate and remove the old I/O API
(fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(),
fscache_write_page() and fscache_uncache_page()) and eventually
replace most of fscache/cachefiles with something simpler and easier
to follow.
This patchset contains the following parts:
- Some helper patches, including provision of an ITER_XARRAY iov
iterator and a function to do readahead expansion.
- Patches to add the netfs helper library.
- A patch to add the fscache/cachefiles kiocb API.
- A pair of patches to fix some review issues in the ITER_XARRAY and
read helpers as spotted by Al and Willy.
Jeff Layton has patches to add support in Ceph for this that he
intends for this merge window. I have a set of patches to support AFS
that I will post a separate pull request for.
With this, AFS without a cache passes all expected xfstests; with a
cache, there's an extra failure, but that's also there before these
patches. Fixing that probably requires a greater overhaul. Ceph also
passes the expected tests.
I also have patches in a separate branch to tidy up the handling of
PG_fscache/PG_private_2 and their contribution to page refcounting in
the core kernel here, but I haven't included them in this set and will
route them separately"
Link: https://lore.kernel.org/lkml/3779937.1619478404@warthog.procyon.org.uk/
* tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
netfs: Miscellaneous fixes
iov_iter: Four fixes for ITER_XARRAY
fscache, cachefiles: Add alternate API to use kiocb for read/write to cache
netfs: Add a tracepoint to log failures that would be otherwise unseen
netfs: Define an interface to talk to a cache
netfs: Add write_begin helper
netfs: Gather stats
netfs: Add tracepoints
netfs: Provide readahead and readpage netfs helpers
netfs, mm: Add set/end/wait_on_page_fscache() aliases
netfs, mm: Move PG_fscache helper funcs to linux/netfs.h
netfs: Documentation for helper library
netfs: Make a netfs helper module
mm: Implement readahead_control pageset expansion
mm/readahead: Handle ractl nr_pages being modified
fs: Document file_ra_state
mm/filemap: Pass the file_ra_state in the ractl
mm: Add set/end/wait functions for PG_private_2
iov_iter: Add ITER_XARRAY
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYIfiiwAKCRCRxhvAZXjc
ogtMAQC+MtgJZdcH5iDHNEyI36JaWUccKRV7PdvfF1YgnXO45gD+IYxR1c/EQQyD
kh2AmqhET6jVhe9Nsob5yxduksI+ygo=
=oh/d
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.helpers.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull fs mapping helper updates from Christian Brauner:
"This adds kernel-doc to all new idmapping helpers and improves their
naming which was triggered by a discussion with some fs developers.
Some of the names are based on suggestions by Vivek and Al.
Also remove the open-coded permission checking in a few places with
simple helpers. Overall this should lead to more clarity and make it
easier to maintain"
* tag 'fs.idmapped.helpers.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: introduce two inode i_{u,g}id initialization helpers
fs: introduce fsuidgid_has_mapping() helper
fs: document and rename fsid helpers
fs: document mapping helpers
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYIfiFwAKCRCRxhvAZXjc
oswFAP4sL0oA7mBGDzoxktIMWKY+f7KKDjb9gXc8fDQV9bbcNwD6A9QPJCahfab9
cndByav/xcB/7n/NXLecNYr8NcfTgg8=
=mdyh
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.docs.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull fs helper kernel-doc updates from Christian Brauner:
"In the last cycles we forgot to update the kernel-docs in some places
that were changed during the idmapped mount work. Lukas and Randy took
the chance to not just fixup those places but also fixup and expand
kernel-docs for some additional helpers.
No functional changes"
* tag 'fs.idmapped.docs.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: update kernel-doc for vfs_rename()
fs: turn some comments into kernel-doc
xattr: fix kernel-doc for mnt_userns and vfs xattr helpers
namei: fix kernel-doc for struct renamedata and more
libfs: fix kernel-doc for mnt_userns
Pull fileattr conversion updates from Miklos Szeredi via Al Viro:
"This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a
separate method.
The interface is reasonably uniform across the filesystems that
support it and gives nice boilerplate removal"
* 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits)
ovl: remove unneeded ioctls
fuse: convert to fileattr
fuse: add internal open/release helpers
fuse: unsigned open flags
fuse: move ioctl to separate source file
vfs: remove unused ioctl helpers
ubifs: convert to fileattr
reiserfs: convert to fileattr
ocfs2: convert to fileattr
nilfs2: convert to fileattr
jfs: convert to fileattr
hfsplus: convert to fileattr
efivars: convert to fileattr
xfs: convert to fileattr
orangefs: convert to fileattr
gfs2: convert to fileattr
f2fs: convert to fileattr
ext4: convert to fileattr
ext2: convert to fileattr
btrfs: convert to fileattr
...
Remove vfs_ioc_setflags_prepare(), vfs_ioc_fssetxattr_check() and
simple_fill_fsxattr(), which are no longer used.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
There's a substantial amount of boilerplate in filesystems handling
FS_IOC_[GS]ETFLAGS/ FS_IOC_FS[GS]ETXATTR ioctls.
Also due to userspace buffers being involved in the ioctl API this is
difficult to stack, as shown by overlayfs issues related to these ioctls.
Introduce a new internal API named "fileattr" (fsxattr can be confused with
xattr, xflags is inappropriate, since this is more than just flags).
There's significant overlap between flags and xflags and this API handles
the conversions automatically, so filesystems may choose which one to use.
In ->fileattr_get() a hint is provided to the filesystem whether flags or
xattr are being requested by userspace, but in this series this hint is
ignored by all filesystems, since generating all the attributes is cheap.
If a filesystem doesn't implemement the fileattr API, just fall back to
f_op->ioctl(). When all filesystems are converted, the fallback can be
removed.
32bit compat ioctls are now handled by the generic code as well.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 9fe6145097 ("namei: introduce struct renamedata") introduces a
new struct for vfs_rename() and makes the vfs_rename() kernel-doc argument
description out of sync.
Move the description of arguments for vfs_rename() to a new kernel-doc for
the struct renamedata to make these descriptions checkable against the
actual implementation.
Link: https://lore.kernel.org/r/20210204180059.28360-3-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
While reviewing ./include/linux/fs.h, I noticed that three comments can
actually be turned into kernel-doc comments. This allows to check the
consistency between the descriptions and the functions' signatures in
case they may change in the future.
A quick validation with the consistency check:
./scripts/kernel-doc -none include/linux/fs.h
currently reports no issues in this file.
Link: https://lore.kernel.org/r/20210204180059.28360-2-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Give filesystem two little helpers that do the right thing when
initializing the i_uid and i_gid fields on idmapped and non-idmapped
mounts. Filesystems shouldn't have to be concerned with too many
details.
Link: https://lore.kernel.org/r/20210320122623.599086-5-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Don't open-code the checks and instead move them into a clean little
helper we can call. This also reduces the risk that if we ever change
something we forget to change all locations.
Link: https://lore.kernel.org/r/20210320122623.599086-4-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Vivek pointed out that the fs{g,u}id_into_mnt() naming scheme can be
misleading as it could be understood as implying they do the exact same
thing as i_{g,u}id_into_mnt(). The original motivation for this naming
scheme was to signal to callers that the helpers will always take care
to map the k{g,u}id such that the ownership is expressed in terms of the
mnt_users.
Get rid of the confusion by renaming those helpers to something more
sensible. Al suggested mapped_fs{g,u}id() which seems a really good fit.
Usually filesystems don't need to bother with these helpers directly
only in some cases where they allocate objects that carry {g,u}ids which
are either filesystem specific (e.g. xfs quota objects) or don't have a
clean set of helpers as inodes have.
Link: https://lore.kernel.org/r/20210320122623.599086-3-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>