The value of "end" should be "start + length - 1".
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Use struct initializers to ensure that the xfs_btree_irecs passed into
the query_range function are completely initialized. No functional
changes, just closing some sloppy hygiene.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Improve the validation of the fsmap offset fields in the query keys and
move the validation to the top of the function now that we have pushed
the low key adjustment code downwards.
Also fix some indenting issues that aren't worth a separate patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The external log device fsmap backend doesn't have an rmapbt to query,
so it's wasteful to spend time initializing the rmap_irec objects.
Worse yet, the log could (someday) be longer than 2^32 fsblocks, so
using the rmap irec structure will result in integer overflows.
Fix this mess by computing the start address that we want from keys[0]
directly, and use the daddr-based record filtering algorithm that we
also use for rtbitmap queries.
Fixes: e89c041338 ("xfs: implement the GETFSMAP ioctl")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The rtbitmap fsmap backend doesn't query the rmapbt, so it's wasteful to
spend time initializing the rmap_irec objects. Worse yet, the logic to
query the rtbitmap is spread across three separate functions, which is
unnecessarily difficult to follow.
Compute the start rtextent that we want from keys[0] directly and
combine the functions to avoid passing parameters around everywhere, and
consolidate all the logic into a single function. At one point many
years ago I intended to use __xfs_getfsmap_rtdev as the launching point
for realtime rmapbt queries, but this hasn't been the case for a long
time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The realtime section ends at the last rt extent. If the user configures
the rt geometry with an extent size that is not an integer factor of the
number of rt blocks, it's possible for there to be rt blocks past the
end of the last rt extent. These tail blocks cannot ever be allocated
and will cause corruption reports if the last extent coincides with the
end of an rt bitmap block, so do not report consider them for the
GETFSMAP output.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
It's not correct to use the rmap irec structure to hold query key
information to query the rtbitmap because the realtime volume can be
longer than 2^32 fsblocks in length. Because the rt volume doesn't have
allocation groups, introduce a daddr-based record filtering algorithm
and compute the rtextent values using 64-bit variables. The same
problem exists in the external log device fsmap implementation, so use
the same solution to fix it too.
After this patch, all the code that touches info->low and info->high
under xfs_getfsmap_logdev and __xfs_getfsmap_rtdev are unnecessary.
Cleaning this up will be done in subsequent patches.
Fixes: 4c934c7dd6 ("xfs: report realtime space information via the rtbitmap")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
I noticed a bug in ranged GETFSMAP queries:
# xfs_io -c 'fsmap -vvvv' /opt
EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL
0: 8:80 [0..7]: static fs metadata 0 (0..7) 8
<snip>
9: 8:80 [192..223]: 137 0..31 0 (192..223) 32
# xfs_io -c 'fsmap -vvvv -d 208 208' /opt
#
That's not right -- we asked what block maps block 208, and we should've
received a mapping for inode 137 offset 16. Instead, we get nothing.
The root cause of this problem is a mis-interaction between the fsmap
code and how btree ranged queries work. xfs_btree_query_range returns
any btree record that overlaps with the query interval, even if the
record starts before or ends after the interval. Similarly, GETFSMAP is
supposed to return a recordset containing all records that overlap the
range queried.
However, it's possible that the recordset is larger than the buffer that
the caller provided to convey mappings to userspace. In /that/ case,
userspace is supposed to copy the last record returned to fmh_keys[0]
and call GETFSMAP again. In this case, we do not want to return
mappings that we have already supplied to the caller. The call to
xfs_btree_query_range is the same, but now we ignore any records that
start before fmh_keys[0].
Unfortunately, we didn't implement the filtering predicate correctly.
The predicate should only be called when we're calling back for more
records. Accomplish this by setting info->low.rm_blockcount to a
nonzero value and ensuring that it is cleared as necessary. As a
result, we no longer want to adjust dkeys[0] in the main setup function
because that's confusing.
This patch doesn't touch the logdev/rtbitmap backends because they have
bigger problems that will be addressed by subsequent patches.
Found via xfs/556 with parent pointers enabled.
Fixes: e89c041338 ("xfs: implement the GETFSMAP ioctl")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The use of file_free_rcu() in init_file() to free the struct that was
allocated by the caller was hacky and we got what we deserved.
Let init_file() and its callers take care of cleaning up each after
their own allocated resources on error.
Fixes: 62d53c4a1d ("fs: use backing_file container for internal files with "fake" f_path") # mainline only
Reported-and-tested-by: syzbot+ada42aab05cf51b00e98@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230701171134.239409-1-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Highlights include:
Stable fixes and other bugfixes:
- nfs: don't report STATX_BTIME in ->getattr
- Revert "NFSv4: Retry LOCK on OLD_STATEID during delegation return"
since it breaks NFSv4 state recovery.
- NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
- Fix the NFSv4.2 xattr cache shrinker_id
- Force a ctime update after a NFSv4.2 SETXATTR call
Features and cleanups:
- NFS and RPC over TLS client code from Chuck Lever.
- Support for use of abstract unix socket addresses with the rpcbind
daemon.
- Sysfs API to allow shutdown of the kernel RPC client and prevent
umount() hangs if the server is known to be permanently down.
- XDR cleanups from Anna.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmSgmmUACgkQZwvnipYK
APJwUA/+J6uEjJFoigSDU5dpCwQr4pHZPgUn3T2heplcyalGMxLo1VjDTVuFXb+a
NZqdUZF2ePmYqss/UYzJC7R6/z9OanVBcpiGqp66foJt9ncs9BSm5AzdV5Gvi4VX
6SrBM98nSqvD47l45LQ90bqIdR6WgMP9OiDC257PzYnaMZJcB0xObD4HWXh1zbIz
3xynJTSQnRGbv9I5EjJJGVIHDWLfSKY61NUXjrUcmMZ2L39ITNy0CRi8sIdj3oY/
A2Iz52IHtAhE77+EetThPskbTLa07raQSWRo3X6XJqCKiJIXa5giNDoG/zLq6sOT
hi1AV7Tdxaed2EYibeRWzsSVQIClBb7T/hdro5dWs5u/bxM6Bt+yY90ZWUMZVOAQ
/kGTYQXhI31vUgRaEN+2xci0wKDy9wqyAWcD8u8Gz01KaK09sfJSIvvYn+srSeaz
wEUQHZCdBGtNFVP2q18q4x8BN27uObh1DdMvNhrxrA7YraXSQvL/rIIsD0jmDInb
6olMm9g9nZSHgq62+CYs2v7J/AJKQzE7PsWrTMJDX1rso+/Lyc6x7oUGxv2IFt5H
VZVZNdstKeNzfcnNKsGG2ZbufhasKHqiHJxJTdNOuOi0YBi+ixtJVRpupId3+6aZ
ysng0IfzqiWSuiq5Axjreva+480IDSMW+7cqcw5urKEfYY5uVcc=
=leGh
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Stable fixes and other bugfixes:
- nfs: don't report STATX_BTIME in ->getattr
- Revert 'NFSv4: Retry LOCK on OLD_STATEID during delegation return'
since it breaks NFSv4 state recovery.
- NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
- Fix the NFSv4.2 xattr cache shrinker_id
- Force a ctime update after a NFSv4.2 SETXATTR call
Features and cleanups:
- NFS and RPC over TLS client code from Chuck Lever
- Support for use of abstract unix socket addresses with the rpcbind
daemon
- Sysfs API to allow shutdown of the kernel RPC client and prevent
umount() hangs if the server is known to be permanently down
- XDR cleanups from Anna"
* tag 'nfs-for-6.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (33 commits)
Revert "NFSv4: Retry LOCK on OLD_STATEID during delegation return"
NFS: Don't cleanup sysfs superblock entry if uninitialized
nfs: don't report STATX_BTIME in ->getattr
NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
NFSv4.2: fix wrong shrinker_id
NFSv4: Clean up some shutdown loops
NFS: Cancel all existing RPC tasks when shutdown
NFS: add sysfs shutdown knob
NFS: add a sysfs link to the acl rpc_client
NFS: add a sysfs link to the lockd rpc_client
NFS: Add sysfs links to sunrpc clients for nfs_clients
NFS: add superblock sysfs entries
NFS: Make all of /sys/fs/nfs network-namespace unique
NFS: Open-code the nfs_kset kset_create_and_add()
NFS: rename nfs_client_kobj to nfs_net_kobj
NFS: rename nfs_client_kset to nfs_kset
NFS: Add an "xprtsec=" NFS mount option
NFS: Have struct nfs_client carry a TLS policy field
SUNRPC: Add a TCP-with-TLS RPC transport class
SUNRPC: Capture CMSG metadata on client-side receive
...
- DAX fixes and cleanups including a use after free, extra references,
and device unregistration, and a redundant variable.
- Allow the DAX fault handler to return VM_FAULT_HWPOISON
- A few libnvdimm cleanups such as making some functions and variables
static where sufficient.
- Add a few missing prototypes for wrapped functions in
tools/testing/nvdimm
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQT9vPEBxh63bwxRYEEPzq5USduLdgUCZJ6AdAAKCRAPzq5USduL
dtGnAP9uh+DxVKLnp/Q0977pLZKYVHYU32C/pG3hFnjS5tAp6QEAke/uF+wxcTGr
EZdnDJuTGt2sAMQsQ34NdDJUzwqQEgw=
=7l6z
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull nvdimm and DAX updates from Vishal Verma:
"This is mostly small cleanups and fixes, with the biggest change being
the change to the DAX fault handler allowing it to return
VM_FAULT_HWPOISON.
Summary:
- DAX fixes and cleanups including a use after free, extra
references, and device unregistration, and a redundant variable.
- Allow the DAX fault handler to return VM_FAULT_HWPOISON
- A few libnvdimm cleanups such as making some functions and
variables static where sufficient.
- Add a few missing prototypes for wrapped functions in
tools/testing/nvdimm"
* tag 'libnvdimm-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: enable dax fault handler to report VM_FAULT_HWPOISON
nvdimm: make security_show static
nvdimm: make nd_class variable static
dax/kmem: Pass valid argument to memory_group_register_static
fsdax: remove redundant variable 'error'
dax: Cleanup extra dax_region references
dax: Introduce alloc_dev_dax_id()
dax: Use device_unregister() in unregister_dax_mapping()
dax: Fix dax_mapping_release() use after free
tools/testing/nvdimm: Drop empty platform remove function
libnvdimm: mark 'security_show' static again
testing: nvdimm: add missing prototypes for wrapped functions
dax: fix missing-prototype warnings
Just one minor nit I forgot to merge.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmSfai0SHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoin1jQP/jeK0UOLWxz+rIouzO9gLe/HWiF5Kzez
PJehDwpKkFsiDL5F2NX/LQ9tHI1LcO9/GKMQnU0CJo2u7I72MaKjBdWr7a/vVQyg
yJ33MCfDkRtUVXqxlNsbo0/oyteXUZF2dhAnpaRGfVzeP4IZqSq+6QCxjjGBzwX3
FfPaLug8hxs+Gf+CkeCTxOzxv/iAiYxxQQe8GvRIXYg+/fTcvW1+VCavfN9a5M7c
hORaQLp2o4gkeBGvx6nU8ai8NWmL+xWxE7degS1mgn8fUok4bWG3DDkAWzUnaSEp
31vIbtrTwEfe7OVNbKHSDXbJ6kNTrRe9QTao5htkuHw5BPj6TlCkW98FNht9MmkJ
WGgimrsd60Mbm9TmCbaBBbbN8GTFn8WRRs8k1n0yXkXjVYgrLsxtmBrt+SCSEt3A
ELRrXLlYsRAVbxmbE8w6C2JYlBsseeBfzoGUXV5nofHjl+rNU1/kcI9Vep709o3o
dixxnbHuutotsWcgu1+FX+oaEOaf76sjiegTRnK/fPa9cmvrQhghQQm2EJN60v8S
xif7rj/3h3TuFy6Qcwm+r5YlkuqJ6FR+yespmIuNQAVpxUYye4PyGXndUZMS4md/
psUxjCWcuHFQL92B6ek/UPDcpK1Ox2UA2jcB1IaX7yXjI2a9aqDNnfZEDzSRIeey
L1yg4D3gBmXb
=73Ne
-----END PGP SIGNATURE-----
Merge tag 'sysctl-fixes-v2-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull another sysctl fix from Luis Chamberlain:
"Just one minor nit I forgot to merge"
* tag 'sysctl-fixes-v2-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: set variable sysctl_mount_point storage-class-specifier to static
The pointer 'server' is assigned but never read, the pointer is
redundant and can be removed. Cleans up clang scan build warning:
fs/smb/client/dfs.c:217:3: warning: Value stored to 'server' is
never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We switch session state to SES_EXITING without cifs_tcp_ses_lock now,
it may lead to potential use-after-free issue.
Consider the following execution processes:
Thread 1:
__cifs_put_smb_ses()
spin_lock(&cifs_tcp_ses_lock)
if (--ses->ses_count > 0)
spin_unlock(&cifs_tcp_ses_lock)
return
spin_unlock(&cifs_tcp_ses_lock)
---> **GAP**
spin_lock(&ses->ses_lock)
if (ses->ses_status == SES_GOOD)
ses->ses_status = SES_EXITING
spin_unlock(&ses->ses_lock)
Thread 2:
cifs_find_smb_ses()
spin_lock(&cifs_tcp_ses_lock)
list_for_each_entry(ses, ...)
spin_lock(&ses->ses_lock)
if (ses->ses_status == SES_EXITING)
spin_unlock(&ses->ses_lock)
continue
...
spin_unlock(&ses->ses_lock)
if (ret)
cifs_smb_ses_inc_refcount(ret)
spin_unlock(&cifs_tcp_ses_lock)
If thread 1 is preempted in the gap and thread 2 start executing, thread 2
will get the session, and soon thread 1 will switch the session state to
SES_EXITING and start releasing it, even though thread 1 had increased the
session's refcount and still uses it.
So switch session state under cifs_tcp_ses_lock to eliminate this gap.
Signed-off-by: Winston Wen <wentao@uniontech.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmSfmDYACgkQiiy9cAdy
T1FR4Qv+M5G6dBYK8iSBGxOOam5y7cuOZYumopcoVvZq+m+6vMsVJIh0aVx+C3u5
+Whajk+MYMXx9rSItZvvXlNlpzzLQF3O9MXJ3nS1DkAexgvdZawMATWAbgHZWIw8
F0k1t0+wI2jw4Nel1rdlkokx+v9YJXsK8hUX8te2OmcpmmymylK4stIi7QCSaCFO
2uctnwaBtVtlQ6areii0/p/cxiJ/vrxCCa4Yu/zKP3UKOQDGmFxPYanSaeovo38R
/0LsuU2S/nba1gkXt65NdBziwsjAoj/6IVznT7989jYQd7zFeut8oNQO2PONtanB
oXviCd9IP2vdMpjsLTuyiKifR4EmAF7KLUP/cjtMVIgY7eYn8ssZ6DCPkhofbehd
gjKCDQBannv4d8nVTKXvwEfnh4zv2l1CxVjmncslDLk9fi27g9da4QFai+74fwvj
pDLW1/D/OFyxno8OTpzOL3PqLs0c3FgYl75Z5Q+SypAeJCdCgSzkRlz/mVdl3BpG
+YrcIcEz
=dyvQ
-----END PGP SIGNATURE-----
Merge tag '6.5-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client updates from Steve French:
- Deferred close fix
- Debugging improvements: display missing mount option, dump rc on
invalidate inode failures, print client_guid in DebugData, log
session id when matching session not found in reconnect, new dynamic
tracepoint for session not found
- Mount fixes including: potential null dereference, and possible
memory leak and path name parsing when double slashes
- Fix potential use after free in compounding
- Two crediting (flow control) fixes: fix for crediting leak (stress
scenario with excess lease credits) and better locking around
updating credits
- Three cleanups from issues pointed out by the kernel test robot
- Session state check improvements (including for potential use after
free)
- DFS fixes: Fix for getattr on link when DFS disabled, fix for DFS
mounts to same share with different prefix paths, DFS mount error
checking improvement
* tag '6.5-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6:
cifs: new dynamic tracepoint to track ses not found errors
cifs: log session id when a matching ses is not found
smb: client: improve DFS mount check
smb: client: fix shared DFS root mounts with different prefixes
smb: client: fix parsing of source mount option
smb: client: fix broken file attrs with nodfs mounts
cifs: print client_guid in DebugData
cifs: fix session state check in smb2_find_smb_ses
cifs: fix session state check in reconnect to avoid use-after-free issue
cifs: do all necessary checks for credits within or before locking
cifs: prevent use-after-free by freeing the cfile later
smb: client: fix warning in generic_ip_connect()
smb: client: fix warning in CIFSFindNext()
smb: client: fix warning in CIFSFindFirst()
smb3: do not reserve too many oplock credits
cifs: print more detail when invalidate_inode_mapping fails
smb: client: fix warning in cifs_smb3_do_mount()
smb: client: fix warning in cifs_match_super()
cifs: print nosharesock value while dumping mount options
SMB3: Do not send lease break acknowledgment if all file handles have been closed
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmSeBMcACgkQiiy9cAdy
T1Fzowv/alhVztEYWhb1WF5ypxIu66+Y4SkqzHTNeVD1wYZxRp1Dinzj48J+8ZWZ
swmBO6RYlr8DpoXaQe6FVuJb075xW5eFPUj2xJHn7/8/YITBC5UPjLEYipEWQVnU
4jHgeSalDw6pVSUoq9b6RhHggqHeDCPTOGneQ9LzfWziVIHnglnGNmyq12ODO504
xlrpnBMcZ83Taj6oqf+CBLjGE768JNkrG9aIef50OdkLj1qaGaoKdQoEtRXZeRx5
uD6cjwqm3GmsxLYCElThbJDpCqa8Pejc91/BR6CsqgwO+llrVVF0l2BZNslec48T
SDGKnNyBHIxRgyemvyxqo5NQlV75VL9ger0co/LeVFAzJFLO7c1rNjKlx/BJ2jjZ
OBB+rqsoK+Eva0OEpNBfuIGe9pi7GHxpCoq1vAOYDJjwsKqYS65N6A0nsoyXYm4Z
PorjfeoFx+52uUS0X0YPyFldo+cw2K2zVaRUQlKfajkaldtpe/pdb0g4Zw4pbJ3H
fr2e5akF
=pFLY
-----END PGP SIGNATURE-----
Merge tag '6.5-rc-ksmbd-server-fixes-part1' of git://git.samba.org/ksmbd
Pull ksmbd server updates from Steve French:
- two fixes for compounding bugs (make sure no out of bound reads with
less common combinations of commands in the compound)
- eight minor cleanup patches (e.g. simplifying return values, replace
one element array, use of kzalloc where simpler)
- fix for clang warning on possible overflow in filename conversion
* tag '6.5-rc-ksmbd-server-fixes-part1' of git://git.samba.org/ksmbd:
ksmbd: avoid field overflow warning
ksmbd: Replace one-element array with flexible-array member
ksmbd: Use struct_size() helper in ksmbd_negotiate_smb_dialect()
ksmbd: add missing compound request handing in some commands
ksmbd: fix out of bounds read in smb2_sess_setup
ksmbd: Replace the ternary conditional operator with min()
ksmbd: use kvzalloc instead of kvmalloc
ksmbd: Change the return value of ksmbd_vfs_query_maximal_access to void
ksmbd: return a literal instead of 'err' in ksmbd_vfs_kern_path_locked()
ksmbd: use kzalloc() instead of __GFP_ZERO
ksmbd: remove unused ksmbd_tree_conn_share function
Although some more stuff is brewing, the EFI changes that are ready for
mainline are few, so not a lot to pull this cycle:
- improve the PCI DMA paranoia logic in the EFI stub
- some constification changes
- add statfs support to efivarfs
- allow user space to enumerate updatable firmware resources without
CAP_SYS_ADMIN
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZJ1jIwAKCRAwbglWLn0t
XDs8AP9PAAWIgukyXkYpoxabaQQK1Pqw6Zv63XAcNYBHa4zjHwD/UTcYviQIlI0B
Rfj4i8pDQVVfReSI+lKWvhXfRQ5Qbgs=
=w6zX
-----END PGP SIGNATURE-----
Merge tag 'efi-next-for-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:
"Although some more stuff is brewing, the EFI changes that are ready
for mainline are few this cycle:
- improve the PCI DMA paranoia logic in the EFI stub
- some constification changes
- add statfs support to efivarfs
- allow user space to enumerate updatable firmware resources without
CAP_SYS_ADMIN"
* tag 'efi-next-for-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/libstub: Disable PCI DMA before grabbing the EFI memory map
efi/esrt: Allow ESRT access without CAP_SYS_ADMIN
efivarfs: expose used and total size
efi: make kobj_type structure constant
efi: x86: make kobj_type structure constant
syzbot reports below bug:
BUG: KASAN: slab-use-after-free in f2fs_truncate_data_blocks_range+0x122a/0x14c0 fs/f2fs/file.c:574
Read of size 4 at addr ffff88802a25c000 by task syz-executor148/5000
CPU: 1 PID: 5000 Comm: syz-executor148 Not tainted 6.4.0-rc7-syzkaller-00041-ge660abd551f1 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:351
print_report mm/kasan/report.c:462 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:572
f2fs_truncate_data_blocks_range+0x122a/0x14c0 fs/f2fs/file.c:574
truncate_dnode+0x229/0x2e0 fs/f2fs/node.c:944
f2fs_truncate_inode_blocks+0x64b/0xde0 fs/f2fs/node.c:1154
f2fs_do_truncate_blocks+0x4ac/0xf30 fs/f2fs/file.c:721
f2fs_truncate_blocks+0x7b/0x300 fs/f2fs/file.c:749
f2fs_truncate.part.0+0x4a5/0x630 fs/f2fs/file.c:799
f2fs_truncate include/linux/fs.h:825 [inline]
f2fs_setattr+0x1738/0x2090 fs/f2fs/file.c:1006
notify_change+0xb2c/0x1180 fs/attr.c:483
do_truncate+0x143/0x200 fs/open.c:66
handle_truncate fs/namei.c:3295 [inline]
do_open fs/namei.c:3640 [inline]
path_openat+0x2083/0x2750 fs/namei.c:3791
do_filp_open+0x1ba/0x410 fs/namei.c:3818
do_sys_openat2+0x16d/0x4c0 fs/open.c:1356
do_sys_open fs/open.c:1372 [inline]
__do_sys_creat fs/open.c:1448 [inline]
__se_sys_creat fs/open.c:1442 [inline]
__x64_sys_creat+0xcd/0x120 fs/open.c:1442
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The root cause is, inodeA references inodeB via inodeB's ino, once inodeA
is truncated, it calls truncate_dnode() to truncate data blocks in inodeB's
node page, it traverse mapping data from node->i.i_addr[0] to
node->i.i_addr[ADDRS_PER_BLOCK() - 1], result in out-of-boundary access.
This patch fixes to add sanity check on dnode page in truncate_dnode(),
so that, it can help to avoid triggering such issue, and once it encounters
such issue, it will record newly introduced ERROR_INVALID_NODE_REFERENCE
error into superblock, later fsck can detect such issue and try repairing.
Also, it removes f2fs_truncate_data_blocks() for cleanup due to the
function has only one caller, and uses f2fs_truncate_data_blocks_range()
instead.
Reported-and-tested-by: syzbot+12cb4425b22169b52036@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/000000000000f3038a05fef867f8@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If a file is not comprssed yet or does not have compressed data,
for example, its data has a very low compression ratio, do not
set FI_COMPRESS_RELEASED flag.
Signed-off-by: Sheng Yong <shengyong@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fs/f2fs/node.c: In function ‘f2fs_destroy_node_manager’:
fs/f2fs/node.c:3390:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=]
3390 | }
Merging below pointer arrays into common one, and reuse it by cast type.
struct nat_entry *natvec[NATVEC_SIZE];
struct nat_entry_set *setvec[SETVEC_SIZE];
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If truncate_node() fails in truncate_dnode(), it missed to call
f2fs_put_page(), fix it.
Fixes: 7735730d39 ("f2fs: fix to propagate error from __get_meta_page()")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
smatch reports
fs/proc/proc_sysctl.c:32:18: warning: symbol
'sysctl_mount_point' was not declared. Should it be static?
This variable is only used in its defining file, so it should be static.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
If a client sends out a cap update dropping caps with the prior 'seq'
just before an incoming cap revoke request, then the client may drop
the revoke because it believes it's already released the requested
capabilities.
This causes the MDS to wait indefinitely for the client to respond
to the revoke. It's therefore always a good idea to ack the cap
revoke request with the bumped up 'seq'.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/61782
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In case:
mds client
- Releases cap and put Inode
- Increase cap->seq and sends
revokes req to the client
- Receives release req and - Receives & drops the revoke req
skip removing the cap and
then eval the CInode and
issue or revoke caps again.
- Receives & drops the caps update
or revoke req
- Health warning for client
isn't responding to
mclientcaps(revoke)
All the IMPORT/REVOKE/GRANT cap ops will increase the session seq
in MDS side and then the client need to issue a cap release to
unblock MDS to remove the corresponding cap to unblock possible
waiters.
Link: https://tracker.ceph.com/issues/61332
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The 'i_wr_ref' is used to track the 'Fb' caps, while whenever the 'Fb'
caps is took the kclient will always take the 'Fw' caps at the same
time. That means it will always be a false check in __ceph_finish_cap_snap().
When writing to buffer the kclient will take both 'Fb|Fw' caps and then
write the contents to the buffer pages by increasing the 'i_wrbuffer_ref'
and then just release both 'Fb|Fw'. This is different with the user
space libcephfs, which will keep the 'Fb' being took and use 'i_wr_ref'
instead of 'i_wrbuffer_ref' to track this until the buffer is flushed
to Rados.
We need to defer flushing the capsnap until the corresponding buffer
pages are all flushed to Rados, and at the same time just trigger to
flush the buffer pages immediately.
Link: https://tracker.ceph.com/issues/48640
Link: https://tracker.ceph.com/issues/59343
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Blindly expanding the readahead windows will cause unneccessary
pagecache thrashing and also will introduce the network workload.
We should disable expanding the windows if the readahead is disabled
and also shouldn't expand the windows too much.
Expanding forward firstly instead of expanding backward for possible
sequential reads.
Bound `rreq->len` to the actual file size to restore the previous page
cache usage.
The posix_fadvise may change the maximum size of a file readahead.
Cc: stable@vger.kernel.org
Fixes: 4987005600 ("ceph: convert ceph_readpages to ceph_readahead")
Link: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn
Link: https://www.spinics.net/lists/ceph-users/msg76183.html
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Hu Weiwen <sehuww@mail.scut.edu.cn>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
For write requests the parent's mtime will be updated correspondingly.
And if the 'Xx' caps is issued and when releasing other caps together
with the write requests the MDS Locker will try to eval the xattr lock,
which need to change the locker state excl --> sync and need to do Xx
caps revocation.
Just voluntarily dropping CEPH_CAP_XATTR_EXCL caps to avoid a cap
revoke message, which could cause the mtime will be overwrote by stale
one.
[ idryomov: break unnecessarily long lines ]
Link: https://tracker.ceph.com/issues/61584
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When the msgs are corrupted we need to dump them and then it will
be easier to dig what has happened and where the issue is.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When the MDS rank is in clientreplay state, the metrics requests
will be discarded directly. Also, when there are a lot of known
client requests to recover from, the metrics requests will slow
down the MDS rank from getting to the active state sooner.
With this patch, we will send the metrics requests only when the
MDS rank is in active state.
Link: https://tracker.ceph.com/issues/61524
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmScT18ACgkQnJ2qBz9k
QNnlqAf/bIU+I3Qd3EUpzWrOXEyRjaUggRnb4ibIH2I6DjSAP4wtm5wiG/+wjDFe
v+gdRd8PlAlHbZJvW3WUxeSzWendqd78i2lgwFN+s2QCVtQSUsNy7mtUvOL2b1zy
Kf35vTNbkKE0TevoqHZmoT/mehSBj6Zt4k5POMalfxwnJHoVF25OqHEQQc8vnOjv
as/uMaHVwK/Q0pMafTz8vt9Fogkdqe6A+qLLxTvG6iQKd2Z0NdYK2GxR0oTVhDOK
Ly+h1evRldgOcrishrje00LZT8SznUQkWBjIpPN/HbXR1qc5Jk+BYJUqT2jg7zVd
EW61U79nsaugpTUicpTUIluUZ7/QKA==
=toKL
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc filesystem updates from Jan Kara:
- Rewrite kmap_local() handling in ext2
- Convert ext2 direct IO path to iomap (with some infrastructure tweaks
associated with that)
- Convert two boilerplate licenses in udf to SPDX identifiers
- Other small udf, ext2, and quota fixes and cleanups
* tag 'fs_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix uninitialized array access for some pathnames
ext2: Drop fragment support
quota: fix warning in dqgrab()
quota: Properly disable quotas when add_dquot_ref() fails
fs: udf: udftime: Replace LGPL boilerplate with SPDX identifier
fs: udf: Replace GPL 2.0 boilerplate license notice with SPDX identifier
fs: Drop wait_unfrozen wait queue
ext2_find_entry()/ext2_dotdot(): callers don't need page_addr anymore
ext2_{set_link,delete_entry}(): don't bother with page_addr
ext2_put_page(): accept any pointer within the page
ext2_get_page(): saner type
ext2: use offset_in_page() instead of open-coding it as subtraction
ext2_rename(): set_link and delete_entry may fail
ext2: Add direct-io trace points
ext2: Move direct-io to use iomap
ext2: Use generic_buffers_fsync() implementation
ext4: Use generic_buffers_fsync_noflush() implementation
fs/buffer.c: Add generic_buffers_fsync*() implementation
ext2/dax: Fix ext2_setsize when len is page aligned
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmScTY8ACgkQnJ2qBz9k
QNnfwwgAhZtow1klDLH6qCnWMufB6AwT7VfAHyA3fvyTYMjUKb0sGHkpuh8hqVOb
Lzb4YB+jSWV8XnMFn/4gFJQU/nAv8bMPavghMGpr5VNjQi7WkxYF/GB6O1I5NOHK
EnJjDExgdxXDJZORaaXLVJWrtzJuDFgdiSeIwJECFa0MdTHNgPy3XOl+PPxnYQ/V
xyHyP5ImGgd5O4iy3PFDQBGgOXIMrBX8IMce+qLQNYIvjSIUgmdnIkoUCvsQiisp
LyKI2LxqAqnpA4h4Ow6hOZDw2VlPT0vDwFVUfFIZMIqs5YgaSbWa1Z6cs37MigAn
fgUyRVx2y8A2Lwla7rwLaUEToRVADw==
=ZdcG
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
- Support for fanotify events returning file handles for filesystems
not exportable via NFS
- Improved error handling exportfs functions
- Add missing FS_OPEN events when unusual open helpers are used
* tag 'fsnotify_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: move fsnotify_open() hook into do_dentry_open()
exportfs: check for error return value from exportfs_encode_*()
fanotify: support reporting non-decodeable file handles
exportfs: allow exporting non-decodeable file handles to userspace
exportfs: add explicit flag to request non-decodeable file handles
exportfs: change connectable argument to bit flags
The dlm posix lock handling (for gfs2) has three notable changes:
- Local pids returned from GETLK are no longer negated. A previous
patch negating remote pids mistakenly changed local pids also.
- SETLKW operations can now be interrupted only when the process is
killed, and not from other signals. General interruption was
resulting in previously acquired locks being cleared, not just
the in-progress lock. Handling this correctly will require
extending a cancel capability to user space (a future feature.)
- If multiple threads are requesting posix locks (with SETLKW),
fix incorrect matching of results to the requests.
The dlm networking has several minor cleanups, and one notable change:
- Avoid delaying ack messages for too long (used for message reliability),
resulting in a backlog of un-acked messages. These could previously
be delayed as a result of either too many or too few other messages
being sent. Now an upper and lower threshold is used to determine
when an ack should be sent.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJkneDWAAoJEDgbc8f8gGmqH+sP/A7MYVNHPe2uBpGwsVMjesV+
WTfL/JFKh1Ejpgc1gdaH5qxvKdtW7EZMrwdKrqNmxaHKAVkQIAgE+W8lueH0f/EO
u1UCdU6GXg5VCwzLlJlELrJ/FmFGnKdtcsrRDEJfBlMMH2DrJOnWYikvE7t29Yn9
f2z8xaePyDMmC8urhIFwiikNiRxWeVCfzzEF3XK+HdCREa77ELdWidcpnq8U+gSY
YAtHy1CR8CZ6zqHXYJrFIq2O0zFMJZTibL3XEWSQ1pusLYTE8XT7ElvN2pVV7vv8
C0bLfoOf7PcGIyNDTAA8hkF/y8tO7C4TqoFa0ZqGZpYWvAukGeioo54qF9+rDDMi
pRugqwn/L4zF5qMn3/k+LM0F8bBXERGVWbgp0DdJ5YohFyo3I7zvCaLeIYwEHpdz
HSMpZtXXpAV8T3utF/s3TxOJ0hIX0v7ccJCF7ssRi7wFMru2kUZ2u0DN10ekXtc8
xyA7rf+dpmz/qijBctmJiZViJ7sTZrnmG+dPa55KDE0DKbceMFIjIDCrmakvJQJI
b0Cn3OfyQpuE9Aov3ZH2dQhwXgzoAMvqeTWjh0nRDzEyP3iO+NiL7Qu1OFMZ0GQZ
BQ7JYTHGlhYtfcQJXNlCZvWHx59iBOqWWT17ZEkBVtEzt4qaCMJ1cas3aNqEnRrv
g760Sp9RMbomS0nRNNBS
=+TTu
-----END PGP SIGNATURE-----
Merge tag 'dlm-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"The dlm posix lock handling (for gfs2) has three notable changes:
- Local pids returned from GETLK are no longer negated. A previous
patch negating remote pids mistakenly changed local pids also.
- SETLKW operations can now be interrupted only when the process is
killed, and not from other signals. General interruption was
resulting in previously acquired locks being cleared, not just the
in-progress lock. Handling this correctly will require extending a
cancel capability to user space (a future feature.)
- If multiple threads are requesting posix locks (with SETLKW), fix
incorrect matching of results to the requests.
The dlm networking has several minor cleanups, and one notable change:
- Avoid delaying ack messages for too long (used for message
reliability), resulting in a backlog of un-acked messages. These
could previously be delayed as a result of either too many or too
few other messages being sent. Now an upper and lower threshold is
used to determine when an ack should be sent"
* tag 'dlm-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
fs: dlm: remove filter local comms on close
fs: dlm: add send ack threshold and append acks to msgs
fs: dlm: handle sequence numbers as atomic
fs: dlm: handle lkb wait count as atomic_t
fs: dlm: filter ourself midcomms calls
fs: dlm: warn about messages from left nodes
fs: dlm: move dlm_purge_lkb_callbacks to user module
fs: dlm: cleanup STOP_IO bitflag set when stop io
fs: dlm: don't check othercon twice
fs: dlm: unregister memory at the very last
fs: dlm: fix missing pending to false
fs: dlm: clear pending bit when queue was empty
fs: dlm: revert check required context while close
fs: dlm: fix mismatch of plock results from userspace
fs: dlm: make F_SETLK use unkillable wait_event
fs: dlm: interrupt posix locks only when process is killed
fs: dlm: fix cleanup pending ops when interrupted
fs: dlm: return positive pid value for F_GETLK
dlm: Replace all non-returning strlcpy with strscpy
* Fix a problem where shrink would blow out the space reserve by
declining to shrink the filesystem.
* Drop the EXPERIMENTAL tag for the large extent counts feature.
* Set FMODE_CAN_ODIRECT and get rid of an address space op.
* Fix an AG count overflow bug in growfs if the new device size is
redonkulously large.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHQEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZIs45AAKCRBKO3ySh0YR
ps5NAP92oOaMlXeaxTTGLnbCe/sQhQiVfjE45sQL2BziHN/s2gD2OX01yn2w+Mpg
CdQ6HChUzL2fU3eleh1yMNR7McuaCA==
=hQX7
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.5-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"There's not much going on this cycle -- the large extent counts
feature graduated, so now users can create more extremely fragmented
files! :P
The rest are bug fixes; and I'll be sending more next week.
- Fix a problem where shrink would blow out the space reserve by
declining to shrink the filesystem
- Drop the EXPERIMENTAL tag for the large extent counts feature
- Set FMODE_CAN_ODIRECT and get rid of an address space op
- Fix an AG count overflow bug in growfs if the new device size is
redonkulously large"
* tag 'xfs-6.5-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix ag count overflow during growfs
xfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
xfs: drop EXPERIMENTAL tag for large extent counts
xfs: don't deplete the reserve pool when trying to shrink the fs
journalling, and block allocator subsystems. Also improve performance
for parallel DIO overwrites.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmSaIWAACgkQ8vlZVpUN
gaODEAf9GLk68DvU9iOhgJ1p/lMIqtbY0vvB1aeiQg7Z99mk/Vc//R5qQvtO2oN5
9G4OMSGKoUO0x9OlvDIw6za1BsE1pGHyBLmei7PO1JpHop6b6hKj+WQVPWb43v15
TI0vIkWzwJI2eIxsTqvpMkgwZ3aNL9c52xFyjwk/6lAsw4y2wxEls/NZhhE2tAXF
w/RFmI9RC/AZy1JX3VeruzeiSvAq+JAnsW8iNIoN5nBvWU7yXLA3b4mcoWWrCQ5E
sKqOkhTeobhYsAie6dxGhri/JrL1HwPOpJ8SWWmrlLWXoMVx1rXxW3OnxIAEl9sz
05n7Z+6LvI6aEk+rnjCqt4Z1cpIIEA==
=cAq/
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Various cleanups and bug fixes in ext4's extent status tree,
journalling, and block allocator subsystems.
Also improve performance for parallel DIO overwrites"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (55 commits)
ext4: avoid updating the superblock on a r/o mount if not needed
jbd2: skip reading super block if it has been verified
ext4: fix to check return value of freeze_bdev() in ext4_shutdown()
ext4: refactoring to use the unified helper ext4_quotas_off()
ext4: turn quotas off if mount failed after enabling quotas
ext4: update doc about journal superblock description
ext4: add journal cycled recording support
jbd2: continue to record log between each mount
jbd2: remove j_format_version
jbd2: factor out journal initialization from journal_get_superblock()
jbd2: switch to check format version in superblock directly
jbd2: remove unused feature macros
ext4: ext4_put_super: Remove redundant checking for 'sbi->s_journal_bdev'
ext4: Fix reusing stale buffer heads from last failed mounting
ext4: allow concurrent unaligned dio overwrites
ext4: clean up mballoc criteria comments
ext4: make ext4_zeroout_es() return void
ext4: make ext4_es_insert_extent() return void
ext4: make ext4_es_insert_delayed_block() return void
ext4: make ext4_es_remove_extent() return void
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmSTFh8ACgkQEVvVyTe/
1Wpy5g/+NTL00ptBHHnEZJG7rRu+PzvjB+gzI8YpMT4kiGkJBNKZIxvgURvisDJ/
coOXySq0yk0y901r2Rb+hP8YSYcSo8nrLV+jJwlZcmFsudtbSbOdUUJiAUjqGZk2
YxiCl9NrAZJMVQj6AfL1LE0L00NmIxXruGiCXWEnbHjYvmy/LIsLOg9b+vlkPPZM
xqxQq//hVZel5Lh6rmz9svJIqFF/e3zobX7NBV0MiFGHX926u2TV0ImAnjmhuxdV
mZ8zKCQ4kGTU++PjOCyKPS+xF9eg6mlCVW9uq/QIN83muhwaUlOimJEBunbNxBQz
if/OWxSMZv5UFXqYbF6YsHQrWupCV2uEcs8ypcGNsPX3AeX9i8JTGdVWMPax+4HO
IMZ7Ltxf8Udihsn4s9n0ZMgSDbMTcPRFMNrHXtWU6FuEEjbP5SuD1KREJrh58myL
I99/0tOICAx3ZRueZNmqz4BJ9zev+86PtPNDqHbZy2qrwBn7C6ly3AFWCCv32mza
ehINgUv/TtzUBKqkbhznrjinrv5upV3CV9xg4L8ohhuVeoa8h3kGf8H5XSfnhY56
x1DNWAZK+yZpLf/I39hrWwSME0T1D3GEuI2NlPmAKpqZSgwYdhSAcfebfQhmwD3Q
B3qY3Dh74dsN9zW9oygcXXtHHizzsomOc/qXkXvYP4MXkQi0+h0=
=KPxq
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs update from Amir Goldstein:
- fix two NULL pointer deref bugs (Zhihao Cheng)
- add support for "data-only" lower layers destined to be used by
composefs
- port overlayfs to the new mount api (Christian Brauner)
* tag 'ovl-update-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: (26 commits)
ovl: add Amir as co-maintainer
ovl: reserve ability to reconfigure mount options with new mount api
ovl: modify layer parameter parsing
ovl: port to new mount api
ovl: factor out ovl_parse_options() helper
ovl: store enum redirect_mode in config instead of a string
ovl: pass ovl_fs to xino helpers
ovl: clarify ovl_get_root() semantics
ovl: negate the ofs->share_whiteout boolean
ovl: check type and offset of struct vfsmount in ovl_entry
ovl: implement lazy lookup of lowerdata in data-only layers
ovl: prepare for lazy lookup of lowerdata inode
ovl: prepare to store lowerdata redirect for lazy lowerdata lookup
ovl: implement lookup in data-only layers
ovl: introduce data-only lower layers
ovl: remove unneeded goto instructions
ovl: deduplicate lowerdata and lowerstack[]
ovl: deduplicate lowerpath and lowerstack[]
ovl: move ovl_entry into ovl_inode
ovl: factor out ovl_free_entry() and ovl_stack_*() helpers
...
Olga Kornievskaia reports that this patch breaks NFSv4.0 state recovery.
It also introduces additional complexity in the error paths for cases not
related to the original problem. Let's revert it for now, and address the
original problem in another manner.
This reverts commit f5ea16137a.
Fixes: f5ea16137a ("NFSv4: Retry LOCK on OLD_STATEID during delegation return")
Reported-by: Kornievskaia, Olga <Olga.Kornievskaia@netapp.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Its possible to end up in nfs_free_server() before the server's superblock
sysfs entry has been initialized, in which case calling kobject_put() will
emit a WARNING. Check if the kobject has been initialized before cleaning
it up.
Fixes: 1c7251187d ("NFS: add superblock sysfs entries")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Need to happen before we allocate and then leak the xefi. Found by
coverity via an xfsprogs libxfs scan.
[djwong: This also fixes the type of the @agbno argument.]
Fixes: 7dfee17b13 ("xfs: validate block number being freed before adding to xefi")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The AGF verifier does not check that the AGF length field is within
known good bounds. This has never been checked by runtime kernel
code (i.e. the lack of verification goes back to 1993) yet we assume
in many places that it is correct and verify other metdata against
it.
Add length verification to the AGF verifier. The length of the AGF
must be equal to the size of the AG specified in the superblock,
unless it is the last AG in the filesystem. In that case, it must be
less than or equal to sb->sb_agblocks and greater than
XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will
allow to exist.
This requires a bit of rework of the verifier function. We want to
verify metadata before we use it to verify other metadata. Hence
we need to verify the AGF sequence numbers before using them to
verify the length of the AGF. Then we can verify the AGF length
before we verify AGFL fields. Then we can verifier other fields that
are bounds limited by the AGF length.
And, finally, by calculating agf_length only once into a local
variable, we can collapse repeated "if (xfs_has_foo() &&"
conditionaly checks into single checks. This makes the code much
easier to follow as all the checks for a given feature are obviously
in the same place.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
If the journal geometry results in a sector or log stripe unit
validation problem, it indicates that we cannot set the log up to
safely write to the the journal. In these cases, we must abort the
mount because the corruption needs external intervention to resolve.
Similarly, a journal that is too large cannot be written to safely,
either, so we shouldn't allow those geometries to mount, either.
If the log is too small, we risk having transaction reservations
overruning the available log space and the system hanging waiting
for space it can never provide. This is purely a runtime hang issue,
not a corruption issue as per the first cases listed above. We abort
mounts of the log is too small for V5 filesystems, but we must allow
v4 filesystems to mount because, historically, there was no log size
validity checking and so some systems may still be out there with
undersized logs.
The problem is that on V4 filesystems, when we discover a log
geometry problem, we skip all the remaining checks and then allow
the log to continue mounting. This mean that if one of the log size
checks fails, we skip the log stripe unit check. i.e. we allow the
mount because a "non-fatal" geometry is violated, and then fail to
check the hard fail geometries that should fail the mount.
Move all these fatal checks to the superblock verifier, and add a
new check for the two log sector size geometry variables having the
same values. This will prevent any attempt to mount a log that has
invalid or inconsistent geometries long before we attempt to mount
the log.
However, for the minimum log size checks, we can only do that once
we've setup up the log and calculated all the iclog sizes and
roundoffs. Hence this needs to remain in the log mount code after
the log has been initialised. It is also the only case where we
should allow a v4 filesystem to continue running, so leave that
handling in place, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
If the current transaction holds a busy extent and we are trying to
allocate a new extent to fix up the free list, we can deadlock if
the AG is entirely empty except for the busy extent held by the
transaction.
This can occur at runtime processing an XEFI with multiple extents
in this path:
__schedule+0x22f at ffffffff81f75e8f
schedule+0x46 at ffffffff81f76366
xfs_extent_busy_flush+0x69 at ffffffff81477d99
xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a
xfs_alloc_ag_vextent+0x19b at ffffffff81417edb
xfs_alloc_fix_freelist+0x22f at ffffffff8141896f
xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a
__xfs_free_extent+0x99 at ffffffff81419499
xfs_trans_free_extent+0x3e at ffffffff814a6fee
xfs_extent_free_finish_item+0x24 at ffffffff814a70d4
xfs_defer_finish_noroll+0x1f7 at ffffffff81441407
xfs_defer_finish+0x11 at ffffffff814417e1
xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd
xfs_inactive_truncate+0xb9 at ffffffff8148bb89
xfs_inactive+0x227 at ffffffff8148c4f7
xfs_fs_destroy_inode+0xb8 at ffffffff81496898
destroy_inode+0x3b at ffffffff8127d2ab
do_unlinkat+0x1d1 at ffffffff81270df1
do_syscall_64+0x40 at ffffffff81f6b5f0
entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c
This can also happen in log recovery when processing an EFI
with multiple extents through this path:
context_switch() kernel/sched/core.c:3881
__schedule() kernel/sched/core.c:5111
schedule() kernel/sched/core.c:5186
xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
__xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
xfs_log_mount_finish() fs/xfs/xfs_log.c:764
xfs_mountfs() fs/xfs/xfs_mount.c:978
xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
mount_bdev() fs/super.c:1417
xfs_fs_mount() fs/xfs/xfs_super.c:1985
legacy_get_tree() fs/fs_context.c:647
vfs_get_tree() fs/super.c:1547
do_new_mount() fs/namespace.c:2843
do_mount() fs/namespace.c:3163
ksys_mount() fs/namespace.c:3372
__do_sys_mount() fs/namespace.c:3386
__se_sys_mount() fs/namespace.c:3383
__x64_sys_mount() fs/namespace.c:3383
do_syscall_64() arch/x86/entry/common.c:296
entry_SYSCALL_64() arch/x86/entry/entry_64.S:180
To avoid this deadlock, we should not block in
xfs_extent_busy_flush() if we hold a busy extent in the current
transaction.
Now that the EFI processing code can handle requeuing a partially
completed EFI, we can detect this situation in
xfs_extent_busy_flush() and return -EAGAIN rather than going to
sleep forever. The -EAGAIN get propagated back out to the
xfs_trans_free_extent() context, where the EFD is populated and the
transaction is rolled, thereby moving the busy extents into the CIL.
At this point, we can retry the extent free operation again with a
clean transaction. If we hit the same "all free extents are busy"
situation when trying to fix up the free list, we can safely call
xfs_extent_busy_flush() and wait for the busy extents to resolve
and wake us. At this point, the allocation search can make progress
again and we can fix up the free list.
This deadlock was first reported by Chandan in mid-2021, but I
couldn't make myself understood during review, and didn't have time
to fix it myself.
It was reported again in March 2023, and again I have found myself
unable to explain the complexities of the solution needed during
review.
As such, I don't have hours more time to waste trying to get the
fix written the way it needs to be written, so I'm just doing it
myself. This patchset is largely based on Wengang Wang's last patch,
but with all the unnecessary stuff removed, split up into multiple
patches and cleaned up somewhat.
Reported-by: Chandan Babu R <chandanrlinux@gmail.com>
Reported-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Extent freeing neeeds to be able to avoid a busy extent deadlock
when the transaction itself holds the only busy extents in the
allocation group. This may occur if we have an EFI that contains
multiple extents to be freed, and the freeing the second intent
requires the space the first extent free released to expand the
AGFL. If we block on the busy extent at this point, we deadlock.
We hold a dirty transaction that contains a entire atomic extent
free operations within it, so if we can abort the extent free
operation and commit the progress that we've made, the busy extent
can be resolved by a log force. Hence we can restart the aborted
extent free with a new transaction and continue to make
progress without risking deadlocks.
To enable this, we need the EFI processing code to be able to handle
an -EAGAIN error to tell it to commit the current transaction and
retry again. This mechanism is already built into the defer ops
processing (used bythe refcount btree modification intents), so
there's relatively little handling we need to add to the EFI code to
enable this.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
To avoid blocking in xfs_extent_busy_flush() when freeing extents
and the only busy extents are held by the current transaction, we
need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
into xfs_extent_busy_flush().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Btrees that aren't freespace management trees use the normal extent
allocation and freeing routines for their blocks. Hence when a btree
block is freed, a direct call to xfs_free_extent() is made and the
extent is immediately freed. This puts the entire free space
management btrees under this path, so we are stacking btrees on
btrees in the call stack. The inobt, finobt and refcount btrees
all do this.
However, the bmap btree does not do this - it calls
xfs_free_extent_later() to defer the extent free operation via an
XEFI and hence it gets processed in deferred operation processing
during the commit of the primary transaction (i.e. via intent
chaining).
We need to change xfs_free_extent() to behave in a non-blocking
manner so that we can avoid deadlocks with busy extents near ENOSPC
in transactions that free multiple extents. Inserting or removing a
record from a btree can cause a multi-level tree merge operation and
that will free multiple blocks from the btree in a single
transaction. i.e. we can call xfs_free_extent() multiple times, and
hence the btree manipulation transaction is vulnerable to this busy
extent deadlock vector.
To fix this, convert all the remaining callers of xfs_free_extent()
to use xfs_free_extent_later() to queue XEFIs and hence defer
processing of the extent frees to a context that can be safely
restarted if a deadlock condition is detected.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
XFS has strict metadata ordering requirements. One of the things it
does is maintain the commit order of items from transaction commit
through the CIL and into the AIL. That is, if a transaction logs
item A before item B in a modification, then they will be inserted
into the CIL in the order {A, B}. These items are then written into
the iclog during checkpointing in the order {A, B}. When the
checkpoint commits, they are supposed to be inserted into the AIL in
the order {A, B}, and when they are pushed from the AIL, they are
pushed in the order {A, B}.
If we crash, log recovery then replays the two items from the
checkpoint in the order {A, B}, resulting in the objects the items
apply to being queued for writeback at the end of the checkpoint
in the order {A, B}. This means recovery behaves the same way as the
runtime code.
In places, we have subtle dependencies on this ordering being
maintained. One of this place is performing intent recovery from the
log. It assumes that recovering an intent will result in a
non-intent object being the first thing that is modified in the
recovery transaction, and so when the transaction commits and the
journal flushes, the first object inserted into the AIL beyond the
intent recovery range will be a non-intent item. It uses the
transistion from intent items to non-intent items to stop the
recovery pass.
A recent log recovery issue indicated that an intent was appearing
as the first item in the AIL beyond the recovery range, hence
breaking the end of recovery detection that exists.
Tracing indicated insertion of the items into the AIL was apparently
occurring in the right order (the intent was last in the commit item
list), but the intent was appearing first in the AIL. IOWs, the
order of items in the AIL was {D,C,B,A}, not {A,B,C,D}, and bulk
insertion was reversing the order of the items in the batch of items
being inserted.
Lucky for us, all the items fed to bulk insertion have the same LSN,
so the reversal of order does not affect the log head/tail tracking
that is based on the contents of the AIL. It only impacts on code
that has implicit, subtle dependencies on object order, and AFAICT
only the intent recovery loop is impacted by it.
Make sure bulk AIL insertion does not reorder items incorrectly.
Fixes: 0e57f6a36f ("xfs: bulk AIL insertion during transaction commit")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Pointers drop_leaf and save_leaf are initialized with values that are never
read, they are being re-assigned later on just before they are used. Remove
the redundant early initializations and keep the later assignments at the
point where they are used. Cleans up two clang scan build warnings:
fs/xfs/libxfs/xfs_attr_leaf.c:2288:29: warning: Value stored to 'drop_leaf'
during its initialization is never read [deadcode.DeadStores]
fs/xfs/libxfs/xfs_attr_leaf.c:2289:29: warning: Value stored to 'save_leaf'
during its initialization is never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The return value type of i_blocksize() is 'unsigned int', so the
type of blocksize has been modified from 'int' to 'unsigned int'
to ensure data type consistency.
Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
It is perfectly valid to not find session not found errors
when a reconnect of a session happens when requests for the
same session are happening in parallel.
We had these log messages as VFS logs. My last change dumped
these logs as FYI logs.
This change just creates a new dynamic tracepoint to capture
events of this type, just in case it is useful while
debugging issues in the future.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We do not log the session id in crypt_setup when a matching
session is not found. Printing the session id helps debugging
here. This change does just that.
This change also changes this log to FYI, since it is normal to
see then during a reconnect. Doing the same for a similar log
in case of signed connections.
The plan is to have a tracepoint for this event, so that we will
be able to see this event if need be. That will be done as
another change.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
clang warns about a possible field overflow in a memcpy:
In file included from fs/smb/server/smb_common.c:7:
include/linux/fortify-string.h:583:4: error: call to '__write_overflow_field' declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror,-Wattribute-warning]
__write_overflow_field(p_size_field, size);
It appears to interpret the "&out[baselen + 4]" as referring to a single
byte of the character array, while the equivalen "out + baselen + 4" is
seen as an offset into the array.
I don't see that kind of warning elsewhere, so just go with the simple
rework.
Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
This modifies our user mode stack expansion code to always take the
mmap_lock for writing before modifying the VM layout.
It's actually something we always technically should have done, but
because we didn't strictly need it, we were being lazy ("opportunistic"
sounds so much better, doesn't it?) about things, and had this hack in
place where we would extend the stack vma in-place without doing the
proper locking.
And it worked fine. We just needed to change vm_start (or, in the case
of grow-up stacks, vm_end) and together with some special ad-hoc locking
using the anon_vma lock and the mm->page_table_lock, it all was fairly
straightforward.
That is, it was all fine until Ruihan Li pointed out that now that the
vma layout uses the maple tree code, we *really* don't just change
vm_start and vm_end any more, and the locking really is broken. Oops.
It's not actually all _that_ horrible to fix this once and for all, and
do proper locking, but it's a bit painful. We have basically three
different cases of stack expansion, and they all work just a bit
differently:
- the common and obvious case is the page fault handling. It's actually
fairly simple and straightforward, except for the fact that we have
something like 24 different versions of it, and you end up in a maze
of twisty little passages, all alike.
- the simplest case is the execve() code that creates a new stack.
There are no real locking concerns because it's all in a private new
VM that hasn't been exposed to anybody, but lockdep still can end up
unhappy if you get it wrong.
- and finally, we have GUP and page pinning, which shouldn't really be
expanding the stack in the first place, but in addition to execve()
we also use it for ptrace(). And debuggers do want to possibly access
memory under the stack pointer and thus need to be able to expand the
stack as a special case.
None of these cases are exactly complicated, but the page fault case in
particular is just repeated slightly differently many many times. And
ia64 in particular has a fairly complicated situation where you can have
both a regular grow-down stack _and_ a special grow-up stack for the
register backing store.
So to make this slightly more manageable, the bulk of this series is to
first create a helper function for the most common page fault case, and
convert all the straightforward architectures to it.
Thus the new 'lock_mm_and_find_vma()' helper function, which ends up
being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon,
loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more
than half the architectures, we now have more shared code and avoid some
of those twisty little passages.
And largely due to this common helper function, the full diffstat of
this series ends up deleting more lines than it adds.
That still leaves eight architectures (ia64, m68k, microblaze, openrisc,
parisc, s390, sparc64 and um) that end up doing 'expand_stack()'
manually because they are doing something slightly different from the
normal pattern. Along with the couple of special cases in execve() and
GUP.
So there's a couple of patches that first create 'locked' helper
versions of the stack expansion functions, so that there's a obvious
path forward in the conversion. The execve() case is then actually
pretty simple, and is a nice cleanup from our old "grow-up stackls are
special, because at execve time even they grow down".
The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because
it's just more straightforward to write out the stack expansion there
manually, instead od having get_user_pages_remote() do it for us in some
situations but not others and have to worry about locking rules for GUP.
And the final step is then to just convert the remaining odd cases to a
new world order where 'expand_stack()' is called with the mmap_lock held
for reading, but where it might drop it and upgrade it to a write, only
to return with it held for reading (in the success case) or with it
completely dropped (in the failure case).
In the process, we remove all the stack expansion from GUP (where
dropping the lock wouldn't be ok without special rules anyway), and add
it in manually to __access_remote_vm() for ptrace().
Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases.
Everything else here felt pretty straightforward, but the ia64 rules for
stack expansion are really quite odd and very different from everything
else. Also thanks to Vegard Nossum who caught me getting one of those
odd conditions entirely the wrong way around.
Anyway, I think I want to actually move all the stack expansion code to
a whole new file of its own, rather than have it split up between
mm/mmap.c and mm/memory.c, but since this will have to be backported to
the initial maple tree vma introduction anyway, I tried to keep the
patches _fairly_ minimal.
Also, while I don't think it's valid to expand the stack from GUP, the
final patch in here is a "warn if some crazy GUP user wants to try to
expand the stack" patch. That one will be reverted before the final
release, but it's left to catch any odd cases during the merge window
and release candidates.
Reported-by: Ruihan Li <lrh2000@pku.edu.cn>
* branch 'expand-stack':
gup: add warning if some caller would seem to want stack expansion
mm: always expand the stack with the mmap write lock held
execve: expand new process stack manually ahead of time
mm: make find_extend_vma() fail if write lock not held
powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
arm/mm: Convert to using lock_mm_and_find_vma()
riscv/mm: Convert to using lock_mm_and_find_vma()
mips/mm: Convert to using lock_mm_and_find_vma()
powerpc/mm: Convert to using lock_mm_and_find_vma()
arm64/mm: Convert to using lock_mm_and_find_vma()
mm: make the page fault mmap locking killable
mm: introduce new 'lock_mm_and_find_vma()' page fault helper
Core
----
- Rework the sendpage & splice implementations. Instead of feeding
data into sockets page by page extend sendmsg handlers to support
taking a reference on the data, controlled by a new flag called
MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
to invoke an additional callback instead of trying to predict what
the right combination of MORE/NOTLAST flags is.
Remove the MSG_SENDPAGE_NOTLAST flag completely.
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
- Enable socket busy polling with CONFIG_RT.
- Improve reliability and efficiency of reporting for ref_tracker.
- Auto-generate a user space C library for various Netlink families.
Protocols
---------
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2].
- Use per-VMA locking for "page-flipping" TCP receive zerocopy.
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags.
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative.
- Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
- Avoid waking up applications using TLS sockets until we have
a full record.
- Allow using kernel memory for protocol ioctl callbacks, paving
the way to issuing ioctls over io_uring.
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address.
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch.
- PPPoE: make number of hash bits configurable.
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig).
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge).
- Support matching on Connectivity Fault Management (CFM) packets.
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug.
- HSR: don't enable promiscuous mode if device offloads the proto.
- Support active scanning in IEEE 802.15.4.
- Continue work on Multi-Link Operation for WiFi 7.
BPF
---
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used,
or in fact passing the verifier at all for complex programs,
especially those using open-coded iterators.
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what
the output buffer *should* be, without writing anything.
- Accept dynptr memory as memory arguments passed to helpers.
- Add routing table ID to bpf_fib_lookup BPF helper.
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only).
- Show target_{obj,btf}_id in tracing link fdinfo.
- Addition of several new kfuncs (most of the names are self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter
---------
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value.
- Increase ip_vs_conn_tab_bits range for 64BIT builds.
- Allow updating size of a set.
- Improve NAT tuple selection when connection is closing.
Driver API
----------
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out).
- Support configuring rate selection pins of SFP modules.
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines.
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer.
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio).
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message.
- Split devlink instance and devlink port operations.
New hardware / drivers
----------------------
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers
-------
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is
a variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split
the different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and
Enhanced MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
=9i4I
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Jakub Kicinski:
"WiFi 7 and sendpage changes are the biggest pieces of work for this
release. The latter will definitely require fixes but I think that we
got it to a reasonable point.
Core:
- Rework the sendpage & splice implementations
Instead of feeding data into sockets page by page extend sendmsg
handlers to support taking a reference on the data, controlled by a
new flag called MSG_SPLICE_PAGES
Rework the handling of unexpected-end-of-file to invoke an
additional callback instead of trying to predict what the right
combination of MORE/NOTLAST flags is
Remove the MSG_SENDPAGE_NOTLAST flag completely
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid
- Enable socket busy polling with CONFIG_RT
- Improve reliability and efficiency of reporting for ref_tracker
- Auto-generate a user space C library for various Netlink families
Protocols:
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2]
- Use per-VMA locking for "page-flipping" TCP receive zerocopy
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative
- Create a new MPTCP getsockopt to retrieve all info
(MPTCP_FULL_INFO)
- Avoid waking up applications using TLS sockets until we have a full
record
- Allow using kernel memory for protocol ioctl callbacks, paving the
way to issuing ioctls over io_uring
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch
- PPPoE: make number of hash bits configurable
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig)
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge)
- Support matching on Connectivity Fault Management (CFM) packets
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug
- HSR: don't enable promiscuous mode if device offloads the proto
- Support active scanning in IEEE 802.15.4
- Continue work on Multi-Link Operation for WiFi 7
BPF:
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used, or
in fact passing the verifier at all for complex programs,
especially those using open-coded iterators
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what the
output buffer *should* be, without writing anything
- Accept dynptr memory as memory arguments passed to helpers
- Add routing table ID to bpf_fib_lookup BPF helper
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only)
- Show target_{obj,btf}_id in tracing link fdinfo
- Addition of several new kfuncs (most of the names are
self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter:
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value
- Increase ip_vs_conn_tab_bits range for 64BIT builds
- Allow updating size of a set
- Improve NAT tuple selection when connection is closing
Driver API:
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out)
- Support configuring rate selection pins of SFP modules
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio)
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message
- Split devlink instance and devlink port operations
New hardware / drivers:
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers:
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer
header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested
configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is a
variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split the
different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and Enhanced
MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips"
* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
net: scm: introduce and use scm_recv_unix helper
af_unix: Skip SCM_PIDFD if scm->pid is NULL.
net: lan743x: Simplify comparison
netlink: Add __sock_i_ino() for __netlink_diag_dump().
net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
Revert "af_unix: Call scm_recv() only after scm_set_cred()."
phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
libceph: Partially revert changes to support MSG_SPLICE_PAGES
net: phy: mscc: fix packet loss due to RGMII delays
net: mana: use vmalloc_array and vcalloc
net: enetc: use vmalloc_array and vcalloc
ionic: use vmalloc_array and vcalloc
pds_core: use vmalloc_array and vcalloc
gve: use vmalloc_array and vcalloc
octeon_ep: use vmalloc_array and vcalloc
net: usb: qmi_wwan: add u-blox 0x1312 composition
perf trace: fix MSG_SPLICE_PAGES build error
ipvlan: Fix return value of ipvlan_queue_xmit()
netfilter: nf_tables: fix underflow in chain reference counter
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
...
The changes queued up for v6.5-rc1 for sysctl are in line with
prior efforts to stop usage of deprecated routines which incur
recursion and also make it hard to remove the empty array element
in each sysctl array declaration. The most difficult user to modify
was parport which required a bit of re-thinking of how to declare shared
sysctls there, Joel Granados has stepped up to the plate to do most of
this work and eventual removal of register_sysctl_table(). That work
ended up saving us about 1465 bytes according to bloat-o-meter. Since
we gained a few bloat-o-meter karma points I moved two rather small
sysctl arrays from kernel/sysctl.c leaving us only two more sysctl
arrays to move left.
Most changes have been tested on linux-next for about a month. The last
straggler patches are a minor parport fix, changes to the sysctl
kernel selftest so to verify correctness and prevent regressions for
the future change he made to provide an alternative solution for the
special sysctl mount point target which was using the now deprecated
sysctl child element.
This is all prep work to now finally be able to remove the empty
array element in all sysctl declarations / registrations which is
expected to save us a bit of bytes all over the kernel. That work
will be tested early after v6.5-rc1 is out.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmSceh0SHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinBFQQAK9WdpcU8ODoDzoSls4jsCQpZUCfZ+ED
pbCgQqUqu9VPs6bnJ+aXVa6Fh3uCr6+TIfNFM55qI/Sbo2issZ7bm0nvKmGgc6/m
giqDP7btvHqiAsEootci8DVdbBXKkdH4dx3pSwleyN8pdinewH0hrKImaPpahyo6
1mB1du0iI89yjsZmheHVVSyfXXYAnP0PqRVy5Y+qxY7yYlIegQ5uAZmwRE62lfTf
TuiV7OFuDZ2DBYOmqIhfGKGRnfOL5ZVF3iHCrfUpX3p+fEFzDmwvm3vr73PTSrFw
/aRRLa/hOWr5ilw1bvnMcazgQzFEOlQb3DMhBKH7gLl3XHVrM+TaaqYHjUia1+6Y
e2axz/duA2q9uLMW81daRApvHMCgy0exkpC7prfOxF5bgTe4TjA7ZWvGpqG1kPKT
PPSxw80XvG5hLZm4tB0ZWJ5rOfFpiUGGneSeRQwyuClBt73SIO+F03jyGpt83slU
jFE50ac14Zwh1oxpCQtYoR1+bXWdq1QwM5vQBNEuaoTSnJfVjrXqBz/BnqJChtjr
m1vA27+4/dfki2P3gVWF1lGx43ir3uJvqk+BjWXm2CDDJqpRi3N0qcUwZwLuqAAz
/LEgFqK61bpHi/C8c2NWAxIoeWRU4NUOaoiKmZwyt0sKAWU1Yzg70xssYeg7VYqZ
3pvFNVBqkV+F
=sXUU
-----END PGP SIGNATURE-----
Merge tag 'v6.5-rc1-sysctl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"The changes for sysctl are in line with prior efforts to stop usage of
deprecated routines which incur recursion and also make it hard to
remove the empty array element in each sysctl array declaration.
The most difficult user to modify was parport which required a bit of
re-thinking of how to declare shared sysctls there, Joel Granados has
stepped up to the plate to do most of this work and eventual removal
of register_sysctl_table(). That work ended up saving us about 1465
bytes according to bloat-o-meter. Since we gained a few bloat-o-meter
karma points I moved two rather small sysctl arrays from
kernel/sysctl.c leaving us only two more sysctl arrays to move left.
Most changes have been tested on linux-next for about a month. The
last straggler patches are a minor parport fix, changes to the sysctl
kernel selftest so to verify correctness and prevent regressions for
the future change he made to provide an alternative solution for the
special sysctl mount point target which was using the now deprecated
sysctl child element.
This is all prep work to now finally be able to remove the empty array
element in all sysctl declarations / registrations which is expected
to save us a bit of bytes all over the kernel. That work will be
tested early after v6.5-rc1 is out"
* tag 'v6.5-rc1-sysctl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: replace child with an enumeration
sysctl: Remove debugging dump_stack
test_sysclt: Test for registering a mount point
test_sysctl: Add an option to prevent test skip
test_sysctl: Add an unregister sysctl test
test_sysctl: Group node sysctl test under one func
test_sysctl: Fix test metadata getters
parport: plug a sysctl register leak
sysctl: move security keys sysctl registration to its own file
sysctl: move umh sysctl registration to its own file
signal: move show_unhandled_signals sysctl to its own file
sysctl: remove empty dev table
sysctl: Remove register_sysctl_table
sysctl: Refactor base paths registrations
sysctl: stop exporting register_sysctl_table
parport: Removed sysctl related defines
parport: Remove register_sysctl_table from parport_default_proc_register
parport: Remove register_sysctl_table from parport_device_proc_register
parport: Remove register_sysctl_table from parport_proc_register
parport: Move magic number "15" to a define
Some servers may return error codes from REQ_GET_DFS_REFERRAL requests
that are unexpected by the client, so to make it easier, assume
non-DFS mounts when the client can't get the initial DFS referral of
@ctx->UNC in dfs_mount_share().
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When having two DFS root mounts that are connected to same namespace,
same mount options but different prefix paths, we can't really use the
shared @server->origin_fullpath when chasing DFS links in them.
Move the origin_fullpath field to cifs_tcon structure so when having
shared DFS root mounts with different prefix paths, and we need to
chase any DFS links, dfs_get_automount_devname() will pick up the
correct full path out of the @tcon that will be used for the new
mount.
Before patch
mount.cifs //dom/dfs/dir /mnt/1 -o ...
mount.cifs //dom/dfs /mnt/2 -o ...
# shared server, ses, tcon
# server: origin_fullpath=//dom/dfs/dir
# @server->origin_fullpath + '/dir/link1'
$ ls /mnt/2/dir/link1
ls: cannot open directory '/mnt/2/dir/link1': No such file or directory
After patch
mount.cifs //dom/dfs/dir /mnt/1 -o ...
mount.cifs //dom/dfs /mnt/2 -o ...
# shared server & ses
# tcon_1: origin_fullpath=//dom/dfs/dir
# tcon_2: origin_fullpath=//dom/dfs
# @tcon_2->origin_fullpath + '/dir/link1'
$ ls /mnt/2/dir/link1
dir0 dir1 dir10 dir3 dir5 dir6 dir7 dir9 target2_file.txt tsub
Fixes: 8e3554150d ("cifs: fix sharing of DFS connections")
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
top-level directories.
- Douglas Anderson has added a new "buddy" mode to the hardlockup
detector. It permits the detector to work on architectures which
cannot provide the required interrupts, by having CPUs periodically
perform checks on other CPUs.
- Zhen Lei has enhanced kexec's ability to support two crash regions.
- Petr Mladek has done a lot of cleanup on the hard lockup detector's
Kconfig entries.
- And the usual bunch of singleton patches in various places.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJelTAAKCRDdBJ7gKXxA
juDkAP0VXWynzkXoojdS/8e/hhi+htedmQ3v2dLZD+vBrctLhAEA7rcH58zAVoWa
2ejqO6wDrRGUC7JQcO9VEjT0nv73UwU=
=F293
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-mm updates from Andrew Morton:
- Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in top-level
directories
- Douglas Anderson has added a new "buddy" mode to the hardlockup
detector. It permits the detector to work on architectures which
cannot provide the required interrupts, by having CPUs periodically
perform checks on other CPUs
- Zhen Lei has enhanced kexec's ability to support two crash regions
- Petr Mladek has done a lot of cleanup on the hard lockup detector's
Kconfig entries
- And the usual bunch of singleton patches in various places
* tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
kernel/time/posix-stubs.c: remove duplicated include
ocfs2: remove redundant assignment to variable bit_off
watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY
powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h
devres: show which resource was invalid in __devm_ioremap_resource()
watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH
watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64
watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific
watchdog/hardlockup: declare arch_touch_nmi_watchdog() only in linux/nmi.h
watchdog/hardlockup: make the config checks more straightforward
watchdog/hardlockup: sort hardlockup detector related config values a logical way
watchdog/hardlockup: move SMP barriers from common code to buddy code
watchdog/buddy: simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY
watchdog/buddy: don't copy the cpumask in watchdog_next_cpu()
watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called
watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog()
watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy()
watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick()
watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe()
watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails
...
- Yosry has also eliminated cgroup's atomic rstat flushing.
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability.
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning.
- Lorenzo Stoakes has done some maintanance work on the get_user_pages()
interface.
- Liam Howlett continues with cleanups and maintenance work to the maple
tree code. Peng Zhang also does some work on maple tree.
- Johannes Weiner has done some cleanup work on the compaction code.
- David Hildenbrand has contributed additional selftests for
get_user_pages().
- Thomas Gleixner has contributed some maintenance and optimization work
for the vmalloc code.
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code.
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting.
- Christoph Hellwig has some cleanups for the filemap/directio code.
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the provided
APIs rather than open-coding accesses.
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings.
- John Hubbard has a series of fixes to the MM selftesting code.
- ZhangPeng continues the folio conversion campaign.
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock.
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
128 to 8.
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management.
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code.
- Vishal Moola also has done some folio conversion work.
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
=B7yQ
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
Handle trailing and leading separators when parsing UNC and prefix
paths in smb3_parse_devname(). Then, store the sanitised paths in
smb3_fs_context::source.
This fixes the following cases
$ mount //srv/share// /mnt/1 -o ...
$ cat /mnt/1/d0/f0
cat: /mnt/1/d0/f0: Invalid argument
The -EINVAL was returned because the client sent SMB2_CREATE "\\d0\f0"
rather than SMB2_CREATE "\d0\f0".
$ mount //srv//share /mnt/1 -o ...
mount: Invalid argument
The -EINVAL was returned correctly although the client only realised
it after sending a couple of bad requests rather than bailing out
earlier when parsing mount options.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Having the ClientGUID info makes it easier to debug
issues related to a client on a server that serves a
number of clients.
This change prints the ClientGUID in DebugData.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Chech the session state and skip it if it's exiting.
Signed-off-by: Winston Wen <wentao@uniontech.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Don't collect exiting session in smb2_reconnect_server(), because it
will be released soon.
Note that the exiting session will stay in server->smb_ses_list until
it complete the cifs_free_ipc() and logoff() and then delete itself
from the list.
Signed-off-by: Winston Wen <wentao@uniontech.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
All the server credits and in-flight info is protected by req_lock.
Once the req_lock is held, and we've determined that we have enough
credits to continue, this lock cannot be dropped till we've made the
changes to credits and in-flight count.
However, we used to drop the lock in order to avoid deadlock with
the recent srv_lock. This could cause the checks already made to be
invalidated.
Fixed it by moving the server status check to before locking req_lock.
Fixes: d7d7a66aac ("cifs: avoid use of global locks for high contention data")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In smb2_compound_op we have a possible use-after-free
which can cause hard to debug problems later on.
This was revealed during stress testing with KASAN enabled
kernel. Fixing it by moving the cfile free call to
a few lines below, after the usage.
Fixes: 76894f3e2f ("cifs: improve symlink handling for smb2+")
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
- Convert strreplace() to return string start (Andy Shevchenko)
- Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
- Add missing function prototypes seen with W=1 (Arnd Bergmann)
- Fix strscpy() kerndoc typo (Arne Welzel)
- Replace strlcpy() with strscpy() across many subsystems which were
either Acked by respective maintainers or were trivial changes that
went ignored for multiple weeks (Azeem Shaikh)
- Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
- Add KUnit tests for strcat()-family
- Enable KUnit tests of FORTIFY wrappers under UML
- Add more complete FORTIFY protections for strlcat()
- Add missed disabling of FORTIFY for all arch purgatories.
- Enable -fstrict-flex-arrays=3 globally
- Tightening UBSAN_BOUNDS when using GCC
- Improve checkpatch to check for strcpy, strncpy, and fake flex arrays
- Improve use of const variables in FORTIFY
- Add requested struct_size_t() helper for types not pointers
- Add __counted_by macro for annotating flexible array size members
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbftQWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj0MD/9X9jzJzCmsAU+yNldeoAzC84Sk
GVU3RBxGcTNysL1gZXynkIgigw7DWc4htMGeSABHHwQRVP65JCH1Kw/VqIkyumbx
9LdX6IklMJb4pRT4PVU3azebV4eNmSjlur2UxMeW54Czm91/6I8RHbJOyAPnOUmo
2oomGdP/hpEHtKR7hgy8Axc6w5ySwQixh2V5sVZG3VbvCS5WKTmTXbs6puuRT5hz
iHt7v+7VtEg/Qf1W7J2oxfoghvVBsaRrSLrExWT/oZYh1ZxM7DsCAAoG/IsDgHGA
9LBXiRECgAFThbHVxLvvKZQMXdVk0i8iXLX43XMKC0wTA+NTyH7wlcQQ4RWNMuo8
sfA9Qm9gMArXaf64aymr3Uwn20Zan0391HdlbhOJZAE6v3PPJbleUnM58AzD2d3r
5Lz6AIFBxDImy+3f9iDWgacCT5/PkeiXTHzk9QnKhJyKKtRA58XJxj4q2+rPnGJP
n4haXqoxD5FJbxdXiGKk31RS0U5HBug7wkOcUrTqDHUbc/QNU2b7dxTKUx+zYtCU
uV5emPzpF4H4z+91WpO47n9gkMAfwV0lt9S2dwS8pxsgqctbmIan+Jgip7rsqZ2G
OgLXBsb43eEs+6WgO8tVt/ZHYj9ivGMdrcNcsIfikzNs/xweUJ53k2xSEn2xEa5J
cwANDmkL6QQK7yfeeg==
=s0j1
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"There are three areas of note:
A bunch of strlcpy()->strscpy() conversions ended up living in my tree
since they were either Acked by maintainers for me to carry, or got
ignored for multiple weeks (and were trivial changes).
The compiler option '-fstrict-flex-arrays=3' has been enabled
globally, and has been in -next for the entire devel cycle. This
changes compiler diagnostics (though mainly just -Warray-bounds which
is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_
coverage. In other words, there are no new restrictions, just
potentially new warnings. Any new FORTIFY warnings we've seen have
been fixed (usually in their respective subsystem trees). For more
details, see commit df8fc4e934.
The under-development compiler attribute __counted_by has been added
so that we can start annotating flexible array members with their
associated structure member that tracks the count of flexible array
elements at run-time. It is possible (likely?) that the exact syntax
of the attribute will change before it is finalized, but GCC and Clang
are working together to sort it out. Any changes can be made to the
macro while we continue to add annotations.
As an example of that last case, I have a treewide commit waiting with
such annotations found via Coccinelle:
https://git.kernel.org/linus/adc5b3cb48a049563dc673f348eab7b6beba8a9b
Also see commit dd06e72e68 for more details.
Summary:
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
- Convert strreplace() to return string start (Andy Shevchenko)
- Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
- Add missing function prototypes seen with W=1 (Arnd Bergmann)
- Fix strscpy() kerndoc typo (Arne Welzel)
- Replace strlcpy() with strscpy() across many subsystems which were
either Acked by respective maintainers or were trivial changes that
went ignored for multiple weeks (Azeem Shaikh)
- Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
- Add KUnit tests for strcat()-family
- Enable KUnit tests of FORTIFY wrappers under UML
- Add more complete FORTIFY protections for strlcat()
- Add missed disabling of FORTIFY for all arch purgatories.
- Enable -fstrict-flex-arrays=3 globally
- Tightening UBSAN_BOUNDS when using GCC
- Improve checkpatch to check for strcpy, strncpy, and fake flex
arrays
- Improve use of const variables in FORTIFY
- Add requested struct_size_t() helper for types not pointers
- Add __counted_by macro for annotating flexible array size members"
* tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (54 commits)
netfilter: ipset: Replace strlcpy with strscpy
uml: Replace strlcpy with strscpy
um: Use HOST_DIR for mrproper
kallsyms: Replace all non-returning strlcpy with strscpy
sh: Replace all non-returning strlcpy with strscpy
of/flattree: Replace all non-returning strlcpy with strscpy
sparc64: Replace all non-returning strlcpy with strscpy
Hexagon: Replace all non-returning strlcpy with strscpy
kobject: Use return value of strreplace()
lib/string_helpers: Change returned value of the strreplace()
jbd2: Avoid printing outside the boundary of the buffer
checkpatch: Check for 0-length and 1-element arrays
riscv/purgatory: Do not use fortified string functions
s390/purgatory: Do not use fortified string functions
x86/purgatory: Do not use fortified string functions
acpi: Replace struct acpi_table_slit 1-element array with flex-array
clocksource: Replace all non-returning strlcpy with strscpy
string: use __builtin_memcpy() in strlcpy/strlcat
staging: most: Replace all non-returning strlcpy with strscpy
drm/i2c: tda998x: Replace all non-returning strlcpy with strscpy
...
- Fix a few comments for correctness and typos (Baruch Siach)
- Small simplifications for binfmt (Christophe JAILLET)
- Set p_align to 4 for PT_NOTE in core dump (Fangrui Song)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbc94WHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkWaEACb5idNLPHL5lb7zr1HGXWVnpo9
3a4gwNpZpDpODRSL5VQ93WUwzN+74hN15h85orXbjQRZRXJVM4gNmgHargqs4LIb
LrbB04ylCoAnz+MYPAPkD0Yw4WZXqwK33V11V8luUf9q+zFemjQ0rOzFDmrX5a+y
qxq8CvY/jETmi5GPzKn35w8gJs8X++8GPIQ0ZDeQzYGZsZZ4m0+f1tqsC9bOmn/F
LRn0ePaSTrYPQILQ2xVjlCv9HHk8MUsJu7+eyOI0NemWlITookofkycjMDe+9LpS
hRKhdNbni24BEN2eDdJNC5TeXlvAOEQv4n4GFZ37sUQGdDhjdQTYF9sySluX5+Bo
fx0qCQyWJMoskYBbBTAERv9hBxZctU94k82XaUxZA1bx0f9h+0USqwL9YrOnF5p/
4734FUjeotGd8uFpUe/B6/SjRQ6WwYzwoEME/5Q/EKuotLrk3SfLu3fH/rl3cGH8
mlt4vwD0xDfFgb4Pj5wf8lqxpK+mMCusExqvzKVVV4L7Q5gDIntA2BvSRykjQ7VD
vwRaQDM9mnDD/t0FgDVVHl91UF5Fsctf81UvfoI5eIMKqFo3NRGCV8znFf8G8//i
8Qmjj4kpxN+zMOi4nKsTjNDFmauRmOrGaDrWZnyO3m/VlYrYodnCKJbdLsqWH4zx
a2O3oVICYLUeM0NEkQ==
=01FE
-----END PGP SIGNATURE-----
Merge tag 'execve-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- Fix a few comments for correctness and typos (Baruch Siach)
- Small simplifications for binfmt (Christophe JAILLET)
- Set p_align to 4 for PT_NOTE in core dump (Fangrui Song)
* tag 'execve-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_elf: fix comment typo s/reset/regset/
elf: correct note name comment
binfmt: Slightly simplify elf_fdpic_map_file()
binfmt: Use struct_size()
coredump, vmcore: Set p_align to 4 for PT_NOTE
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmSZuh0UHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNpuxAAxChGqME9nE7iITx1TaFRrbK49mDF
1RZh/5cwzde72lLLFkTFKB6ErMSQkrrtA+jFH7vKsrOslBel1+yO80vkXmhYCeZU
P3m0FeREUpuU4QV0tbQamPeR+SWohmKi2dYWd+VdpLA+1aTK3KNYsi2NFkDIreap
BqeRq4S0Rqc4u3/5juk6JCGFhTRWaH16YJQrzIKHF/K3DK+gMhAY5sjuAWzFc6ma
/5bbD55kdVVDfnsxNSe+lzJ7zEf7TYedLG6BN+R9cVrU+El12a38M29kASaAof5w
vpb92a27hA9Q5EyQ2O9QXnr2L5CShT4bvAZCGkK4cmZerGNTdM0iojhYj1s7FAV/
USkWgkDmEuSatp0+DdXlfQyUmZZWlw1W0oiEfZwR8w7TY7q9CU7aD8K7+GDSIazB
g89nYznVjlaC/oA4/owMraoWP3eiDiAcsQdO052Vv63TVyJtTiRiKyBq5EFLrX8L
iaUCa4cBaYFc94kN1PZeNXZKwqRc2F6oAFT1YuXnFWBGmixN0kUL023C0xjl/J7P
02jYYSVzLm22aU39GU0DSnaLfAwl3muazOB3XuyGOhUWHFYzjkc9UhmGp0W50DkK
qigW3ONA8s8CKUS/q7QSGq+Vf+CVZA5f+daDDPGYstPfCTk61eu0wjwfwek3W0o+
xKzBr2Od3vTOzAs=
=3nWy
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20230626' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- A SafeSetID patch to correct what appears to be a cut-n-paste typo in
the code causing a UID to be printed where a GID was desired.
This is coming via the LSM tree because we haven't been able to get a
response from the SafeSetID maintainer (Micah Morton) in several
months. Hopefully we are able to get in touch with Micah, but until
we do I'm going to pick them up in the LSM tree.
- A small fix to the reiserfs LSM xattr code.
We're continuing to work through some issues with the reiserfs code
as we try to fixup the LSM xattr handling, but in the process we're
uncovering some ugly problems in reiserfs and we may just end up
removing the LSM xattr support in reiserfs prior to reiserfs'
removal.
For better or worse, this shouldn't impact any of the reiserfs users,
as we discovered that LSM xattrs on reiserfs were completely broken,
meaning no one is currently using the combo of reiserfs and a file
labeling LSM.
- A tweak to how the cap_user_data_t struct/typedef is declared in the
header file to appease the Sparse gods.
- In the process of trying to sort out the SafeSetID lost-maintainer
problem I realized that I needed to update the labeled networking
entry to "Supported".
- Minor comment/documentation and spelling fixes.
* tag 'lsm-pr-20230626' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
device_cgroup: Fix kernel-doc warnings in device_cgroup
SafeSetID: fix UID printed instead of GID
MAINTAINERS: move labeled networking to "supported"
capability: erase checker warnings about struct __user_cap_data_struct
lsm: fix a number of misspellings
reiserfs: Initialize sec->length in reiserfs_security_init().
capability: fix kernel-doc warnings in capability.c
-----BEGIN PGP SIGNATURE-----
iIYEABYIAC4WIQSVyBthFV4iTW/VU1/l49DojIL20gUCZJlO/xAcbWljQGRpZ2lr
b2QubmV0AAoJEOXj0OiMgvbS/FwA/A5GFGtKnLzFpIAXgc1G3kr8c+J/WF7RVUD+
PJNEsH6PAP41l3BTqVSeVZ+tdLSC7NbesoX0MTd+rCgAWnr+Pko2Cw==
=KqpG
-----END PGP SIGNATURE-----
Merge tag 'landlock-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux
Pull landlock updates from Mickaël Salaün:
"Add support for Landlock to UML.
To do this, this fixes the way hostfs manages inodes according to the
underlying filesystem [1]. They are now properly handled as for other
filesystems, which enables Landlock support (and probably other
features).
This also extends Landlock's tests with 6 pseudo filesystems,
including hostfs"
[1] https://lore.kernel.org/all/20230612191430.339153-1-mic@digikod.net/
* tag 'landlock-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
selftests/landlock: Add hostfs tests
selftests/landlock: Add tests for pseudo filesystems
selftests/landlock: Make mounts configurable
selftests/landlock: Add supports_filesystem() helper
selftests/landlock: Don't create useless file layouts
hostfs: Fix ephemeral inodes
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.
It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In nfsd4_encode_fattr(), TIME_CREATE was being written out after all
other times. However, they should be written out in an order that
matches the bit flags in bmval1, which in this case are
#define FATTR4_WORD1_TIME_ACCESS (1UL << 15)
#define FATTR4_WORD1_TIME_CREATE (1UL << 18)
#define FATTR4_WORD1_TIME_DELTA (1UL << 19)
#define FATTR4_WORD1_TIME_METADATA (1UL << 20)
#define FATTR4_WORD1_TIME_MODIFY (1UL << 21)
so TIME_CREATE should come second.
I noticed this on a FreeBSD NFSv4.2 client, which supports creation
times. On this client, file times were weirdly permuted. With this
patch applied on the server, times looked normal on the client.
Fixes: e377a3e698 ("nfsd: Add support for the birth time attribute")
Link: https://unix.stackexchange.com/q/749605/56202
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This was noticed by a user who noticied that the mtime of a file
backing a loopback device was getting bumped when the loopback device
is mounted read/only. Note: This doesn't show up when doing a
loopback mount of a file directly, via "mount -o ro /tmp/foo.img
/mnt", since the loop device is set read-only when mount automatically
creates loop device. However, this is noticeable for a LUKS loop
device like this:
% cryptsetup luksOpen /tmp/foo.img test
% mount -o ro /dev/loop0 /mnt ; umount /mnt
or, if LUKS is not in use, if the user manually creates the loop
device like this:
% losetup /dev/loop0 /tmp/foo.img
% mount -o ro /dev/loop0 /mnt ; umount /mnt
The modified mtime causes rsync to do a rolling checksum scan of the
file on the local and remote side, incrementally increasing the time
to rsync the not-modified-but-touched image file.
Fixes: eee00237fa ("ext4: commit super block if fs record error when journal record without error")
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/ZIauBR7YiV3rVAHL@glitch
Reported-by: Sean Greenslade <sean@seangreenslade.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We got a NULL pointer dereference issue below while running generic/475
I/O failure pressure test.
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 1 PID: 15600 Comm: fsstress Not tainted 6.4.0-rc5-xfstests-00055-gd3ab1bca26b4 #190
RIP: 0010:jbd2_journal_set_features+0x13d/0x430
...
Call Trace:
<TASK>
? __die+0x23/0x60
? page_fault_oops+0xa4/0x170
? exc_page_fault+0x67/0x170
? asm_exc_page_fault+0x26/0x30
? jbd2_journal_set_features+0x13d/0x430
jbd2_journal_revoke+0x47/0x1e0
__ext4_forget+0xc3/0x1b0
ext4_free_blocks+0x214/0x2f0
ext4_free_branches+0xeb/0x270
ext4_ind_truncate+0x2bf/0x320
ext4_truncate+0x1e4/0x490
ext4_handle_inode_extension+0x1bd/0x2a0
? iomap_dio_complete+0xaf/0x1d0
The root cause is the journal super block had been failed to write out
due to I/O fault injection, it's uptodate bit was cleared by
end_buffer_write_sync() and didn't reset yet in jbd2_write_superblock().
And it raced by journal_get_superblock()->bh_read(), unfortunately, the
read IO is also failed, so the error handling in
journal_fail_superblock() unexpectedly clear the journal->j_sb_buffer,
finally lead to above NULL pointer dereference issue.
If the journal super block had been read and verified, there is no need
to call bh_read() read it again even if it has been failed to written
out. So the fix could be simply move buffer_verified(bh) in front of
bh_read(). Also remove a stale comment left in
jbd2_journal_check_used_features().
Fixes: 51bacdba23d8 ("jbd2: factor out journal initialization from journal_get_superblock()")
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616015547.3155195-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Rename ext4_quota_off_umount() to ext4_quotas_off(), and add type
parameter to replace open code in ext4_enable_quotas().
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230327141630.156875-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Always enable 'JBD2_CYCLE_RECORD' journal option on ext4, letting the
jbd2 continue to record new journal transactions from the recovered
journal head or the checkpointed transactions in the previous mount.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230322013353.1843306-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For a newly mounted file system, the journal committing thread always
record new transactions from the start of the journal area, no matter
whether the journal was clean or just has been recovered. So the logdump
code in debugfs cannot dump continuous logs between each mount, it is
disadvantageous to analysis corrupted file system image and locate the
file system inconsistency bugs.
If we get a corrupted file system in the running products and want to
find out what has happened, besides lookup the system log, one effective
way is to backtrack the journal log. But we may not always run e2fsck
before each mount and the default fsck -a mode also cannot always
checkout all inconsistencies, so it could left over some inconsistencies
into the next mount until we detect it. Finally, transactions in the
journal may probably discontinuous and some relatively new transactions
has been covered, it becomes hard to analyse. If we could record
transactions continuously between each mount, we could acquire more
useful info from the journal. Like this:
|Previous mount checkpointed/recovered logs|Current mount logs |
|{------}{---}{--------} ... {------}| ... |{======}{========}...000000|
And yes the journal area is limited and cannot record everything, the
problematic transaction may also be covered even if we do this, but
this is still useful for fuzzy tests and short-running products.
This patch save the head blocknr in the superblock after flushing the
journal or unmounting the file system, let the next mount could continue
to record new transaction behind it. This change is backward compatible
because the old kernel does not care about the head blocknr of the
journal. It is also fine if we mount a clean old image without valid
head blocknr, we fail back to set it to s_first just like before.
Finally, for the case of mount an unclean file system, we could also get
the journal head easily after scanning/replaying the journal, it will
continue to record new transaction after the recovered transactions.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230322013353.1843306-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
journal->j_format_version is no longer used, remove it.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-7-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Current journal_get_superblock() couple journal superblock checking and
partial journal initialization, factor out initialization part from it
to make things clear.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-6-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We should only check and set extented features if journal format version
is 2, and now we check the in memory copy of the superblock
'journal->j_format_version', which relys on the parameter initialization
sequence, switch to use the h_blocktype in superblock cloud be more
clear.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-5-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As discussed in [1], 'sbi->s_journal_bdev != sb->s_bdev' will always
become true if sbi->s_journal_bdev exists. Filesystem block device and
journal block device are both opened with 'FMODE_EXCL' mode, so these
two devices can't be same one. Then we can remove the redundant checking
'sbi->s_journal_bdev != sb->s_bdev' if 'sbi->s_journal_bdev' exists.
[1] https://lore.kernel.org/lkml/f86584f6-3877-ff18-47a1-2efaa12d18b2@huawei.com/
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-3-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Following process makes ext4 load stale buffer heads from last failed
mounting in a new mounting operation:
mount_bdev
ext4_fill_super
| ext4_load_and_init_journal
| ext4_load_journal
| jbd2_journal_load
| load_superblock
| journal_get_superblock
| set_buffer_verified(bh) // buffer head is verified
| jbd2_journal_recover // failed caused by EIO
| goto failed_mount3a // skip 'sb->s_root' initialization
deactivate_locked_super
kill_block_super
generic_shutdown_super
if (sb->s_root)
// false, skip ext4_put_super->invalidate_bdev->
// invalidate_mapping_pages->mapping_evict_folio->
// filemap_release_folio->try_to_free_buffers, which
// cannot drop buffer head.
blkdev_put
blkdev_put_whole
if (atomic_dec_and_test(&bdev->bd_openers))
// false, systemd-udev happens to open the device. Then
// blkdev_flush_mapping->kill_bdev->truncate_inode_pages->
// truncate_inode_folio->truncate_cleanup_folio->
// folio_invalidate->block_invalidate_folio->
// filemap_release_folio->try_to_free_buffers will be skipped,
// dropping buffer head is missed again.
Second mount:
ext4_fill_super
ext4_load_and_init_journal
ext4_load_journal
ext4_get_journal
jbd2_journal_init_inode
journal_init_common
bh = getblk_unmovable
bh = __find_get_block // Found stale bh in last failed mounting
journal->j_sb_buffer = bh
jbd2_journal_load
load_superblock
journal_get_superblock
if (buffer_verified(bh))
// true, skip journal->j_format_version = 2, value is 0
jbd2_journal_recover
do_one_pass
next_log_block += count_tags(journal, bh)
// According to journal_tag_bytes(), 'tag_bytes' calculating is
// affected by jbd2_has_feature_csum3(), jbd2_has_feature_csum3()
// returns false because 'j->j_format_version >= 2' is not true,
// then we get wrong next_log_block. The do_one_pass may exit
// early whenoccuring non JBD2_MAGIC_NUMBER in 'next_log_block'.
The filesystem is corrupted here, journal is partially replayed, and
new journal sequence number actually is already used by last mounting.
The invalidate_bdev() can drop all buffer heads even racing with bare
reading block device(eg. systemd-udev), so we can fix it by invalidating
bdev in error handling path in __ext4_fill_super().
Fetch a reproducer in [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217171
Fixes: 25ed6e8a54 ("jbd2: enable journal clients to enable v2 checksumming")
Cc: stable@vger.kernel.org # v3.5
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-2-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We've had reports of significant performance regression of sub-block
(unaligned) direct writes due to the added exclusivity restrictions
in ext4. The purpose of the exclusivity requirement for unaligned
direct writes is to avoid data corruption caused by unserialized
partial block zeroing in the iomap dio layer across overlapping
writes.
XFS has similar requirements for the same underlying reasons, yet
doesn't suffer the extreme performance regression that ext4 does.
The reason for this is that XFS utilizes IOMAP_DIO_OVERWRITE_ONLY
mode, which allows for optimistic submission of concurrent unaligned
I/O and kicks back writes that require partial block zeroing such
that they can be submitted in a safe, exclusive context. Since ext4
already performs most of these checks pre-submission, it can support
something similar without necessarily relying on the iomap flag and
associated retry mechanism.
Update the dio write submission path to allow concurrent submission
of unaligned direct writes that are purely overwrite and so will not
require block zeroing. To improve readability of the various related
checks, move the unaligned I/O handling down into
ext4_dio_write_checks(), where the dio draining and force wait logic
can immediately follow the locking requirement checks. Finally, the
IOMAP_DIO_OVERWRITE_ONLY flag is set to enable a warning check as a
precaution should the ext4 overwrite logic ever become inconsistent
with the zeroing expectations of iomap dio.
The performance improvement of sub-block direct write I/O is shown
in the following fio test on a 64xcpu guest vm:
Test: fio --name=test --ioengine=libaio --direct=1 --group_reporting
--overwrite=1 --thread --size=10G --filename=/mnt/fio
--readwrite=write --ramp_time=10s --runtime=60s --numjobs=8
--blocksize=2k --iodepth=256 --allow_file_create=0
v6.2: write: IOPS=4328, BW=8724KiB/s
v6.2 (patched): write: IOPS=801k, BW=1565MiB/s
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230314130759.642710-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Line wrap and slightly clarify the comments describing mballoc's
cirtiera.
Define EXT4_MB_NUM_CRS as part of the enum, so that it will
automatically get updated when criteria is added or removed.
Also fix a potential unitialized use of 'cr' variable if
CONFIG_EXT4_DEBUG is enabled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After ext4_es_insert_extent() returns void, the return value in
ext4_zeroout_es() is also unnecessary, so make it return void too.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-13-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now ext4_es_insert_extent() never return error, so make it return void.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-12-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Similar to in ext4_es_insert_delayed_block(), we use preallocations that
do not fail to avoid inconsistencies, but we do not care about es that are
not must be kept, and we return 0 even if such es memory allocation fails.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Similar to in ext4_es_remove_extent(), we use a no-fail preallocation
to avoid inconsistencies, except that here we may have to preallocate
two extent_status.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If __es_remove_extent() returns an error it means that when splitting
extent, allocating an extent that must be kept failed, where returning
an error directly would cause the extent tree to be inconsistent. So we
use GFP_NOFAIL to pre-allocate an extent_status and pass it to
__es_remove_extent() to avoid this problem.
In addition, since the allocated memory is outside the i_es_lock, the
extent_status tree may change and the pre-allocated extent_status is
no longer needed, so we release the pre-allocated extent_status when
es->es_len is not initialized.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When splitting extent, if the second extent can not be dropped, we return
-ENOMEM and use GFP_NOFAIL to preallocate an extent_status outside of
i_es_lock and pass it to __es_remove_extent() to be used as the second
extent. This ensures that __es_remove_extent() is executed successfully,
thus ensuring consistency in the extent status tree. If the second extent
is not undroppable, we simply drop it and return 0. Then retry is no longer
necessary, remove it.
Now, __es_remove_extent() will always remove what it should, maybe more.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pass a extent_status pointer prealloc to __es_insert_extent(). If the
pointer is non-null, it is used directly when a new extent_status is
needed to avoid memory allocation failures.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Factor out __es_alloc_extent() and __es_free_extent(), which only allocate
and free extent_status in these two helpers.
The ext4_es_alloc_extent() function is split into __es_alloc_extent()
and ext4_es_init_extent(). In __es_alloc_extent() we allocate memory using
GFP_KERNEL | __GFP_NOFAIL | __GFP_ZERO if the memory allocation cannot
fail, otherwise we use GFP_ATOMIC. and the ext4_es_init_extent() is used to
initialize extent_status and update related variables after a successful
allocation.
This is to prepare for the use of pre-allocated extent_status later.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In the extent status tree, we have extents which we can just drop without
issues and extents we must not drop - this depends on the extent's status
- currently ext4_es_is_delayed() extents must stay, others may be dropped.
A helper function is added to help determine if the current extent can
be dropped, although only ext4_es_is_delayed() extents cannot be dropped
currently.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>