Pull misc vfs updates from Al Viro:
"Assorted stuff all over the place"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
useful constants: struct qstr for ".."
hostfs_open(): don't open-code file_dentry()
whack-a-mole: kill strlen_user() (again)
autofs: should_expire() argument is guaranteed to be positive
apparmor:match_mn() - constify devpath argument
buffer: a small optimization in grow_buffers
get rid of autofs_getpath()
constify dentry argument of dentry_path()/dentry_path_raw()
casefold, ensure that deleted file names are cleared in directory
blocks by zeroing directory entries when they are unlinked or moved as
part of a hash tree node split. We also improve the block allocator's
performance on a freshly mounted file system by prefetching block
bitmaps.
There are also the usual cleanups and bug fixes, including fixing a
page cache invalidation race when there is mixed buffered and direct
I/O and the block size is less than page size, and allow the dax flag
to be set and cleared on inline directories.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmCLei4ACgkQ8vlZVpUN
gaPZkgf/VH08xjMf3VthC+BpvVmChQXfV4yjigHbO2pmPyYWZhyJzkEGCQD8u2eB
b7ShW+B1NCifcTU34xAkKHwEtakzzEv3WIMrT1oZNWrpfo8tt850EkwQggaGGDpd
/HnP1/wLtziJ5hE6DwutmX7qB4VFghVj898MjDrEPSOBqItOjWps9mn/JWL7SHyI
Dqzhf5XZTYPaXWuJmSmKw3q8O70JDHnZe/rRWlfX1jLI5KDtqp71Nw1B+gszUB66
IUdncyZKvInsyjYhkbCQ8U6WFih82MrbKeuGYDp/RFvg5eMELEYkwT9j0ofuDHq8
zn62sAlbOXv1DiqkPDHKVm9GkHx8/g==
=UpnH
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New features for ext4 this cycle include support for encrypted
casefold, ensure that deleted file names are cleared in directory
blocks by zeroing directory entries when they are unlinked or moved as
part of a hash tree node split. We also improve the block allocator's
performance on a freshly mounted file system by prefetching block
bitmaps.
There are also the usual cleanups and bug fixes, including fixing a
page cache invalidation race when there is mixed buffered and direct
I/O and the block size is less than page size, and allow the dax flag
to be set and cleared on inline directories"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
ext4: wipe ext4_dir_entry2 upon file deletion
ext4: Fix occasional generic/418 failure
fs: fix reporting supported extra file attributes for statx()
ext4: allow the dax flag to be set and cleared on inline directories
ext4: fix debug format string warning
ext4: fix trailing whitespace
ext4: fix various seppling typos
ext4: fix error return code in ext4_fc_perform_commit()
ext4: annotate data race in jbd2_journal_dirty_metadata()
ext4: annotate data race in start_this_handle()
ext4: fix ext4_error_err save negative errno into superblock
ext4: fix error code in ext4_commit_super
ext4: always panic when errors=panic is specified
ext4: delete redundant uptodate check for buffer
ext4: do not set SB_ACTIVE in ext4_orphan_cleanup()
ext4: make prefetch_block_bitmaps default
ext4: add proc files to monitor new structures
ext4: improve cr 0 / cr 1 group scanning
ext4: add MB_NUM_ORDERS macro
ext4: add mballoc stats proc file
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCJUfIACgkQnJ2qBz9k
QNkStAf8CA7beya7LZ/GGN7HzXhv2cs+IpUFhRkynLklEM0lxKsOEagLFSZxkoMD
IBSRSo4odkkderqI9W/yp+9OYhOd9+BQCq4isg1Gh9Tf5xANJEpLvBAPnWVhooJs
9CrYZQY9Bdf+fF/8GHbKlrMAYm56vBCmWqyWTEtWUyPBOA12in2ZHQJmCa+5+nge
zTT/B5cvuhN5K7uYhGM4YfeCU5DBmmvD4sV6YBTkQOgCU0bEF0f9R3JjHDo34a1s
yqna3ypqKNRhsJVs8F+aOGRieUYxFoRqtYNHZK3qI9i07v7ndoTm5jzGN6OFlKs3
U3rF9/+cBgeESahWG6IjHIqhXGXNhg==
=KjNm
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
- support for limited fanotify functionality for unpriviledged users
- faster merging of fanotify events
- a few smaller fsnotify improvements
* tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
shmem: allow reporting fanotify events with file handles on tmpfs
fs: introduce a wrapper uuid_to_fsid()
fanotify_user: use upper_32_bits() to verify mask
fanotify: support limited functionality for unprivileged users
fanotify: configurable limits via sysfs
fanotify: limit number of event merge attempts
fsnotify: use hash table for faster events merge
fanotify: mix event info and pid into merge key hash
fanotify: reduce event objectid to 29-bit hash
fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCHPZwACgkQ+7dXa6fL
C2uJxw/9FVNssHxtA8iFDvZskE4YHiL6vMgOgKOeVmBfUvxqJcxWQXcF8ycbon5y
jGcDRV1DWTv395ckALHqmD6SlH/5q+OBt4cCOXCebOlzbC63JmjJ6xOjHntZKw3i
9c3GITNca5AsPXHXHGIcoRY4/4FntpLoVpyfYJ4ZZJCY7a7QUbgnEIIy9/Ps8Clw
BahhiKChl2JCgV3KZBk/ypkf0IBduxKgT+IUxA9o7H5UsLzvUgnfd5uMIALLPMI1
NXzUHBJoUtnWcB52nWPufJx9YwkMfSx70mutT0T74CFxbJakwRgAl2tWr5g989qM
/fQrsOhMlU3NaXYaRPelbxkuzvy3hU1xSe3GLiZcxmh4Cb/YAX0TrHRecO62NWff
pu/UWQS8Du5Gy8DrHScuo8baI1KFfyiV2lWQPfBO8kPaEB2ERw+PN6fWSh993Cn9
4UHaR3Oyn4qyVXeirNZg+frado+BEZAbNMZwn0lyi6jnLeyir6qABOdpQk34SB35
D4jfdPOBxeh3OVFkc+EBJ98i3/nal2+yXrNOqkP4OwmF0HqGt0YKKSaLNigXaDdO
3CKmQlBqBZsUdRYHJyJsofrifkKjP78zx2WyUJPms8MGX9z+9kYR3f1erifLesCT
Kb2TrAFx4ZgqS5tFh6UHnX4x0qy2RckgNrKTMpv38K8lNqplvLo=
=tZgy
-----END PGP SIGNATURE-----
Merge tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull network filesystem helper library updates from David Howells:
"Here's a set of patches for 5.13 to begin the process of overhauling
the local caching API for network filesystems. This set consists of
two parts:
(1) Add a helper library to handle the new VM readahead interface.
This is intended to be used unconditionally by the filesystem
(whether or not caching is enabled) and provides a common
framework for doing caching, transparent huge pages and, in the
future, possibly fscrypt and read bandwidth maximisation. It also
allows the netfs and the cache to align, expand and slice up a
read request from the VM in various ways; the netfs need only
provide a function to read a stretch of data to the pagecache and
the helper takes care of the rest.
(2) Add an alternative fscache/cachfiles I/O API that uses the kiocb
facility to do async DIO to transfer data to/from the netfs's
pages, rather than using readpage with wait queue snooping on one
side and vfs_write() on the other. It also uses less memory, since
it doesn't do buffered I/O on the backing file.
Note that this uses SEEK_HOLE/SEEK_DATA to locate the data
available to be read from the cache. Whilst this is an improvement
from the bmap interface, it still has a problem with regard to a
modern extent-based filesystem inserting or removing bridging
blocks of zeros. Fixing that requires a much greater overhaul.
This is a step towards overhauling the fscache API. The change is
opt-in on the part of the network filesystem. A netfs should not try
to mix the old and the new API because of conflicting ways of handling
pages and the PG_fscache page flag and because it would be mixing DIO
with buffered I/O. Further, the helper library can't be used with the
old API.
This does not change any of the fscache cookie handling APIs or the
way invalidation is done at this time.
In the near term, I intend to deprecate and remove the old I/O API
(fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(),
fscache_write_page() and fscache_uncache_page()) and eventually
replace most of fscache/cachefiles with something simpler and easier
to follow.
This patchset contains the following parts:
- Some helper patches, including provision of an ITER_XARRAY iov
iterator and a function to do readahead expansion.
- Patches to add the netfs helper library.
- A patch to add the fscache/cachefiles kiocb API.
- A pair of patches to fix some review issues in the ITER_XARRAY and
read helpers as spotted by Al and Willy.
Jeff Layton has patches to add support in Ceph for this that he
intends for this merge window. I have a set of patches to support AFS
that I will post a separate pull request for.
With this, AFS without a cache passes all expected xfstests; with a
cache, there's an extra failure, but that's also there before these
patches. Fixing that probably requires a greater overhaul. Ceph also
passes the expected tests.
I also have patches in a separate branch to tidy up the handling of
PG_fscache/PG_private_2 and their contribution to page refcounting in
the core kernel here, but I haven't included them in this set and will
route them separately"
Link: https://lore.kernel.org/lkml/3779937.1619478404@warthog.procyon.org.uk/
* tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
netfs: Miscellaneous fixes
iov_iter: Four fixes for ITER_XARRAY
fscache, cachefiles: Add alternate API to use kiocb for read/write to cache
netfs: Add a tracepoint to log failures that would be otherwise unseen
netfs: Define an interface to talk to a cache
netfs: Add write_begin helper
netfs: Gather stats
netfs: Add tracepoints
netfs: Provide readahead and readpage netfs helpers
netfs, mm: Add set/end/wait_on_page_fscache() aliases
netfs, mm: Move PG_fscache helper funcs to linux/netfs.h
netfs: Documentation for helper library
netfs: Make a netfs helper module
mm: Implement readahead_control pageset expansion
mm/readahead: Handle ractl nr_pages being modified
fs: Document file_ra_state
mm/filemap: Pass the file_ra_state in the ractl
mm: Add set/end/wait functions for PG_private_2
iov_iter: Add ITER_XARRAY
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYIfiiwAKCRCRxhvAZXjc
ogtMAQC+MtgJZdcH5iDHNEyI36JaWUccKRV7PdvfF1YgnXO45gD+IYxR1c/EQQyD
kh2AmqhET6jVhe9Nsob5yxduksI+ygo=
=oh/d
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.helpers.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull fs mapping helper updates from Christian Brauner:
"This adds kernel-doc to all new idmapping helpers and improves their
naming which was triggered by a discussion with some fs developers.
Some of the names are based on suggestions by Vivek and Al.
Also remove the open-coded permission checking in a few places with
simple helpers. Overall this should lead to more clarity and make it
easier to maintain"
* tag 'fs.idmapped.helpers.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: introduce two inode i_{u,g}id initialization helpers
fs: introduce fsuidgid_has_mapping() helper
fs: document and rename fsid helpers
fs: document mapping helpers
Pull fileattr conversion updates from Miklos Szeredi via Al Viro:
"This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a
separate method.
The interface is reasonably uniform across the filesystems that
support it and gives nice boilerplate removal"
* 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits)
ovl: remove unneeded ioctls
fuse: convert to fileattr
fuse: add internal open/release helpers
fuse: unsigned open flags
fuse: move ioctl to separate source file
vfs: remove unused ioctl helpers
ubifs: convert to fileattr
reiserfs: convert to fileattr
ocfs2: convert to fileattr
nilfs2: convert to fileattr
jfs: convert to fileattr
hfsplus: convert to fileattr
efivars: convert to fileattr
xfs: convert to fileattr
orangefs: convert to fileattr
gfs2: convert to fileattr
f2fs: convert to fileattr
ext4: convert to fileattr
ext2: convert to fileattr
btrfs: convert to fileattr
...
Upon file deletion, zero out all fields in ext4_dir_entry2 besides rec_len.
In case sensitive data is stored in filenames, this ensures no potentially
sensitive data is left in the directory entry upon deletion. Also, wipe
these fields upon moving a directory entry during the conversion to an
htree and when splitting htree nodes.
The data wiped may still exist in the journal, but there are future
commits planned to address this.
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Link: https://lore.kernel.org/r/20210422180834.2242353-1-leah.rumancik@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Eric has noticed that after pagecache read rework, generic/418 is
occasionally failing for ext4 when blocksize < pagesize. In fact, the
pagecache rework just made hard to hit race in ext4 more likely. The
problem is that since ext4 conversion of direct IO writes to iomap
framework (commit 378f32bab3), we update inode size after direct IO
write only after invalidating page cache. Thus if buffered read sneaks
at unfortunate moment like:
CPU1 - write at offset 1k CPU2 - read from offset 0
iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
ext4_readpage();
ext4_handle_inode_extension()
the read will zero out tail of the page as it still sees smaller inode
size and thus page cache becomes inconsistent with on-disk contents with
all the consequences.
Fix the problem by moving inode size update into end_io handler which
gets called before the page cache is invalidated.
Reported-and-tested-by: Eric Whitney <enwlinux@gmail.com>
Fixes: 378f32bab3 ("ext4: introduce direct I/O write using iomap infrastructure")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20210415155417.4734-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Some filesystem's use a digest of their uuid for f_fsid.
Create a simple wrapper for this open coded folding.
Filesystems that have a non null uuid but use the block device
number for f_fsid may also consider using this helper.
[JK: Added missing asm/byteorder.h include]
Link: https://lore.kernel.org/r/20210322173944.449469-2-amir73il@gmail.com
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This is needed to allow generic/607 to pass for file systems with the
inline data_feature enabled, and it allows the use of file systems
where the directories use inline_data, while the files are accessed
via DAX.
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Using no_printk() for jbd_debug() revealed two warnings:
fs/jbd2/recovery.c: In function 'fc_do_one_pass':
fs/jbd2/recovery.c:256:30: error: format '%d' expects a matching 'int' argument [-Werror=format=]
256 | jbd_debug(3, "Processing fast commit blk with seq %d");
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/ext4/fast_commit.c: In function 'ext4_fc_replay_add_range':
fs/ext4/fast_commit.c:1732:30: error: format '%d' expects argument of type 'int', but argument 2 has type 'long unsigned int' [-Werror=format=]
1732 | jbd_debug(1, "Converting from %d to %d %lld",
The first one was added incorrectly, and was also missing a few newlines
in debug output, and the second one happened when the type of an
argument changed.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: d556435156 ("jbd2: avoid -Wempty-body warnings")
Fixes: 6db0746189 ("ext4: use BIT() macro for BH_** state bits")
Fixes: 5b849b5f96 ("jbd2: fast commit recovery path")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20210409201211.1866633-1-arnd@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Made suggested modifications from checkpatch in reference to ERROR:
trailing whitespace
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20210409042035.15516-1-jack.qiu@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In case of if not ext4_fc_add_tlv branch, an error return code is missing.
Cc: stable@kernel.org
Fixes: aa75f4d3da ("ext4: main fast-commit commit path")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Xu Yihang <xuyihang@huawei.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210408070033.123047-1-xuyihang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fix As write_mmp_block() so that it returns -EIO instead of 1, so that
the correct error gets saved into the superblock.
Cc: stable@kernel.org
Fixes: 54d3adbc29 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()")
Reported-by: Liu Zhi Qiang <liuzhiqiang26@huawei.com>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210406025331.148343-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We should set the error code when ext4_commit_super check argument failed.
Found in code review.
Fixes: c4be0c1dc4 ("filesystem freeze: add error handling of write_super_lockfs/unlockfs").
Cc: stable@kernel.org
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210402101631.561-1-changfengnan@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Before commit 014c9caa29 ("ext4: make ext4_abort() use
__ext4_error()"), the following series of commands would trigger a
panic:
1. mount /dev/sda -o ro,errors=panic test
2. mount /dev/sda -o remount,abort test
After commit 014c9caa29, remounting a file system using the test
mount option "abort" will no longer trigger a panic. This commit will
restore the behaviour immediately before commit 014c9caa29.
(However, note that the Linux kernel's behavior has not been
consistent; some previous kernel versions, including 5.4 and 4.19
similarly did not panic after using the mount option "abort".)
This also makes a change to long-standing behaviour; namely, the
following series commands will now cause a panic, when previously it
did not:
1. mount /dev/sda -o ro,errors=panic test
2. echo test > /sys/fs/ext4/sda/trigger_fs_error
However, this makes ext4's behaviour much more consistent, so this is
a good thing.
Cc: stable@kernel.org
Fixes: 014c9caa29 ("ext4: make ext4_abort() use __ext4_error()")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20210401081903.3421208-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The buffer uptodate state has been checked in function set_buffer_uptodate,
there is no need use buffer_uptodate before calling set_buffer_uptodate and
delete it.
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Yang Guo <guoyang2@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/1617260610-29770-1-git-send-email-zhangshaokun@hisilicon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When CONFIG_QUOTA is enabled, if we failed to mount the filesystem due
to some error happens behind ext4_orphan_cleanup(), it will end up
triggering a after free issue of super_block. The problem is that
ext4_orphan_cleanup() will set SB_ACTIVE flag if CONFIG_QUOTA is
enabled, after we cleanup the truncated inodes, the last iput() will put
them into the lru list, and these inodes' pages may probably dirty and
will be write back by the writeback thread, so it could be raced by
freeing super_block in the error path of mount_bdev().
After check the setting of SB_ACTIVE flag in ext4_orphan_cleanup(), it
was used to ensure updating the quota file properly, but evict inode and
trash data immediately in the last iput does not affect the quotafile,
so setting the SB_ACTIVE flag seems not required[1]. Fix this issue by
just remove the SB_ACTIVE setting.
[1] https://lore.kernel.org/linux-ext4/99cce8ca-e4a0-7301-840f-2ace67c551f3@huawei.com/T/#m04990cfbc4f44592421736b504afcc346b2a7c00
Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210331033138.918975-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Block bitmap prefetching is needed for these allocator optimization
data structures to get populated and provide better group scanning
order. So, turn it on bu default. prefetch_block_bitmaps mount option
is now marked as removed and a new option no_prefetch_block_bitmaps is
added to disable block bitmap prefetching.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-8-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fallback to cr 1
immediately.
At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain a rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).
For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fallback to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.
There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.
All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".
With this patchset, following experiment was performed:
Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. Following command was run consecutively for 3
times:
time dd if=/dev/urandom of=file bs=2M count=10
Here are the results with and without cr 0/1 optimizations introduced
in this patch:
|---------+------------------------------+---------------------------|
| | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s | 2m47.642s |
| 2nd run | 2m28.390s | 0m0.611s |
| 3rd run | 2m26.530s | 0m1.255s |
|---------+------------------------------+---------------------------|
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
A few arrays in mballoc.c use the total number of valid orders as
their size. Currently, this value is set as "sb->s_blocksize_bits +
2". This makes code harder to read. So, instead add a new macro
MB_NUM_ORDERS(sb) to make the code more readable.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Before this patch, the function parse_options() was returning
journal_devnum and journal_ioprio variables to the caller. This patch
generalizes that interface to allow parse_options to return any parsed
options to return back to the caller. In this patch series, it gets
used to capture the value of "mb_optimize_scan=%u" mount option.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
s_mb_buddies_generated gets used later in this patch series to
determine if the cr 0 and cr 1 optimziations should be performed or
not. Currently, s_mb_buddies_generated is protected under a
spin_lock. In the allocation path, it is better if we don't depend on
the lock and instead read the value atomically. In order to do that,
we drop s_bal_lock altogether and we convert the only two protected
fields by it s_mb_buddies_generated and s_mb_generation_time to atomic
type.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit <50122847007> ("ext4: fix check to prevent initializing reserved
inodes") check the block group zero and prevent initializing reserved
inodes. But in some special cases, the reserved inode may not all belong
to the group zero, it may exist into the second group if we format
filesystem below.
mkfs.ext4 -b 4096 -g 8192 -N 1024 -I 4096 /dev/sda
So, it will end up triggering a false positive report of a corrupted
file system. This patch fix it by avoid check reserved inodes if no free
inode blocks will be zeroed.
Cc: stable@kernel.org
Fixes: 5012284700 ("ext4: fix check to prevent initializing reserved inodes")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210331121516.2243099-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.
Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
Matching names with casefolded encrypting directories requires
decrypting entries to confirm case since we are case preserving. We can
avoid needing to decrypt if our hash values don't match.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Link: https://lore.kernel.org/r/20210319073414.1381041-3-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This adds support for encryption with casefolding.
Since the name on disk is case preserving, and also encrypted, we can no
longer just recompute the hash on the fly. Additionally, to avoid
leaking extra information from the hash of the unencrypted name, we use
siphash via an fscrypt v2 policy.
The hash is stored at the end of the directory entry for all entries
inside of an encrypted and casefolded directory apart from those that
deal with '.' and '..'. This way, the change is backwards compatible
with existing ext4 filesystems.
[ Changed to advertise this feature via the file:
/sys/fs/ext4/features/encrypted_casefold -- TYT ]
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210319073414.1381041-2-drosen@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Give filesystem two little helpers that do the right thing when
initializing the i_uid and i_gid fields on idmapped and non-idmapped
mounts. Filesystems shouldn't have to be concerned with too many
details.
Link: https://lore.kernel.org/r/20210320122623.599086-5-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Vivek pointed out that the fs{g,u}id_into_mnt() naming scheme can be
misleading as it could be understood as implying they do the exact same
thing as i_{g,u}id_into_mnt(). The original motivation for this naming
scheme was to signal to callers that the helpers will always take care
to map the k{g,u}id such that the ownership is expressed in terms of the
mnt_users.
Get rid of the confusion by renaming those helpers to something more
sensible. Al suggested mapped_fs{g,u}id() which seems a really good fit.
Usually filesystems don't need to bother with these helpers directly
only in some cases where they allocate objects that carry {g,u}ids which
are either filesystem specific (e.g. xfs quota objects) or don't have a
clean set of helpers as inodes have.
Link: https://lore.kernel.org/r/20210320122623.599086-3-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmBXj1oACgkQ8vlZVpUN
gaNnAwgAqZJ0S/Hctexs+v+DNvuyMxsA84pB/9KYlK2zgbBOyK5Iftxjqxb9Sb6j
6XKQOIaP2EXYJ0MDWW/fDMUHatlJvXUp+A9kLTiOLMDaRXbobQzb5jlGg9ZB/pBj
TzISrR4widiqJbVT2RFpO9O7B75BQqlpqFNfkF/yJ9CU/ozAw9x+voPcZK7q8/Sh
+DeQCARvgfx1ZipHGTYKjJdujA0qGcDfboYJpgId/gA5Zi76tx4NlbeXAM2QmRfh
zAd1NzFhqf7JmKDAWDdUeRnrDHcje9FLcAxo7Quq7YWxRKFsOCz9LTxazL2UIoa2
HvGpMD23qmISCLUyyrfnrpGPj/mD2w==
=xcuH
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Miscellaneous ext4 bug fixes for v5.12"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: initialize ret to suppress smatch warning
ext4: stop inode update before return
ext4: fix rename whiteout with fast commit
ext4: fix timer use-after-free on failed mount
ext4: fix potential error in ext4_do_update_inode
ext4: do not try to set xattr into ea_inode if value is empty
ext4: do not iput inode under running transaction in ext4_rename()
ext4: find old entry again if failed to rename whiteout
ext4: fix error handling in ext4_end_enable_verity()
ext4: fix bh ref count on error paths
fs/ext4: fix integer overflow in s_log_groups_per_flex
ext4: add reclaim checks to xattr code
ext4: shrink race window in ext4_should_retry_alloc()
This patch adds rename whiteout support in fast commits. Note that the
whiteout object that gets created is actually char device. Which
imples, the function ext4_inode_journal_mode(struct inode *inode)
would return "JOURNAL_DATA" for this inode. This has a consequence in
fast commit code that it will make creation of the whiteout object a
fast-commit ineligible behavior and thus will fall back to full
commits. With this patch, this can be observed by running fast commits
with rename whiteout and seeing the stats generated by ext4_fc_stats
tracepoint as follows:
ext4_fc_stats: dev 254:32 fc ineligible reasons:
XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0,
RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16;
num_commits:6, ineligible: 6, numblks: 3
So in short, this patch guarantees that in case of rename whiteout, we
fall back to full commits.
Amir mentioned that instead of creating a new whiteout object for
every rename, we can create a static whiteout object with irrelevant
nlink. That will make fast commits to not fall back to full
commit. But until this happens, this patch will ensure correctness by
falling back to full commits.
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Cc: stable@kernel.org
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210316221921.1124955-1-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When filesystem mount fails because of corrupted filesystem we first
cancel the s_err_report timer reminding fs errors every day and only
then we flush s_error_work. However s_error_work may report another fs
error and re-arm timer thus resulting in timer use-after-free. Fix the
problem by first flushing the work and only after that canceling the
s_err_report timer.
Reported-by: syzbot+628472a2aac693ab0fcd@syzkaller.appspotmail.com
Fixes: 2d01ddc866 ("ext4: save error info to sb through journal if available")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210315165906.2175-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If set_large_file = 1 and errors occur in ext4_handle_dirty_metadata(),
the error code will be overridden, go to out_brelse to avoid this
situation.
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Link: https://lore.kernel.org/r/20210312065051.36314-1-luoshijie1@huawei.com
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Syzbot report a warning that ext4 may create an empty ea_inode if set
an empty extent attribute to a file on the file system which is no free
blocks left.
WARNING: CPU: 6 PID: 10667 at fs/ext4/xattr.c:1640 ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640
...
Call trace:
ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640
ext4_xattr_block_set+0x1d0/0x1b1c fs/ext4/xattr.c:1942
ext4_xattr_set_handle+0x8a0/0xf1c fs/ext4/xattr.c:2390
ext4_xattr_set+0x120/0x1f0 fs/ext4/xattr.c:2491
ext4_xattr_trusted_set+0x48/0x5c fs/ext4/xattr_trusted.c:37
__vfs_setxattr+0x208/0x23c fs/xattr.c:177
...
Now, ext4 try to store extent attribute into an external inode if
ext4_xattr_block_set() return -ENOSPC, but for the case of store an
empty extent attribute, store the extent entry into the extent
attribute block is enough. A simple reproduce below.
fallocate test.img -l 1M
mkfs.ext4 -F -b 2048 -O ea_inode test.img
mount test.img /mnt
dd if=/dev/zero of=/mnt/foo bs=2048 count=500
setfattr -n "user.test" /mnt/foo
Reported-by: syzbot+98b881fdd8ebf45ab4ae@syzkaller.appspotmail.com
Fixes: 9c6e7853c5 ("ext4: reserve space for xattr entries/names")
Cc: stable@kernel.org
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210305120508.298465-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In ext4_rename(), when RENAME_WHITEOUT failed to add new entry into
directory, it ends up dropping new created whiteout inode under the
running transaction. After commit <9b88f9fb0d2> ("ext4: Do not iput inode
under running transaction"), we follow the assumptions that evict() does
not get called from a transaction context but in ext4_rename() it breaks
this suggestion. Although it's not a real problem, better to obey it, so
this patch add inode to orphan list and stop transaction before final
iput().
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210303131703.330415-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If we failed to add new entry on rename whiteout, we cannot reset the
old->de entry directly, because the old->de could have moved from under
us during make indexed dir. So find the old entry again before reset is
needed, otherwise it may corrupt the filesystem as below.
/dev/sda: Entry '00000001' in ??? (12) has deleted/unused inode 15. CLEARED.
/dev/sda: Unattached inode 75
/dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Fixes: 6b4b8e6b4a ("ext4: fix bug for rename with RENAME_WHITEOUT")
Cc: stable@vger.kernel.org
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20210303131703.330415-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4 didn't properly clean up if verity failed to be enabled on a file:
- It left verity metadata (pages past EOF) in the page cache, which
would be exposed to userspace if the file was later extended.
- It didn't truncate the verity metadata at all (either from cache or
from disk) if an error occurred while setting the verity bit.
Fix these bugs by adding a call to truncate_inode_pages() and ensuring
that we truncate the verity metadata (both from cache and from disk) in
all error paths. Also rework the code to cleanly separate the success
path from the error paths, which makes it much easier to understand.
Reported-by: Yunlei He <heyunlei@hihonor.com>
Fixes: c93d8f8858 ("ext4: add basic fs-verity support")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20210302200420.137977-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>