Currently btrfs uses end_io functions to jump between different stages
of recovery.
For example, we go the following different functions:
- raid56_bio_end_io()
This handles the read for all the sectors (except the missing device).
- __raid_recover_end_io()
This does the real work, it's called inside the delayed work function
raid_recover_end_io_work().
This one recovery path involves at least 3 different functions, which is
a big burden for readers.
This patch will change the behavior by:
- Introduce a unified recovery entrance, recover_rbio()
- Use submit-and-wait method
So the workflow is not interrupted by the endio function jump.
This doesn't bring performance change, but reduce the burden for
reviewers.
- Run the main function in the rmw_workers workqueue
Now raid56_parity_recover() only needs to setup the work, and
queue the work using start_async_work().
Now readers only need to do one function jump (start_async_work()) to
find out the main entrance of recovery path.
Furthermore, recover_rbio() function can easily be reused by other paths.
The old recovery path is still utilized by degraded write path.
It will be cleaned up when we have migrated the write path.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This includes extra changes:
- The allocation for unmap_array[] and pointers[]
Now we allocate them in one go, and free them together.
- Remove @err
Use errno_to_blk_status(ret) instead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This new helper will be also utilized in the incoming refactor of
recovery path.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently finish_rmw() will update the P/Q stripes before submitting
the writes.
It's done behind a for(;;) loop, it's a little congested indent-wise, so
extract the code into a helper called generate_pq_vertical().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This refactor includes the following behavior change first:
- Don't error out if only P/Q is corrupted
The old code will directly error out if only P/Q is corrupted.
Although it is an logical error if we go into rebuild path with
only P/Q corrupted, there is no need to error out.
Just skip the rebuild and return the already good data.
Then comes the following refactor which shouldn't cause behavior
changes:
- Introduce a helper to do vertical stripe recovery
This not only reduce one indent level, but also paves the road for
later data checksum verification in RMW cycles.
- Sort rbio->faila/b before recovery
So we don't need to do the same swap every vertical stripe
- Replace a BUG_ON() with ASSERT()
Or checkpatch won't let me pass.
- Mark recovered sectors uptodate after the recover loop
- Do the cleanup for pointers unconditionally
We only need to initialize @pointers and @unmap_array to NULL, so
we can safely free them unconditionally.
- Mark the repaired sector uptodate in recover_vertical()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The two structures appear on the same call paths, btrfs_bio_ctrl is
embedded in extent_page_data and we pass bio_ctrl to some functions.
After merging there are fewer indirections and we have only one control
structure. The packing remains same.
The btrfs_bio_ctrl was selected as the target structure as the operation
is closer to bio processing.
Structure layout:
struct btrfs_bio_ctrl {
struct bio * bio; /* 0 8 */
int mirror_num; /* 8 4 */
enum btrfs_compression_type compress_type; /* 12 4 */
u32 len_to_stripe_boundary; /* 16 4 */
u32 len_to_oe_boundary; /* 20 4 */
btrfs_bio_end_io_t end_io_func; /* 24 8 */
bool extent_locked; /* 32 1 */
bool sync_io; /* 33 1 */
/* size: 40, cachelines: 1, members: 8 */
/* padding: 6 */
/* last cacheline: 40 bytes */
};
Signed-off-by: David Sterba <dsterba@suse.com>
The semantics of the two members is a boolean, so change the type
accordingly. We have space in extent_page_data due to alignment there's
no change in size.
Signed-off-by: David Sterba <dsterba@suse.com>
The div_factor* helpers calculate fraction or percentage fraction. The
name is a bit confusing, we use it only for percentage calculations and
there are two helpers.
There's a helper mult_frac that's for general fractions, that tries to
be accurate but we multiply and divide by small numbers so we can use
the div_u64 helper.
Rename the div_factor* helpers and use 1..100 percentage range, also drop
the case checking for percentage == 100, it's never hit.
The conversions:
* div_factor calculates tenths and the numbers need to be adjusted
* div_factor_fine is direct replacement
Signed-off-by: David Sterba <dsterba@suse.com>
If when doing a direct IO write we need to fallback to buffered IO, we
this comment at btrfs_direct_write() that says we can't directly fallback
to buffered IO if we have a NOWAIT iocb, because we have no support for
NOWAIT buffered writes. That is not true anymore, as support for NOWAIT
buffered writes was added recently in commit 926078b21d ("btrfs: enable
nowait async buffered writes").
However we still can't fallback to a buffered write in case we have a
NOWAIT iocb, because we'll need to flush delalloc and wait for it to
complete after doing the buffered write, and that can block for several
reasons, the main reason being waiting for IO to complete.
So update the comment to mention all that.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The header files should use the /* */ comment style, introduced in
commit f3a84ccd28 ("btrfs: move the tree mod log code into its own
file").
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have inline extent read code behind two levels of
indentation, factor them them out into a new function,
read_inline_extent(), to make it a little easier to read.
Since we're here, also remove @extent_offset and @pg_offset arguments
from uncompress_inline() function, as it's not possible to have inline
extents at non-inline file offset.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The argument @new_inline changes the following members of extent_map:
- em->compress_type
- EXTENT_FLAG_COMPRESSED of em->flags
However neither members makes a difference for inline extents:
- Inline extent read never use above em members
As inside btrfs_get_extent() we directly use the file extent item to
do the read.
- Inline extents are never to be split
Thus code really needs em->compress_type or that flag will never be
executed on inlined extents.
(btrfs_drop_extent_cache() would be one example)
- Fiemap no longer relies on extent maps
Recent fiemap optimization makes fiemap to search subvolume tree
directly, without using any extent map at all.
Thus those members make no difference for inline extents any more.
Furthermore such exception without much explanation is really a source
of confusion.
Thus this patch will completely remove the argument, and always set the
involved members, unifying the behavior.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently for inline extents read inside btrfs_get_extent(), we will
reset several extent map members:
- em->start
Reset to extent_start, which is completely unnecessary.
The extent_start and em->start should have already be zero, ensured by
tree-checker already.
- em->len
Reset the round_up(copy_size, fs_info->sectorsize), which is again
unnecessary.
- em->orig_block_len
Reset to em->len (sectorsize), while it is originally unset from
btrfs_extent_item_to_extent_map().
This makes no difference, as all extent map handling paths will
ignore the orig_block_len if they found it's an inlined extent.
Such inline extent orig_block_len ignoring examples can be found in
btrfs_drop_extent_cache().
- em->orig_start
Reset to em->start (0), while it is originally set to EXTENT_MAP_HOLE.
This makes no difference either, as all extent map handling paths will
ignore the em->orig_start if they found it's an inline extent.
Thus all these em members resetting are unnecessary.
Replace them with ASSERT()s checking the only two members (block_start
and length) that make sense.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we calculate inline extent read in a way that inline extent
can start at non-zero offset.
This is consistent with the inode selftests, which puts an inline extent
at file offset 5.
Meanwhile the inline extent creation code will only create inline extent
at file offset 0.
Furthermore with the introduction of tree-checker on file extents, we are
actively rejecting inline extent which starts at non-zero file offset.
And so far we haven't yet seen any report of rejected inline extents at
non-zero file offset.
This all means, the extra calculation to support inline extents at
non-zero file offset is mostly paper weight, and damaging the
readability of the code.
Thus this patch will:
- Add extra ASSERT()s to make sure involved file offset are all 0
- Remove @extent_offset calculation
- Simplify the involved code
As several variables are now single-use, no need to declare them as
a variable anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In our inode-tests.c, we create an inline offset at file offset 5, which
is no longer possible since the introduction of tree-checker.
Thus I don't think we should spend time maintaining some corner cases
which are already ruled out by tree-checker.
So this patch will:
- Change the inline extent to start at file offset 0
Also change its length to 6 to cover the original length
- Add an extra ASSERT() for btrfs_add_extent_mapping()
This is to make sure tree-checker is working correctly.
- Update the inode selftest
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into orphan.h to cut down on code in ctree.h.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will make syncing fs.h to user space a little easier if we can pull
the super block specific helpers out of fs.h and put them in super.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into super.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have a few of these in fs.h, move the remaining checks out of
ctree.h into fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into verity.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already have a dev-replace.h, simply move these prototypes and
helpers into dev-replace.h where they belong.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into scrub.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into relocation.h to cut down on code in
ctree.h
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into acl.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These belong in extent-tree.h, they were missed because they were not
grouped with the other extent-tree.c prototypes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The code for these functions are in messages.c, move the defines and
prototypes to messages.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into file.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into ioctl.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into uuid-tree.h to cut down on the code in
ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these prototypes out of ctree.h and into file-item.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these prototypes out of ctree.h and into their own header file.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the defrag code is all in one file, create a defrag.h and move
all the defrag related prototypes and helper out of ctree.h and into
defrag.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the other big portion of defrag code that has existed in
ioctl.c. Move it to its new home in defrag.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This currently exists in file.c, move it to the more natural location in
defrag.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ reformat comments ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This currently has only one helper in it, and it's for tree based
defrag. We have the various defrag code in 3 different places, so
rename this to defrag.c. Followup patches will move the code into this
new file.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I initially wanted to make a new header file for this, but these
prototypes do naturally fit into btrfs_inode.h. If we want to extract
vfs from pure btrfs code in the future we may need to split this up, but
btrfs_inode embeds the vfs_inode, so it makes sense to put the
prototypes in this header for now.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These helpers are core to btrfs, and in order to more easily sync
various parts of the btrfs kernel code into btrfs-progs we need to be
able to carry these helpers with us. However we want to have our own
implementation for the helpers themselves, currently they're implemented
in different files that we want to sync inside of btrfs-progs itself.
Move these into their own C file, this will allow us to contain our
overrides in btrfs-progs in it's own file without messing with the rest
of the codebase.
In copying things over I fixed up a few whitespace errors that already
existed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When moving the printk messages into their own file I got a compiler
error because the includes grabbed compression.h, but nothing pulled in
the blk_types.h dependency that compression.h has because it uses
blkstatus_t. Add blk_types.h to compression.h so that this sort of
thing doesn't happen in the future.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's several structures that are embedded inside of fs_info.h, so if
we don't have all the proper includes when we include fs.h we'll get a
variety of compile errors. I fixed this by adding a temporary c file
that just had #include "fs.h" and then added include files until the
compiler stopped complaining.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used by the volumes code and the tree checker code. We want to
maintain inline however, so simply move it to volumes.h.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Do away with the defines and use an enum as it's cleaner.
Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update, reformat or reword function comments. This also removes the kdoc
marker so we don't get reports when the function name is missing.
Changes made:
- remove kdoc markers
- reformat the brief description to be a proper sentence
- reword to imperative voice
- align parameter list
- fix typos
Signed-off-by: David Sterba <dsterba@suse.com>
The last user of this was removed in 7f9fe61440 ("btrfs: improve
global reserve stealing logic"), drop this code as it's no longer called
by anybody.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I wrote the following coccinelle script to find function declarations
that didn't have the corresponding code for them
@funcproto@
identifier func;
type T;
position p0;
@@
T func@p0(...);
@funccode@
identifier funcproto.func;
position p1;
@@
func@p1(...) { ... }
@script:python depends on !funccode@
p0 << funcproto.p0;
@@
print("Proto with no function at %s:%s" % (p0[0].file, p0[0].line))
and ran it against btrfs, which identified the 4 function prototypes
I've removed in this patch.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move all the root-tree.c prototypes to root-tree.h, and then update all
the necessary files to include the new header.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This batch of prototypes no longer have code associated with them, so
remove them.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These exist in delalloc-space.c, move them from ctree.h into
delalloc-space.h.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move all the extent tree related prototypes to extent-tree.h out of
ctree.h, and then go include it everywhere needed so everything
compiles.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was prototyped in ctree.h and the code existed in extent-tree.c,
but it's space-info related so move it into space-info.c.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are defined already in space-info.h, remove them from ctree.h.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've accumulated some whitespace problems in ctree.h, clean these up.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These more naturally fit in with the locking related code, and they're
all defines so they can easily go anywhere, move them out of ctree.h
into locking.h
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a lot of the fs_info related helpers and stuff
isolated, copy these over to fs.h out of ctree.h.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
For directories with encrypted files/filenames, we need to store a flag
indicating this fact. There's no room in other fields, so we'll need to
borrow a bit from dir_type. Since it's now a combination of type and
flags, we rename it to dir_flags to reflect its new usage.
The new flag, FT_ENCRYPTED, indicates a directory containing encrypted
data, which is orthogonal to file type; therefore, add the new
flag, and make conversion from directory type to file type strip the
flag.
As the file types almost never change we can afford to use the bits.
Actual usage will be guarded behind an incompat bit, this patch only
adds the support for later use by fscrypt.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While struct qstr is more natural without fscrypt, since it's provided
by dentries, struct fscrypt_str is provided by the fscrypt handlers
processing dentries, and is thus more natural in the fscrypt world.
Replace all of the struct qstr uses with struct fscrypt_str.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most places where we get a struct qstr, we are doing so from a dentry.
With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt
provides a helper to convert a dentry name to the appropriate disk name
if necessary. Convert each of the dentry name accesses to use
fscrypt_setup_filename(), then convert the resulting fscrypt_name back
to an unencrypted qstr. This does not work for nokey names, but the
specific locations that could spawn nokey names are noted.
At present, since there are no encrypted directories, nothing goes down
the filename encryption paths.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Many functions throughout btrfs take name buffer and name length
arguments. Most of these functions at the highest level are usually
called with these arguments extracted from a supplied dentry's name.
But the entire name can be passed instead, making each function a little
more elegant.
Each function whose arguments are currently the name and length
extracted from a dentry is herein converted to instead take a pointer to
the name in the dentry. The couple of calls to these calls without a
struct dentry are converted to create an appropriate qstr to pass in.
Additionally, every function which is only called with a name/len
extracted directly from a qstr is also converted.
This change has positive effect on stack consumption, frame of many
functions is reduced but this will be used in the future for fscrypt
related structures.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The module exit function exit_btrfs_fs() is duplicating a section of code
in init_btrfs_fs(). Add a helper to remove the duplicated code. Due
to the init/exit section requirements the function must be inline and
not a plain static as it could cause section mismatch.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pas GFP_KERNEL as parameter so we can use it directly in
alloc_scrub_sector.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one caller that calls scrub_setup_recheck_block in the
memalloc_nofs_save/_restore protection so it's effectively already
GFP_NOFS and it's safe to use GFP_KERNEL.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass GFP_NOFS, we can drop the parameter and use it
directly.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one caller that passes GFP_NOFS, we can drop the parameter
an use the flags directly.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was added while I was moving this code to its new home, it can be
removed now.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is specific to the item-accessor code, move it out of ctree.h into
accessor.h/.c and then update the users to include the new header file.
This un-inlines btrfs_init_map_token, however this is only called once
per function so it's not critical to be inlined. This also saves 904
bytes of code on a release build.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename struct-funcs.c to accessors.c so we can move the item accessors
out of ctree.h. accessors.c is a better description of the code that is
contained in these files.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is fs wide information, move it out of ctree.h into fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're not using this code anywhere we can remove it as well as
the member from fs_info.
We don't have any mount options or on/off features that would utilize
the pending infrastructure, the last one was inode_cache.
There was a patchset [1] to enable some features from sysfs that would
break things if it would be set immediately. In case we'll need that
kind of logic again the patch can be reverted, but for the current use
it can be replaced by the single state bit to do the commit.
[1] https://lore.kernel.org/linux-btrfs/1422609654-19519-1-git-send-email-quwenruo@cn.fujitsu.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we are only using fs_info->pending_changes to indicate that we
need a transaction commit. The original users for this were removed
years ago and we don't have more usage in sight, so this is the only
remaining reason to have this field. Add a flag so we can remove this
code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These definitions are fs wide, take them out of ctree.h and put them in
fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are fs wide definitions and helpers, move them out of ctree.h and
into fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These helpers use functions not defined in fs.h, they're simply
accessors of the super block in fs_info, convert them to macros so
that we don't have a weird dependency between fs.h and accessors.h.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to use fs.h to hold fs wide related helpers and definitions,
move the FS_STATE enum and related helpers to fs.h, and then update all
files that need these definitions to include fs.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The printk index work can be pushed into the printk helpers themselves,
this allows us to further sanitize messages.h, removing the last
include in the header itself.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a bunch of printk helpers that are in ctree.h. These have
nothing to do with ctree.c, so move them into their own header.
Subsequent patches will cleanup the printk helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These call functions that aren't defined in, or will be moved out of,
ctree.h Move them to super.c where the other assert/error message code
is defined. Drop the __noreturn attribute for btrfs_assertfail as
objtool does not like it and fails with warnings like
fs/btrfs/dir-item.o: warning: objtool: .text.unlikely: unexpected end of section
fs/btrfs/xattr.o: warning: objtool: btrfs_setxattr() falls through to next function btrfs_setxattr_trans.cold()
fs/btrfs/xattr.o: warning: objtool: .text.unlikely: unexpected end of section
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several fs wide related helpers in ctree.h. The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code. Move these into a fs.h header, which will
serve as the location for file system wide related helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a define for the data buffer size (though the maximum size is not
limited by it) BTRFS_SEND_BUF_SIZE_V2 so it's more visible.
Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Callers that pass non-zero generation always want to perform the
generation check, we can simply encode that in one parameter and drop
check_generation. Add function documentation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a request to automatically enable async discard for capable
devices. We can do that, the async mode is designed to wait for larger
freed extents and is not intrusive, with limits to iops, kbps or latency.
The status and tunables will be exported in /sys/fs/btrfs/FSID/discard .
The automatic selection is done if there's at least one discard capable
device in the filesystem (not capable devices are skipped). Mounting
with any other discard option will honor that option, notably mounting
with nodiscard will keep it disabled.
Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The sysfs_emit is the safe API for writing to the sysfs files,
previously converted from scnprintf, there's one left to do in
btrfs_read_policy_show.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We sometimes have to allocate new extent states when clearing or setting
new bits in an extent io tree. Generally we preallocate this before
taking the tree spin lock, but we can use this preallocated extent state
sometimes and then need to try to do a GFP_ATOMIC allocation under the
lock.
Unfortunately sometimes this fails, and then we hit the BUG_ON() and
bring the box down. This happens roughly 20 times a week in our fleet.
However the vast majority of callers use GFP_NOFS, which means that if
this GFP_ATOMIC allocation fails, we could simply drop the spin lock, go
back and allocate a new extent state with our given gfp mask, and begin
again from where we left off.
For the remaining callers that do not use GFP_NOFS, they are generally
using GFP_NOWAIT, which still allows for some reclaim. So allow these
allocations to attempt to happen outside of the spin lock so we don't
need to rely on GFP_ATOMIC allocations.
This in essence creates an infinite loop for anything that isn't
GFP_NOFS. To address this we may want to migrate to using mempools for
extent states so that we will always have emergency reserves in order to
make our allocations.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As of "btrfs: do not use GFP_ATOMIC in the read endio" we no longer have
any users of unlock_extent_atomic, remove it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have done read endio in an async thread for a very, very long time,
which makes the use of GFP_ATOMIC and unlock_extent_atomic() unneeded in
our read endio path. We've noticed under heavy memory pressure in our
fleet that we can fail these allocations, and then often trip a
BUG_ON(!allocation), which isn't an ideal outcome. Begin to address
this by simply not using GFP_ATOMIC, which will allow us to do things
like actually allocate a extent state when doing
set_extent_bits(UPTODATE) in the endio handler.
End io handlers are not called in atomic context, besides we have been
allocating failrec with GFP_NOFS so we'd notice there's a problem.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
When committing a transaction, we will update block group items for all
dirty block groups.
But in fact, dirty block groups don't always need to update their block
group items.
It's pretty common to have a metadata block group which experienced
several COW operations, but still have the same amount of used bytes.
In that case, we may unnecessarily COW a tree block doing nothing.
[ENHANCEMENT]
This patch will introduce btrfs_block_group::commit_used member to
remember the last used bytes, and use that new member to skip
unnecessary block group item update.
This would be more common for large filesystems, where metadata block
group can be as large as 1GiB, containing at most 64K metadata items.
In that case, if COW added and then deleted one metadata item near the
end of the block group, then it's completely possible we don't need to
touch the block group item at all.
[BENCHMARK]
The change itself can have quite a high chance (20~80%) to skip block
group item updates in lot of workloads.
As a result, it would result shorter time spent on
btrfs_write_dirty_block_groups(), and overall reduce the execution time
of the critical section of btrfs_commit_transaction().
Here comes a fio command, which will do random writes in 4K block size,
causing a very heavy metadata updates.
fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
--ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
--name=random_write --fallocate=none --time_based --fsync_on_close=1
The file size (512M) and number of threads (4) means 2GiB file size in
total, but during the full 300s run time, my dedicated SATA SSD is able
to write around 20~25GiB, which is over 10 times the file size.
Thus after we fill the initial 2G, we should not cause much block group
item updates.
Please note, the fio numbers by themselves don't have much change, but
if we look deeper, there is some reduced execution time, especially for
the critical section of btrfs_commit_transaction().
I added extra trace_printk() to measure the following per-transaction
execution time:
- Critical section of btrfs_commit_transaction()
By re-using the existing update_commit_stats() function, which
has already calculated the interval correctly.
- The while() loop for btrfs_write_dirty_block_groups()
Although this includes the execution time of btrfs_run_delayed_refs(),
it should still be representative overall.
Both result involves transid 7~30, the same amount of transaction
committed.
The result looks like this:
| Before | After | Diff
----------------------+-------------------+----------------+--------
Transaction interval | 229247198.5 | 215016933.6 | -6.2%
Block group interval | 23133.33333 | 18970.83333 | -18.0%
The change in block group item updates is more obvious, as skipped block
group item updates also mean less delayed refs.
And the overall execution time for that block group update loop is
pretty small, thus we can assume the extent tree is already mostly
cached. If we can skip an uncached tree block, it would cause more
obvious change.
Unfortunately the overall reduction in commit transaction critical
section is much smaller, as the block group item updates loop is not
really the major part, at least not for the above fio script.
But still we have a observable reduction in the critical section.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The base transaction bits can be defined as bits in a contiguous
sequence, although right now there's a hole from bit 1 to 8.
The bits are used for btrfs_trans_handle::type, and there's another set
of TRANS_STATE_* defines that are for btrfs_transaction::state. They are
mutually exclusive though the hole in the sequence looks like was made
for the states.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The defines/enums are used only for tracepoints and are not part of the
on-disk format.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Define helper macro that can be used in enum {} to utilize the automatic
increment to define all bits without directly defining the values or
using additional linear bits.
1. capture the sequence value, N
2. use the value to define the given enum with N-th bit set
3. reset the sequence back to N
Use for enums that do not require fixed values for symbolic names (like
for on-disk structures):
enum {
ENUM_BIT(FIRST),
ENUM_BIT(SECOND),
ENUM_BIT(THIRD)
};
Where the values would be 0x1, 0x2 and 0x4.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
In theory init_btrfs_fs() and exit_btrfs_fs() should match their
sequence, thus normally they should look like this:
init_btrfs_fs() | exit_btrfs_fs()
----------------------+------------------------
init_A(); |
init_B(); |
init_C(); |
| exit_C();
| exit_B();
| exit_A();
So is for the error path of init_btrfs_fs().
But it's not the case, some exit functions don't match their init
functions sequence in init_btrfs_fs().
Furthermore in init_btrfs_fs(), we need to have a new error label for
each new init function we added. This is not really expandable,
especially recently we may add several new functions to init_btrfs_fs().
[ENHANCEMENT]
The patch will introduce the following things to enhance the situation:
- struct init_sequence
Just a wrapper of init and exit function pointers.
The init function must use int type as return value, thus some init
functions need to be updated to return 0.
The exit function can be NULL, as there are some init sequence just
outputting a message.
- struct mod_init_seq[] array
This is a const array, recording all the initialization we need to do
in init_btrfs_fs(), and the order follows the old init_btrfs_fs().
- bool mod_init_result[] array
This is a bool array, recording if we have initialized one entry in
mod_init_seq[].
The reason to split mod_init_seq[] and mod_init_result[] is to avoid
section mismatch in reference.
All init function are in .init.text, but if mod_init_seq[] records
the @initialized member it can no longer be const, thus will be put
into .data section, and cause modpost warning.
For init_btrfs_fs() we just call all init functions in their order in
mod_init_seq[] array, and after each call, setting corresponding
mod_init_result[] to true.
For exit_btrfs_fs() and error handling path of init_btrfs_fs(), we just
iterate mod_init_seq[] in reverse order, and skip all uninitialized
entry.
With this patch, init_btrfs_fs()/exit_btrfs_fs() will be much easier to
expand and will always follow the strict order.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of btrfs_tree_mod_log_insert_key() are now passing a GFP_NOFS
flag to it, so remove the flag from it and from alloc_tree_mod_elem() and
use it directly within alloc_tree_mod_elem().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When fixing up the first key of each node above the current level, at
fixup_low_keys(), we are doing a GFP_ATOMIC allocation for inserting an
operation record for the tree mod log. However we can do just fine with
GFP_NOFS nowadays. The need for GFP_ATOMIC was for the old days when we
had custom locks with spinning behaviour for extent buffers and we were
in spinning mode while at fixup_low_keys(). Now we use rw semaphores for
extent buffer locks, so we can safely use GFP_NOFS.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I have observed the following case play out and lead to unnecessary
relocations:
1. write a file across multiple block groups
2. delete the file
3. several block groups fall below the reclaim threshold
4. reclaim the first, moving extents into the others
5. reclaim the others which are now actually very full, leading to poor
reclaim behavior with lots of writing, allocating new block groups,
etc.
I believe the risk of missing some reasonable reclaims is worth it
when traded off against the savings of avoiding overfull reclaims.
Going forward, it could be interesting to make the check more advanced
(zoned aware, fragmentation aware, etc...) so that it can be a really
strong signal both at extent delete and reclaim time.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
As we delete extents from a block group, at some deletion we cross below
the reclaim threshold. It is possible we are still in the middle of
deleting more extents and might soon hit 0. If the block group is empty
by the time the reclaim worker runs, we will still relocate it.
This works just fine, as relocating an empty block group ultimately
results in properly deleting it. However, we have more direct ways of
removing empty block groups in the cleaner thread. Those are either
async discard or the unused_bgs list. In fact, when we decide whether to
relocate a block group during extent deletion, we do check for emptiness
and prefer the discard/unused_bgs mechanisms when possible.
Not using relocation for this case reduces some modest overhead from
empty bg relocation:
- extra transactions
- extra metadata use/churn for creating relocation metadata
- trying to read the extent tree to look for extents (and in this case
finding none)
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap, when determining if a data extent is shared or not, if we
don't find the extent is directly shared, then we need to determine if
it's shared through subtrees. For that we need to resolve the indirect
reference we found in order to figure out the path in the inode's fs tree,
which is a path starting at the fs tree's root node and going down to the
leaf that contains the file extent item that points to the data extent.
We then proceed to determine if any extent buffer in that path is shared
with other trees or not.
However when the generation of the data extent is more recent than the
last generation used to snapshot the root, we don't need to determine
the path, since the data extent can not be shared through snapshots.
For this case we currently still determine the leaf of that path (at
find_parent_nodes(), but then stop determining the other nodes in the
path (at btrfs_is_data_extent_shared()) as it's pointless.
So do the check of the data extent's generation earlier, at
find_parent_nodes(), before trying to resolve the indirect reference to
determine the leaf in the path. This saves us from doing one expensive
b+tree search in the fs tree of our target inode, as well as other minor
work.
The following test was run on a non-debug kernel (Debian's default kernel
config):
$ cat test-fiemap.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 extents, each with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
# Add some more files to increase the size of the fs and extent
# trees (in the real world there's a lot of files and extents
# from other files).
xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
echo
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"
umount $MNT
Before applying this patch:
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1285 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 742 milliseconds (metadata cached)
After applying this patch:
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 689 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 393 milliseconds (metadata cached)
That's a -46.4% total reduction for the metadata not cached case, and
a -47.0% reduction for the cached metadata case.
The test is somewhat limited in the sense the gains may be higher in
practice, because in the test the filesystem is small, so we have small
fs and extent trees, plus there's no concurrent access to the trees as
well, therefore no lock contention there.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap, when determining if a data extent is shared or not, if we
don't find the extent is directly shared, then we need to determine if
it's shared through subtrees. For that we need to resolve the indirect
reference we found in order to figure out the path in the inode's fs tree,
which is a path starting at the fs tree's root node and going down to the
leaf that contains the file extent item that points to the data extent.
We then proceed to determine if any extent buffer in that path is shared
with other trees or not.
Currently whenever we find the data extent that a file extent item points
to is not directly shared, we always resolve the path in the fs tree, and
then check if any extent buffer in the path is shared. This is a lot of
work and when we have file extent items that belong to the same leaf, we
have the same path, so we only need to calculate it once.
This change does that, it keeps track of the current and previous leaf,
and when we find that a data extent is not directly shared, we try to
compute the fs tree path only once and then use it for every other file
extent item in the same leaf, using the existing cached path result for
the leaf as long as the cache results are valid.
This saves us from doing expensive b+tree searches in the fs tree of our
target inode, as well as other minor work.
The following test was run on a non-debug kernel (Debian's default kernel
config):
$ cat test-with-snapshots.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 extents, each with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
# Add some more files to increase the size of the fs and extent
# trees (in the real world there's a lot of files and extents
# from other files).
xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
# Create a snapshot so all the extents become indirectly shared
# through subtrees, with a generation less than or equals to the
# generation used to create the snapshot.
btrfs subvolume snapshot -r $MNT $MNT/snap1
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
echo
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"
umount $MNT
Result before applying this patch:
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1204 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 729 milliseconds (metadata cached)
Result after applying this patch:
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 732 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 421 milliseconds (metadata cached)
That's a -46.1% total reduction for the metadata not cached case, and
a -42.2% reduction for the cached metadata case.
The test is somewhat limited in the sense the gains may be higher in
practice, because in the test the filesystem is small, so we have small
fs and extent trees, plus there's no concurrent access to the trees as
well, therefore no lock contention there.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the static functions to lookup and store sharedness check of an
extent buffer to a location above find_all_parents(), because in the
next patch the lookup function will be used by find_all_parents().
The store function is also moved just because it's the counter part
to the lookup function and it's best to have their definitions close
together.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap we process all the file extent items of an inode, by their
file offset order (left to right b+tree order), and then check if the data
extent they point at is shared or not. Until now we didn't cache those
results, we only did it for b+tree nodes/leaves since for each unique
b+tree path we have access to hundreds of file extent items. However, it
is also common to repeat checking the sharedness of a particular data
extent in a very short time window, and the cases that lead to that are
the following:
1) COW writes.
If have a file extent item like this:
[ bytenr X, offset = 0, num_bytes = 512K ]
file offset 0 512K
Then a 4K write into file offset 64K happens, we end up with the
following file extent item layout:
[ bytenr X, offset = 0, num_bytes = 64K ]
file offset 0 64K
[ bytenr Y, offset = 0, num_bytes = 4K ]
file offset 64K 68K
[ bytenr X, offset = 68K, num_bytes = 444K ]
file offset 68K 512K
So during fiemap we well check for the sharedness of the data extent
with bytenr X twice. Typically for COW writes and for at least
moderately updated files, we end up with many file extent items that
point to different sections of the same data extent.
2) Writing into a NOCOW file after a snapshot is taken.
This happens if the target extent was created in a generation older
than the generation where the last snapshot for the root (the tree the
inode belongs to) was made.
This leads to a scenario like the previous one.
3) Writing into sections of a preallocated extent.
For example if a file has the following layout:
[ bytenr X, offset = 0, num_bytes = 1M, type = prealloc ]
0 1M
After doing a 4K write into file offset 0 and another 4K write into
offset 512K, we get the following layout:
[ bytenr X, offset = 0, num_bytes = 4K, type = regular ]
0 4K
[ bytenr X, offset = 4K, num_bytes = 508K, type = prealloc ]
4K 512K
[ bytenr X, offset = 512K, num_bytes = 4K, type = regular ]
512K 516K
[ bytenr X, offset = 516K, num_bytes = 508K, type = prealloc ]
516K 1M
So we end up with 4 consecutive file extent items pointing to the data
extent at bytenr X.
4) Hole punching in the middle of an extent.
For example if a file has the following file extent item:
[ bytenr X, offset = 0, num_bytes = 8M ]
0 8M
And then hole is punched for the file range [4M, 6M[, we our file
extent item split into two:
[ bytenr X, offset = 0, num_bytes = 4M ]
0 4M
[ 2M hole, implicit or explicit depending on NO_HOLES feature ]
4M 6M
[ bytenr X, offset = 6M, num_bytes = 2M ]
6M 8M
Again, we end up with two file extent items pointing to the same
data extent.
5) When reflinking (clone and deduplication) within the same file.
This is probably the least common case of all.
In cases 1, 2, 4 and 4, when we have multiple file extent items that point
to the same data extent, their distance is usually short, typically
separated by a few slots in a b+tree leaf (or across sibling leaves). For
case 5, the distance can vary a lot, but it's typically the less common
case.
This change caches the result of the sharedness checks for data extents,
but only for the last 8 extents that we notice that our inode refers to
with multiple file extent items. Whenever we want to check if a data
extent is shared, we lookup the cache which consists of doing a linear
scan of an 8 elements array, and if we find the data extent there, we
return the result and don't check the extent tree and delayed refs.
The array/cache is small so that doing the search has no noticeable
negative impact on the performance in case we don't have file extent items
within a distance of 8 slots that point to the same data extent.
Slots in the cache/array are overwritten in a simple round robin fashion,
as that approach fits very well.
Using this simple approach with only the last 8 data extents seen is
effective as usually when multiple file extents items point to the same
data extent, their distance is within 8 slots. It also uses very little
memory and the time to cache a result or lookup the cache is negligible.
The following test was run on non-debug kernel (Debian's default kernel
config) to measure the impact in the case of COW writes (first example
given above), where we run fiemap after overwriting 33% of the blocks of
a file:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount $DEV $MNT
FILE_SIZE=$((1 * 1024 * 1024 * 1024))
# Create the file full of 1M extents.
xfs_io -f -s -c "pwrite -b 1M -S 0xab 0 $FILE_SIZE" $MNT/foobar
block_count=$((FILE_SIZE / 4096))
# Overwrite about 33% of the file blocks.
overwrite_count=$((block_count / 3))
echo -e "\nOverwriting $overwrite_count 4K blocks (out of $block_count)..."
RANDOM=123
for ((i = 1; i <= $overwrite_count; i++)); do
off=$(((RANDOM % block_count) * 4096))
xfs_io -c "pwrite -S 0xcd $off 4K" $MNT/foobar > /dev/null
echo -ne "\r$i blocks overwritten..."
done
echo -e "\n"
# Unmount and mount to clear all cached metadata.
umount $MNT
mount $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds"
umount $MNT
Result before applying this patch:
fiemap took 128 milliseconds
Result after applying this patch:
fiemap took 92 milliseconds (-28.1%)
The test is somewhat limited in the sense the gains may be higher in
practice, because in the test the filesystem is small, so we have small
fs and extent trees, plus there's no concurrent access to the trees as
well, therefore no lock contention there.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At find_parent_nodes(), at its last step, when iterating over all direct
references, we are checking if we have a share context and if we have
a reference with a different root from the one in the share context.
However that logic is pointless because of two reasons:
1) After the previous patch in the series (subject "btrfs: remove roots
ulist when checking data extent sharedness"), the roots argument is
always NULL when using a share check context (struct share_check), so
this code is never triggered;
2) Even before that previous patch, we could not hit this code because
if we had a reference with a root different from the one in our share
context, then we would have exited earlier when doing either of the
following:
- Adding a second direct ref to the direct refs red black tree
resulted in extent_is_shared() returning true when called from
add_direct_ref() -> add_prelim_ref(), after processing delayed
references or while processing references in the extent tree;
- When adding a second reference to the indirect refs red black
tree (same as above, extent_is_shared() returns true);
- If we only have one indirect reference and no direct references,
then when resolving it at resolve_indirect_refs() we immediately
return that the target extent is shared, therefore never reaching
that loop that iterates over all direct references at
find_parent_nodes();
- If we have 1 indirect reference and 1 direct reference, then we
also exit early because extent_is_shared() ends up returning true
when called through add_prelim_ref() (by add_direct_ref() or
add_indirect_ref()) or add_delayed_refs(). Same applies as when
having a combination of direct, indirect and indirect with missing
key references.
This logic had been obsoleted since commit 3ec4d3238a ("btrfs:
allow backref search checks for shared extents"), which introduced the
early exits in case an extent is shared.
So just remove that logic, and assert at find_parent_nodes() that when we
have a share context we don't have a roots ulist and that we haven't found
the extent to be directly shared after processing delayed references and
all references from the extent tree.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_is_data_extent_shared() is passing a ulist for the roots
argument of find_parent_nodes(), however it does not use that ulist for
anything and for this context that list always ends up with at most one
element.
Since find_parent_nodes() is able to deal with a NULL ulist for its roots
argument, make btrfs_is_data_extent_shared() pass it NULL and avoid the
burden of allocating memory for the unnused roots ulist, initializing it,
releasing it and allocating one struct ulist_node for it during the call
to find_parent_nodes().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When calling btrfs_is_data_extent_shared() we pass two ulists that were
allocated by the caller. This is because the single caller, fiemap, calls
btrfs_is_data_extent_shared() multiple times and the ulists can be reused,
instead of allocating new ones before each call and freeing them after
each call.
Now that we have a context structure/object that we pass to
btrfs_is_data_extent_shared(), we can move those ulists to it, and hide
their allocation and the context's allocation in a helper function, as
well as the freeing of the ulists and the context object. This allows to
reduce the number of parameters passed to btrfs_is_data_extent_shared(),
the need to pass the ulists from extent_fiemap() to fiemap_process_hole()
and having the caller deal with allocating and releasing the ulists.
Also rename one of the ulists from 'tmp' / 'tmp_ulist' to 'refs', since
that's a much better name as it reflects what the list is used for (and
matching the argument name for find_parent_nodes()).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now we are using a struct btrfs_backref_shared_cache to pass state
across multiple btrfs_is_data_extent_shared() calls. The structure's name
closely follows its current purpose, which is to cache previous checks
for the sharedness of metadata extents. However we will start using the
structure for more things other than caching sharedness checks, so rename
it to struct btrfs_backref_share_check_ctx.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we pass a root and an inode number as arguments for
btrfs_is_data_extent_shared() and the inode number is always from an
inode that belongs to that root (it wouldn't make sense otherwise).
In every context that we call btrfs_is_data_extent_shared() (fiemap only),
we have an inode available, so directly pass the inode to the function
instead of a root and inode number. This reduces the number of parameters
and it makes the function's signature conform to most other functions we
have.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing backref walking to determine if an extent is shared, we are
testing if the inode number, stored in the 'inum' field of struct
share_check, is 0. However that can never be case, since the all instances
of the structure are created at btrfs_is_data_extent_shared(), which
always initializes it with the inode number from a fs tree (and the number
for any inode from any tree can never be 0). So remove the checks.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing backref walking to determine if an extent is shared, we are
testing the root_objectid of the given share_check struct is 0, but that
is an impossible case, since btrfs_is_data_extent_shared() always
initializes the root_objectid field with the id of the given root, and
no root can have an objectid of 0. So remove those checks.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When allocating an extent buffer, at __alloc_extent_buffer(), there's no
point in explicitly assigning zero to the bflags field of the new extent
buffer because we allocated it with kmem_cache_zalloc().
So just remove the redundant initialization, it saves one mov instruction
in the generated assembly code for x86_64 ("movq $0x0,0x10(%rax)").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_clone_extent_buffer(), before allocating the pages array for the
new extent buffer we are calling memset() to zero out the pages array of
the extent buffer. This is pointless however, because the extent buffer
already has every element in its pages array pointing to NULL, as it was
allocated with kmem_cache_zalloc(). The memset() was introduced with
commit dd137dd1f2 ("btrfs: factor out allocating an array of pages"),
but even before that commit we already depended on the pages array being
initialized to NULL for the error paths that need to call
btrfs_release_extent_buffer().
So remove the memset(), it's useless and slightly increases the object
text size.
Before this change:
$ size fs/btrfs/extent_io.o
text data bss dec hex filename
70580 5469 40 76089 12939 fs/btrfs/extent_io.o
After this change:
$ size fs/btrfs/extent_io.o
text data bss dec hex filename
70564 5469 40 76073 12929 fs/btrfs/extent_io.o
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap and lseek (hole and data seeking), there's no point in
iterating the inode's io tree to count delalloc bits if the inode's
delalloc bytes counter has a value of zero, as that counter is updated
whenever we set a range for delalloc or clear a range from delalloc.
So skip the counting and io tree iteration if the inode's delalloc bytes
counter has a value of zero. This helps save time when processing a file
range corresponding to a hole or prealloc (unwritten) extent.
This patch is part of a series comprised of the following patches:
btrfs: get the next extent map during fiemap/lseek more efficiently
btrfs: skip unnecessary extent map searches during fiemap and lseek
btrfs: skip unnecessary delalloc search during fiemap and lseek
The following test was performed on a release kernel (Debian's default
kernel config) before and after applying those 3 patches.
# Wrapper to call fiemap in extent count only mode.
# (struct fiemap::fm_extent_count set to 0)
$ cat fiemap.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>
int main(int argc, char **argv)
{
struct fiemap fiemap = { 0 };
int fd;
if (argc != 2) {
printf("usage: %s <path>\n", argv[0]);
return 1;
}
fd = open(argv[1], O_RDONLY);
if (fd < 0) {
fprintf(stderr, "error opening file: %s\n",
strerror(errno));
return 1;
}
/* fiemap.fm_extent_count set to 0, to count extents only. */
fiemap.fm_length = FIEMAP_MAX_OFFSET;
if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
fprintf(stderr, "fiemap error: %s\n",
strerror(errno));
return 1;
}
close(fd);
printf("fm_mapped_extents = %d\n", fiemap.fm_mapped_extents);
return 0;
}
$ gcc -o fiemap fiemap.c
And the wrapper shell script that creates a file with many holes and runs
fiemap against it:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
FILE_SIZE=$((1 * 1024 * 1024 * 1024))
echo -n > $MNT/foobar
for ((off = 0; off < $FILE_SIZE; off += 8192)); do
xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null
done
# flush all delalloc
sync
start=$(date +%s%N)
./fiemap $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds"
umount $MNT
Result before applying patchset:
fm_mapped_extents = 131072
fiemap took 63 milliseconds
Result after applying patchset:
fm_mapped_extents = 131072
fiemap took 39 milliseconds (-38.1%)
Running the same test for a 512M file instead of a 1G file, gave the
following results.
Result before applying patchset:
fm_mapped_extents = 65536
fiemap took 29 milliseconds
Result after applying patchset:
fm_mapped_extents = 65536
fiemap took 20 milliseconds (-31.0%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have no outstanding extents it means we don't have any extent maps
corresponding to delalloc that is flushing, as when an ordered extent is
created we increment the number of outstanding extents to 1 and when we
remove the ordered extent we decrement them by 1. So skip extent map tree
searches if the number of outstanding ordered extents is 0, saving time as
the tree is not empty if we have previously made some reads or flushed
delalloc, as in those cases it can have a very large number of extent maps
for files with many extents.
This helps save time when processing a file range corresponding to a hole
or prealloc (unwritten) extent.
The next patch in the series has a performance test in its changelog and
its subject is:
"btrfs: skip unnecessary delalloc search during fiemap and lseek"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At find_delalloc_subrange(), when we need to get the next extent map, we
do a full search on the extent map tree (a red black tree). This is fine
but it's a lot more efficient to simply use rb_next(), which typically
requires iterating over less nodes of the tree and never needs to compare
the ranges of nodes with the one we are looking for.
So add a public helper to extent_map.{h,c} to get the extent map that
immediately follows another extent map, using rb_next(), and use that
helper at find_delalloc_subrange().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For Btrfs RAID56, we have a caching system for btrfs raid bios (rbio).
We call cache_rbio_pages() to mark a qualified rbio ready for cache.
The timing happens at:
- finish_rmw()
At this timing, we have already read all necessary sectors, along with
the rbio sectors, we have covered all data stripes.
- __raid_recover_end_io()
At this timing, we have rebuild the rbio, thus all data sectors
involved (either from stripe or bio list) are uptodate now.
Thus at the timing of cache_rbio_pages(), we should have all data
sectors uptodate.
This patch will make it explicit that all data sectors are uptodate at
cache_rbio_pages() timing, mostly to prepare for the incoming
verification at RMW time.
This patch will add:
- Extra ASSERT()s in cache_rbio_pages()
This is to make sure all data sectors, which are not covered by bio,
are already uptodate.
- Extra ASSERT()s in steal_rbio()
Since only cached rbio can be stolen, thus every data sector should
already be uptodate in the source rbio.
- Update __raid_recover_end_io() to update recovered sector->uptodate
Previously __raid_recover_end_io() will only mark failed sectors
uptodate if it's doing an RMW.
But this can trigger new ASSERT()s, as for recovery case, a recovered
failed sector will not be marked uptodate, and trigger ASSERT() in
later cache_rbio_pages() call.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently inside alloc_rbio(), we allocate a larger memory to contain
the following members:
- struct btrfs_raid_rbio itself
- stripe_pages array
- bio_sectors array
- stripe_sectors array
- finish_pointers array
Then update rbio pointers to point the extra space after the rbio
structure itself.
Thus it introduced a complex CONSUME_ALLOC() macro to help the thing.
This is too hacky, and is going to make later pointers expansion harder.
This patch will change it to use regular kcalloc() for each pointer
inside btrfs_raid_bio, making the later expansion much easier.
And introduce a helper free_raid_bio_pointers() to free up all the
pointer members in btrfs_raid_bio, which will be used in both
free_raid_bio() and error path of alloc_rbio().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The cleanup involves two things:
- Remove the "__" prefix
There is no naming confliction.
- Remove the forward declaration
There is no special function call involved.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside of FB, as well as some user reports, we've had a consistent
problem of occasional ENOSPC transaction aborts. Inside FB we were
seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really
low occurrence rate given the size of our fleet, but it's not nothing.
There are two causes of this particular problem.
First is delayed allocation. The reservation system for delalloc
assumes that contiguous dirty ranges will result in 1 file extent item.
However if there is memory pressure that results in fragmented writeout,
or there is fragmentation in the block groups, this won't necessarily be
true. Consider the case where we do a single 256MiB write to a file and
then close it. We will have 1 reservation for the inode update, the
reservations for the checksum updates, and 1 reservation for the file
extent item. At some point later we decide to write this entire range
out, but we're so fragmented that we break this into 100 different file
extents. Since we've already closed the file and are no longer writing
to it there's nothing to trigger a refill of the delalloc block rsv to
satisfy the 99 new file extent reservations we need. At this point we
exhaust our delalloc reservation, and we begin to steal from the global
reserve. If you have enough of these cases going in parallel you can
easily exhaust the global reserve, get an ENOSPC at
btrfs_alloc_tree_block() time, and then abort the transaction.
The other case is the delayed refs reserve. The delayed refs reserve
updates its size based on outstanding delayed refs and dirty block
groups. However we only refill this block reserve when returning
excess reservations and when we call btrfs_start_transaction(root, X).
We will reserve 2*X credits at transaction start time, and fill in X
into the delayed refs reserve to make sure it stays topped off.
Generally this works well, but clearly has downsides. If we do a
particularly delayed ref heavy operation we may never catch up in our
reservations. Additionally running delayed refs generates more delayed
refs, and at that point we may be committing the transaction and have no
way to trigger a refill of our delayed refs rsv. Then a similar thing
occurs with the delalloc reserve.
Generally speaking we well over-reserve in all of our block rsvs. If we
reserve 1 credit we're usually reserving around 264k of space, but we'll
often not use any of that reservation, or use a few blocks of that
reservation. We can be reasonably sure that as long as you were able to
reserve space up front for your operation you'll be able to find space
on disk for that reservation.
So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY. This
gets used in the case that we've exhausted our reserve and the global
reserve. It simply forces a reservation if we have enough actual space
on disk to make the reservation, which is almost always the case. This
keeps us from hitting ENOSPC aborts in these odd occurrences where we've
not kept up with the delayed work.
Fixing this in a complete way is going to be relatively complicated and
time consuming. This patch is what I discussed with Filipe earlier this
year, and what I put into our kernels inside FB. With this patch we're
down to 1-2 ENOSPC aborts per week, which is a significant reduction.
This is a decent stop gap until we can work out a more wholistic
solution to these two corner cases.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These are wrapped in CONFIG_FS_VERITY, but we can have the definitions
without verity enabled. Move these definitions up with the other
accessor helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This uses btrfs_header_nritems, which I will be moving out of ctree.h.
In order to avoid needing to include the relevant header in ctree.h,
simply move this helper function into ctree.c.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename parameters ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is local to the free-space-cache.c code, remove it from ctree.h and
inode.c, create new init/exit functions for the cachep, and move it
locally to free-space-cache.c.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is local to the ctree code, remove it from ctree.h and inode.c,
create new init/exit functions for the cachep, and move it locally to
ctree.c.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is local to the transaction code, remove it from ctree.h and
inode.c, create new helpers in the transaction to handle the init work
and move the cachep locally to transaction.c.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This isn't used outside of inode.c, there's no reason to define it in
btrfs_inode.h. Drop the inline and add __cold as it's for errors that
are not in any hot path.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This code is used in space-info.c, move the definitions to space-info.h.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function uses functions that are not defined in block-group.h, move
it into block-group.c in order to keep the header clean.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These definitions are used for discard statistics, move them out of
ctree.h and put them in free-space-cache.h.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is only used locally in scrub.c, move it out of ctree.h into
scrub.c.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have maximum link and name length limits, move these to btrfs_tree.h
as they're on disk limitations.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
This inline helper calls btrfs_fs_compat_ro(), which is defined in
another header. To avoid weird header dependency problems move this
helper into disk-io.c with the rest of the global root helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bulk of our on-disk definitions exist in btrfs_tree.h, which user
space can use. Keep things consistent and move the rest of the on disk
definitions out of ctree.h into btrfs_tree.h. Note I did have to update
all u8's to __u8, but otherwise this is a strict copy and paste.
Most of the definitions are mainly for internal use and are not
guaranteed stable public API and may change as we need. Compilation
failures by user applications can happen.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
The last user of this definition was removed in patch f26c923860
("btrfs: remove reada infrastructure") so we can remove this definition.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This hasn't been used since 138a12d865 ("btrfs: rip out
btrfs_space_info::total_bytes_pinned") so it is safe to remove.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The last users of these helpers were removed in 5297199a8b ("btrfs:
remove inode number cache feature") so delete these helpers.
The point was for mount options that were applicable after transaction
commit so they could not be applied immediately. We don't have such
options anymore and if we do the patch can be reverted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since leaf is already NULL, and no other branch will go to fail_unlock,
the fail_unlock label is useless and can be removed
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't use a cached state here at all, which generally makes sense as
async reads are going to unlock at endio time. However for blocking
reads we will call wait_extent_bit() for our range. Since the
lock_extent() stuff will return the cached_state for the start of the
range this is a helpful optimization to have for this case, we'll have
the exact state we want to wait on. Add a cached state here and simply
throw it away if we're a non-blocking read, otherwise we'll get a small
improvement by eliminating some tree searches.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if we fail to lock a range we'll return the start of the range
that we failed to lock. We'll then search down to this range and wait
on any extent states in this range.
However we can avoid this search altogether if we simply cache the
extent_state that had the contention. We can pass this into
wait_extent_bit() and start from that extent_state without doing the
search. In the most optimistic case we can avoid all searches, more
likely we'll avoid the initial search and have to perform the search
after we wait on the failed state, or worst case we must search both
times which is what currently happens.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of the relocation code avoids using the cached state, despite
everywhere using the normal
lock_extent()
// do something
unlock_extent()
pattern. Fix this by plumbing a cached state throughout all of these
functions in order to allow for less tree searches.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that try_lock_extent() takes a cached_state, plumb the cached_state
through btrfs_try_lock_ordered_range() and then use a cached_state in
btrfs_check_nocow_lock everywhere to avoid extra tree searches on the
extent_io_tree.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With nowait becoming more pervasive throughout our codebase go ahead and
add a cached_state to try_lock_extent(). This allows us to be faster
about clearing the locked area if we have contention, and then gives us
the same optimization for unlock if we are able to lock the range.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* kvm-arm64/mte-map-shared:
: .
: Update the MTE support to allow the VMM to use shared mappings
: to back the memslots exposed to MTE-enabled guests.
:
: Patches courtesy of Catalin Marinas and Peter Collingbourne.
: .
: Fix a number of issues with MTE, such as races on the tags
: being initialised vs the PG_mte_tagged flag as well as the
: lack of support for VM_SHARED when KVM is involved.
:
: Patches from Catalin Marinas and Peter Collingbourne.
: .
Documentation: document the ABI changes for KVM_CAP_ARM_MTE
KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE enabled
KVM: arm64: unify the tests for VMAs in memslots when MTE is enabled
arm64: mte: Lock a page for MTE tag initialisation
mm: Add PG_arch_3 page flag
KVM: arm64: Simplify the sanitise_mte_tags() logic
arm64: mte: Fix/clarify the PG_mte_tagged semantics
mm: Do not enable PG_arch_2 for all 64-bit architectures
Signed-off-by: Marc Zyngier <maz@kernel.org>
Some discs containing the UDF file system are unable to be mounted,
failing with the following message:
UDF-fs: error (device sr0): udf_fill_super: minUDFReadRev=260
(max is 250)
The UDF 2.60 specification [0] states in the section Basic Restrictions
& Requirements (page 10):
The Minimum UDF Read Revision value shall be at most #0250 for all
media with a UDF 2.60 file system. This indicates that a UDF 2.50
implementation can read all UDF 2.60 media. Media that do not have a
Metadata Partition may use a value lower than #250.
The conclusion is that the discs failing to mount were burned with a
faulty software, which didn't follow the specification. This can be
worked around by increasing UDF_MAX_READ_VERSION to 0x260, to match the
Minimum Read Revision. No other changes are required, as reading UDF
2.60 is backward compatible with UDF 2.50.
[0] http://www.osta.org/specs/pdf/udf260.pdf
Signed-off-by: Bartosz Taudul <wolf@nereid.pl>
Signed-off-by: Jan Kara <jack@suse.cz>
./fs/xfs/xfs_iomap.c: xfs_error.h is included more than once.
./fs/xfs/xfs_iomap.c: xfs_errortag.h is included more than once.
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3337
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In the if (dev_of_node(dev) && !pdata) path, the "err" may be assigned a
value of 0, so the error return code -EINVAL may be incorrectly set
to 0. To fix set valid return code before calling to goto.
Fixes: 35da60941e ("pstore/ram: add Device Tree bindings")
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/1669969374-46582-1-git-send-email-wangyufen@huawei.com
UBSAN reported a shift-out-of-bounds warning:
left shift of 1 by 31 places cannot be represented in type 'int'
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
ubsan_epilogue+0xa/0x44 lib/ubsan.c:151
__ubsan_handle_shift_out_of_bounds+0x1e7/0x208 lib/ubsan.c:322
check_special_flags fs/binfmt_misc.c:241 [inline]
create_entry fs/binfmt_misc.c:456 [inline]
bm_register_write+0x9d3/0xa20 fs/binfmt_misc.c:654
vfs_write+0x11e/0x580 fs/read_write.c:582
ksys_write+0xcf/0x120 fs/read_write.c:637
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x34/0x80 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x4194e1
Since the type of Node's flags is unsigned long, we should define these
macros with same type too.
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221102025123.1117184-1-liushixin2@huawei.com
address post-6.0 issues, which is hopefully a sign that things are
converging.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY4pQpQAKCRDdBJ7gKXxA
jquxAP9Lqif7CGDgdq8uWY2hHS/Ujc3k7Ohgyzs37olnCuU8KwEA6/J7SpjsBgtY
OfzvnwxpCTh8Kfzu/oNckIHo/EEiIA8=
=o6qT
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"15 hotfixes, 11 marked cc:stable.
Only three or four of the latter address post-6.0 issues, which is
hopefully a sign that things are converging"
* tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible"
Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled
drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame
mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths
mm/khugepaged: fix GUP-fast interaction by sending IPI
mm/khugepaged: take the right locks for page table retraction
mm: migrate: fix THP's mapcount on isolation
mm: introduce arch_has_hw_nonleaf_pmd_young()
mm: add dummy pmd_young() for architectures not having it
mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes()
tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep"
nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()
hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
madvise: use zap_page_range_single for madvise dontneed
mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
One-element arrays are deprecated, and we are replacing them with flexible
array members instead. So, replace one-element arrays with flexible-array
members in multiple structs in fs/ksmbd/smb_common.h and one in
fs/ksmbd/smb2pdu.h.
Important to mention is that doing a build before/after this patch results
in no binary output differences.
This helps with the ongoing efforts to tighten the FORTIFY_SOURCE routines
on memcpy() and help us make progress towards globally enabling
-fstrict-flex-arrays=3 [1].
Link: https://github.com/KSPP/linux/issues/242
Link: https://github.com/KSPP/linux/issues/79
Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [1]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/Y3OxronfaPYv9qGP@work
While doing fault injection test, I got the following report:
------------[ cut here ]------------
kobject: '(null)' (0000000039956980): is not initialized, yet kobject_put() is being called.
WARNING: CPU: 3 PID: 6306 at kobject_put+0x23d/0x4e0
CPU: 3 PID: 6306 Comm: 283 Tainted: G W 6.1.0-rc2-00005-g307c1086d7c9 #1253
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:kobject_put+0x23d/0x4e0
Call Trace:
<TASK>
cdev_device_add+0x15e/0x1b0
__iio_device_register+0x13b4/0x1af0 [industrialio]
__devm_iio_device_register+0x22/0x90 [industrialio]
max517_probe+0x3d8/0x6b4 [max517]
i2c_device_probe+0xa81/0xc00
When device_add() is injected fault and returns error, if dev->devt is not set,
cdev_add() is not called, cdev_del() is not needed. Fix this by checking dev->devt
in error path.
Fixes: 233ed09d7f ("chardev: add helper function to register char devs with a struct device")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20221202030237.520280-1-yangyingliang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When gfs2_create_inode() fails after creating a new inode, it uses the
GIF_FREE_VFS_INODE and GIF_ALLOC_FAILED inode flags to communicate to
gfs2_evict_inode() which parts of the inode need to be deallocated and
destroyed. In some error cases, the inode ends up being allocated on
disk and then accidentally left behind. In others, the inode is
partially constructed and then not properly destroyed. Clean this up by
completely handling the inode deallocation and destruction in
gfs2_evict_inode().
This means that gfs2_evict_inode() may now be faced with partially
constructed inodes, so add the necessary checks to cope with that. In
particular, make sure that for incompletely constructed inodes, we're
not accessing the buffers backing the on-disk blocks; the contents may
be undefined.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Initialize variable "ip" earlier so that it can be used interchangeably
with "inode" everywhere.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_create_inode, get rid of the ghs array in favor of two separate
variables. This makes the code much less irritating.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
We have reserved the number of blocks we want to allocate, so the actual
allocation isn't expected to fail. Nevertheless, make the code behave
correctly even when things go wrong.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The console_lock is used in part to guarantee safe list iteration.
The console_list_lock should be used because list synchronization
responsibility will be removed from the console_lock in a later
change.
Note, the console_lock is still needed to serialize the device()
callback with other console operations.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-35-john.ogness@linutronix.de
The console_lock is held throughout the start/show/stop procedure
to print out device/driver information about all registered
consoles. Since the console_lock is being used for multiple reasons,
explicitly document these reasons. This will be useful when the
console_lock is split into fine-grained locking.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-11-john.ogness@linutronix.de
Replace the open coded single linked list with a hlist so a conversion
to SRCU protected list walks can reuse the existing primitives.
Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20221116162152.193147-3-john.ogness@linutronix.de
The type should be struct posix_acl * instead of void *.
Cc: Christian Brauner <brauner@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Add support for XTS and CTS mode variant of SM4 algorithm. The former is
used to encrypt file contents, while the latter (SM4-CTS-CBC) is used to
encrypt filenames.
SM4 is a symmetric algorithm widely used in China, and is even mandatory
algorithm in some special scenarios. We need to provide these users with
the ability to encrypt files or disks using SM4-XTS.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221201125819.36932-3-tianjia.zhang@linux.alibaba.com
While investigating test failures in xfs/17[1-3] in alwayscow mode, I
noticed through code inspection that xfs_bmap_alloc_userdata isn't
setting XFS_ALLOC_USERDATA when allocating extents for a file's CoW
fork. COW staging extents should be flagged as USERDATA, since user
data are persisted to these blocks before being remapped into a file.
This mis-classification has a few impacts on the behavior of the system.
First, the filestreams allocator is supposed to keep allocating from a
chosen AG until it runs out of space in that AG. However, it only does
that for USERDATA allocations, which means that COW allocations aren't
tied to the filestreams AG. Fortunately, few people use filestreams, so
nobody's noticed.
A more serious problem is that xfs_alloc_ag_vextent_small looks for a
buffer to invalidate *if* the USERDATA flag is set and the AG is so full
that the allocation had to come from the AGFL because the cntbt is
empty. The consequences of not invalidating the buffer are severe --
if the AIL incorrectly checkpoints a buffer that is now being used to
store user data, that action will clobber the user's written data.
Fix filestreams and yet another data corruption vector by flagging COW
allocations as USERDATA.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs_btree_check_block contains debugging knobs. With XFS_DEBUG setting up,
turn on the debugging knob can trigger the assert of xfs_btree_islastblock,
test script as follows:
while true
do
mount $disk $mountpoint
fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null
echo 1 > /sys/fs/xfs/sda/errortag/btree_chk_sblk
sleep 10
umount $mountpoint
done
Kick off fsstress and only *then* turn on the debugging knob. If it
happens that the knob gets turned on after the cntbt lookup succeeds
but before the call to xfs_btree_islastblock, then we *can* end up in
the situation where a previously checked btree block suddenly starts
returning EFSCORRUPTED from xfs_btree_check_block. Kaboom.
Darrick give a very detailed explanation as follows:
Looking back at commit 27d9ee577d, I think the point of all this was
to make sure that the cursor has actually performed a lookup, and that
the btree block at whatever level we're asking about is ok.
If the caller hasn't ever done a lookup, the bc_levels array will be
empty, so cur->bc_levels[level].bp pointer will be NULL. The call to
xfs_btree_get_block will crash anyway, so the "ASSERT(block);" part is
pointless.
If the caller did a lookup but the lookup failed due to block
corruption, the corresponding cur->bc_levels[level].bp pointer will also
be NULL, and we'll still crash. The "ASSERT(xfs_btree_check_block);"
logic is also unnecessary.
If the cursor level points to an inode root, the block buffer will be
incore, so it had better always be consistent.
If the caller ignores a failed lookup after a successful one and calls
this function, the cursor state is garbage and the assert wouldn't have
tripped anyway. So get rid of the assert.
Fixes: 27d9ee577d ("xfs: actually check xfs_btree_check_block return in xfs_btree_islastblock")
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
This series fixes a bug in the refcount code where we don't merge
records correctly if the refcount is hovering around MAXREFCOUNT. This
fixes regressions in xfs/179 when fsdax is enabled. xfs/179 itself will
be modified to exploit the bug through the pagecache path.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmOI5VUACgkQ+H93GTRK
tOsf+A/9GsOyXhF2dAPt2zvDpoiHvR/y42/LQUoKah2cxUXOhfPIu53xoRgNgNOh
1O3D1s9HjudI1Bo6/R+xeijq/upKGmKmOJL52vULy8UqsbDLvV8RHQ5YfHbIIghW
HiYsD3aX5qhTKr7KHMrJzZcokSTAIGgIAT3QK7X0BrWUSXK12BN53A5D3XSFEMPu
K1rws+XEZkOlgNti1DqTrRa6OdU+Y0CKOobNciNGTEBi6Gg4QFaudoobrBAYGNpQ
R2icLAJG2lxlv3VRr2GqfgZ4XFh+q8TmXfb0tF+/rS/XWeDoElci+t4zpyDyeetH
H4IdfXuiJMw9UUj8h7N299qRYlUFhfqfwMs/9FBU+W8UTV9ke6796r+/3ARcUino
GoHwNpXEQZHatU5zVj1v2/6yVOCLapoz1WHa3pkbvwBGPKvWS2DGs0Mx5W28b/QE
M/uRj/CL4re0lEqsnNHu0J2tMiLb5A33pkdAZTfsoOwXdjpZdiXCuUHJzD71UDb8
hePXm5c9XBSSLIV44bpgBCDqu7acybENRUXJ2+1kBebxaKEpH8nrPKIuHRzX2Ulh
KfCZwwMLR24AYYHcFRsb8p4QtBaSu0jF4WJQJoggSsHErQfDdNDfTr5nL86kF+3P
zp9/8WOi4ZfRGAMOS/kTjZRFnG+hDE+EqvAzuNyf9vlytF84KZU=
=qgYH
-----END PGP SIGNATURE-----
Merge tag 'maxrefcount-fixes-6.2_2022-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.2-mergeD
xfs: fix broken MAXREFCOUNT handling
This series fixes a bug in the refcount code where we don't merge
records correctly if the refcount is hovering around MAXREFCOUNT. This
fixes regressions in xfs/179 when fsdax is enabled. xfs/179 itself will
be modified to exploit the bug through the pagecache path.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'maxrefcount-fixes-6.2_2022-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
xfs: estimate post-merge refcounts correctly
xfs: hoist refcount record merge predicates
Upon enabling fsdax + reflink for XFS, xfs/179 began to report refcount
metadata corruptions after being run. Specifically, xfs_repair noticed
single-block refcount records that could be combined but had not been.
The root cause of this is improper MAXREFCOUNT edge case handling in
xfs_refcount_merge_extents. When we're trying to find candidates for a
refcount btree record merge, we compute the refcount attribute of the
merged record, but we fail to account for the fact that once a record
hits rc_refcount == MAXREFCOUNT, it is pinned that way forever. Hence
the computed refcount is wrong, and we fail to merge the extents.
Fix this by adjusting the merge predicates to compute the adjusted
refcount correctly.
Fixes: 3172725814 ("xfs: adjust refcount of an extent of blocks in refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>
Hoist these multiline conditionals into separate static inline helpers
to improve readability and set the stage for corruption fixes that will
be introduced in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>
we might want to know why jbd2 thread using high io for detail,
split ext4_journal_start trace to ext4_journal_start_sb and
ext4_journal_start_inode, show ino and handle type when possible.
Signed-off-by: changfengnan <changfengnan@bytedance.com>
Link: https://lore.kernel.org/r/20221008120518.74870-1-changfengnan@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Before the commit 461c3af045 ("ext4: Change handle_mount_opt() to use
fs_parameter") ext4 mount option journal_path did follow links in the
provided path.
Bring this behavior back by allowing to pass pathwalk flags to
fs_lookup_param().
Fixes: 461c3af045 ("ext4: Change handle_mount_opt() to use fs_parameter")
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20221004135803.32283-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Return value directly from ext4_group_extend_no_check()
instead of getting value from redundant variable err.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Jinpeng Cui <cui.jinpeng2@zte.com.cn>
Link: https://lore.kernel.org/r/20220831160843.305836-1-cui.jinpeng2@zte.com.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In do_writepages, if the value returned by ext4_writepages is "-ENOMEM"
and "wbc->sync_mode == WB_SYNC_ALL", retry until the condition is not met.
In __ext4_get_inode_loc, if the bh returned by sb_getblk is NULL,
the function returns -ENOMEM.
In __getblk_slow, if the return value of grow_buffers is less than 0,
the function returns NULL.
When the three processes are connected in series like the following stack,
an infinite loop may occur:
do_writepages <--- keep retrying
ext4_writepages
mpage_map_and_submit_extent
mpage_map_one_extent
ext4_map_blocks
ext4_ext_map_blocks
ext4_ext_handle_unwritten_extents
ext4_ext_convert_to_initialized
ext4_split_extent
ext4_split_extent_at
__ext4_ext_dirty
__ext4_mark_inode_dirty
ext4_reserve_inode_write
ext4_get_inode_loc
__ext4_get_inode_loc <--- return -ENOMEM
sb_getblk
__getblk_gfp
__getblk_slow <--- return NULL
grow_buffers
grow_dev_page <--- return -ENXIO
ret = (block < end_block) ? 1 : -ENXIO;
In this issue, bg_inode_table_hi is overwritten as an incorrect value.
As a result, `block < end_block` cannot be met in grow_dev_page.
Therefore, __ext4_get_inode_loc always returns '-ENOMEM' and do_writepages
keeps retrying. As a result, the writeback process is in the D state due
to an infinite loop.
Add a check on inode table block in the __ext4_get_inode_loc function by
referring to ext4_read_inode_bitmap to avoid this infinite loop.
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220817132701.3015912-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In anticipation of putting random seeds in EFI variables, it's important
that the random GUID namespace of variables remains hidden from
userspace. We accomplish this by not populating efivarfs with entries
from that GUID, as well as denying the creation of new ones in that
GUID.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Fix the following W=1 kernel build warning(s):
fs/fat/nfs.c:21: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
fs/fat/nfs.c:139: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
Link: https://lkml.kernel.org/r/20221111075648.4005-1-liubo03@inspur.com
Signed-off-by: Bo Liu <liubo03@inspur.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is a memory leak reported by kmemleak:
unreferenced object 0xffff88810cc65e60 (size 32):
comm "mount.ocfs2", pid 23753, jiffies 4302528942 (age 34735.105s)
hex dump (first 32 bytes):
10 00 00 00 00 00 00 00 00 01 01 01 01 01 01 01 ................
01 01 01 01 01 01 01 01 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8170f73d>] __kmalloc+0x4d/0x150
[<ffffffffa0ac3f51>] ocfs2_compute_replay_slots+0x121/0x330 [ocfs2]
[<ffffffffa0b65165>] ocfs2_check_volume+0x485/0x900 [ocfs2]
[<ffffffffa0b68129>] ocfs2_mount_volume.isra.0+0x1e9/0x650 [ocfs2]
[<ffffffffa0b7160b>] ocfs2_fill_super+0xe0b/0x1740 [ocfs2]
[<ffffffff818e1fe2>] mount_bdev+0x312/0x400
[<ffffffff819a086d>] legacy_get_tree+0xed/0x1d0
[<ffffffff818de82d>] vfs_get_tree+0x7d/0x230
[<ffffffff81957f92>] path_mount+0xd62/0x1760
[<ffffffff81958a5a>] do_mount+0xca/0xe0
[<ffffffff81958d3c>] __x64_sys_mount+0x12c/0x1a0
[<ffffffff82f26f15>] do_syscall_64+0x35/0x80
[<ffffffff8300006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
This call stack is related to two problems. Firstly, the ocfs2 super uses
"replay_map" to trace online/offline slots, in order to recover offline
slots during recovery and mount. But when ocfs2_truncate_log_init()
returns an error in ocfs2_mount_volume(), the memory of "replay_map" will
not be freed in error handling path. Secondly, the memory of "replay_map"
will not be freed if d_make_root() returns an error in ocfs2_fill_super().
But the memory of "replay_map" will be freed normally when completing
recovery and mount in ocfs2_complete_mount_recovery().
Fix the first problem by adding error handling path to free "replay_map"
when ocfs2_truncate_log_init() fails. And fix the second problem by
calling ocfs2_free_replay_slots(osb) in the error handling path
"out_dismount". In addition, since ocfs2_free_replay_slots() is static,
it is necessary to remove its static attribute and declare it in header
file.
Link: https://lkml.kernel.org/r/20221109074627.2303950-1-lizetao1@huawei.com
Fixes: 9140db04ef ("ocfs2: recover orphans in offline slots during recovery and mount")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The simple attribute files do not accept a negative value since the commit
488dac0c92 ("libfs: fix error cast of negative value in
simple_attr_write()"), so we have to use a 64-bit value to write a
negative value for a debugfs file created by debugfs_create_atomic_t().
This restores the previous behaviour by introducing
DEFINE_DEBUGFS_ATTRIBUTE_SIGNED for a signed value.
Link: https://lkml.kernel.org/r/20220919172418.45257-4-akinobu.mita@gmail.com
Fixes: 488dac0c92 ("libfs: fix error cast of negative value in simple_attr_write()")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "fix error when writing negative value to simple attribute
files".
The simple attribute files do not accept a negative value since the commit
488dac0c92 ("libfs: fix error cast of negative value in
simple_attr_write()"), but some attribute files want to accept a negative
value.
This patch (of 3):
The simple attribute files do not accept a negative value since the commit
488dac0c92 ("libfs: fix error cast of negative value in
simple_attr_write()"), so we have to use a 64-bit value to write a
negative value.
This adds DEFINE_SIMPLE_ATTRIBUTE_SIGNED for a signed value.
Link: https://lkml.kernel.org/r/20220919172418.45257-1-akinobu.mita@gmail.com
Link: https://lkml.kernel.org/r/20220919172418.45257-2-akinobu.mita@gmail.com
Fixes: 488dac0c92 ("libfs: fix error cast of negative value in simple_attr_write()")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Removing the try_to_release_page() wrapper", v3.
This patchset replaces the remaining calls of try_to_release_page() with
the folio equivalent: filemap_release_folio(). This allows us to remove
the wrapper.
This patch (of 4):
Convert move_extent_per_page() to use folios. This change removes 5 calls
to compound_head() and is in preparation for the removal of the
try_to_release_page() wrapper.
Link: https://lkml.kernel.org/r/20221118073055.55694-1-vishal.moola@gmail.com
Link: https://lkml.kernel.org/r/20221118073055.55694-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The ei pointer does not need to cast the type.
Link: https://lkml.kernel.org/r/20221107015659.3221-1-zeming@nfschina.com
Signed-off-by: Li zeming <zeming@nfschina.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 9a10064f56 ("mm: add a field to store names for private
anonymous memory"), name for private anonymous memory, but not shared
anonymous, can be set. However, naming shared anonymous memory just as
useful for tracking purposes.
Extend the functionality to be able to set names for shared anon.
There are two ways to create anonymous shared memory, using memfd or
directly via mmap():
1. fd = memfd_create(...)
mem = mmap(..., MAP_SHARED, fd, ...)
2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...)
In both cases the anonymous shared memory is created the same way by
mapping an unlinked file on tmpfs.
The memfd way allows to give a name for anonymous shared memory, but
not useful when parts of shared memory require to have distinct names.
Example use case: The VMM maps VM memory as anonymous shared memory (not
private because VMM is sandboxed and drivers are running in their own
processes). However, the VM tells back to the VMM how parts of the memory
are actually used by the guest, how each of the segments should be backed
(i.e. 4K pages, 2M pages), and some other information about the segments.
The naming allows us to monitor the effective memory footprint for each
of these segments from the host without looking inside the guest.
Sample output:
/* Create shared anonymous segmenet */
anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
/* Name the segment: "MY-NAME" */
rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
anon_shmem, SIZE, "MY-NAME");
cat /proc/<pid>/maps (and smaps):
7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME]
If the segment is not named, the output is:
7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted)
Link: https://lkml.kernel.org/r/20221115020602.804224-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Colin Cross <ccross@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: xu xin <cgel.zte@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reported a null-ptr-deref bug:
NILFS (loop0): segctord starting. Construction interval = 5 seconds, CP
frequency < 30 seconds
general protection fault, probably for non-canonical address
0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
CPU: 1 PID: 3603 Comm: segctord Not tainted
6.1.0-rc2-syzkaller-00105-gb229b6ca5abb #0
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google
10/11/2022
RIP: 0010:nilfs_palloc_commit_free_entry+0xe5/0x6b0
fs/nilfs2/alloc.c:608
Code: 00 00 00 00 fc ff df 80 3c 02 00 0f 85 cd 05 00 00 48 b8 00 00 00
00 00 fc ff df 4c 8b 73 08 49 8d 7e 10 48 89 fa 48 c1 ea 03 <80> 3c 02
00 0f 85 26 05 00 00 49 8b 46 10 be a6 00 00 00 48 c7 c7
RSP: 0018:ffffc90003dff830 EFLAGS: 00010212
RAX: dffffc0000000000 RBX: ffff88802594e218 RCX: 000000000000000d
RDX: 0000000000000002 RSI: 0000000000002000 RDI: 0000000000000010
RBP: ffff888071880222 R08: 0000000000000005 R09: 000000000000003f
R10: 000000000000000d R11: 0000000000000000 R12: ffff888071880158
R13: ffff88802594e220 R14: 0000000000000000 R15: 0000000000000004
FS: 0000000000000000(0000) GS:ffff8880b9b00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb1c08316a8 CR3: 0000000018560000 CR4: 0000000000350ee0
Call Trace:
<TASK>
nilfs_dat_commit_free fs/nilfs2/dat.c:114 [inline]
nilfs_dat_commit_end+0x464/0x5f0 fs/nilfs2/dat.c:193
nilfs_dat_commit_update+0x26/0x40 fs/nilfs2/dat.c:236
nilfs_btree_commit_update_v+0x87/0x4a0 fs/nilfs2/btree.c:1940
nilfs_btree_commit_propagate_v fs/nilfs2/btree.c:2016 [inline]
nilfs_btree_propagate_v fs/nilfs2/btree.c:2046 [inline]
nilfs_btree_propagate+0xa00/0xd60 fs/nilfs2/btree.c:2088
nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337
nilfs_collect_file_data+0x45/0xd0 fs/nilfs2/segment.c:568
nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1018
nilfs_segctor_scan_file+0x3f4/0x6f0 fs/nilfs2/segment.c:1067
nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1197 [inline]
nilfs_segctor_collect fs/nilfs2/segment.c:1503 [inline]
nilfs_segctor_do_construct+0x12fc/0x6af0 fs/nilfs2/segment.c:2045
nilfs_segctor_construct+0x8e3/0xb30 fs/nilfs2/segment.c:2379
nilfs_segctor_thread_construct fs/nilfs2/segment.c:2487 [inline]
nilfs_segctor_thread+0x3c3/0xf30 fs/nilfs2/segment.c:2570
kthread+0x2e4/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
</TASK>
...
If DAT metadata file is corrupted on disk, there is a case where
req->pr_desc_bh is NULL and blocknr is 0 at nilfs_dat_commit_end() during
a b-tree operation that cascadingly updates ancestor nodes of the b-tree,
because nilfs_dat_commit_alloc() for a lower level block can initialize
the blocknr on the same DAT entry between nilfs_dat_prepare_end() and
nilfs_dat_commit_end().
If this happens, nilfs_dat_commit_end() calls nilfs_dat_commit_free()
without valid buffer heads in req->pr_desc_bh and req->pr_bitmap_bh, and
causes the NULL pointer dereference above in
nilfs_palloc_commit_free_entry() function, which leads to a crash.
Fix this by adding a NULL check on req->pr_desc_bh and req->pr_bitmap_bh
before nilfs_palloc_commit_free_entry() in nilfs_dat_commit_free().
This also calls nilfs_error() in that case to notify that there is a fatal
flaw in the filesystem metadata and prevent further operations.
Link: https://lkml.kernel.org/r/00000000000097c20205ebaea3d6@google.com
Link: https://lkml.kernel.org/r/20221114040441.1649940-1-zhangpeng362@huawei.com
Link: https://lkml.kernel.org/r/20221119120542.17204-1-konishi.ryusuke@gmail.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+ebe05ee8e98f755f61d0@syzkaller.appspotmail.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The atomic_read was accidentally replaced with atomic_inc_return,
which prevents the server from getting cleaned up and causes rmmod
to hang with a warning:
Can't purge s=00000001
Fixes: 2757a4dc18 ("afs: Fix access after dec in put functions")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20221130174053.2665818-1-marc.dionne@auristor.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xfs log io error will trigger xlog shut down, and end_io worker call
xlog_state_shutdown_callbacks to unpin and release the buf log item.
The race condition is that when there are some thread doing transaction
commit and happened not to be intercepted by xlog_is_shutdown, then,
these log item will be insert into CIL, when unpin and release these
buf log item, UAF will occur. BTW, add delay before `xlog_cil_commit`
can increase recurrence probability.
The following call graph actually encountered this bad situation.
fsstress io end worker kworker/0:1H-216
xlog_ioend_work
->xlog_force_shutdown
->xlog_state_shutdown_callbacks
->xlog_cil_process_committed
->xlog_cil_committed
->xfs_trans_committed_bulk
->xfs_trans_apply_sb_deltas ->li_ops->iop_unpin(lip, 1);
->xfs_trans_getsb
->_xfs_trans_bjoin
->xfs_buf_item_init
->if (bip) { return 0;} //relog
->xlog_cil_commit
->xlog_cil_insert_items //insert into CIL
->xfs_buf_ioend_fail(bp);
->xfs_buf_ioend
->xfs_buf_item_done
->xfs_buf_item_relse
->xfs_buf_item_free
when cil push worker gather percpu cil and insert super block buf log item
into ctx->log_items then uaf occurs.
==================================================================
BUG: KASAN: use-after-free in xlog_cil_push_work+0x1c8f/0x22f0
Write of size 8 at addr ffff88801800f3f0 by task kworker/u4:4/105
CPU: 0 PID: 105 Comm: kworker/u4:4 Tainted: G W
6.1.0-rc1-00001-g274115149b42 #136
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Workqueue: xfs-cil/sda xlog_cil_push_work
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x66
print_report+0x171/0x4a6
kasan_report+0xb3/0x130
xlog_cil_push_work+0x1c8f/0x22f0
process_one_work+0x6f9/0xf70
worker_thread+0x578/0xf30
kthread+0x28c/0x330
ret_from_fork+0x1f/0x30
</TASK>
Allocated by task 2145:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_slab_alloc+0x54/0x60
kmem_cache_alloc+0x14a/0x510
xfs_buf_item_init+0x160/0x6d0
_xfs_trans_bjoin+0x7f/0x2e0
xfs_trans_getsb+0xb6/0x3f0
xfs_trans_apply_sb_deltas+0x1f/0x8c0
__xfs_trans_commit+0xa25/0xe10
xfs_symlink+0xe23/0x1660
xfs_vn_symlink+0x157/0x280
vfs_symlink+0x491/0x790
do_symlinkat+0x128/0x220
__x64_sys_symlink+0x7a/0x90
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 216:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x40
__kasan_slab_free+0x105/0x1a0
kmem_cache_free+0xb6/0x460
xfs_buf_ioend+0x1e9/0x11f0
xfs_buf_item_unpin+0x3d6/0x840
xfs_trans_committed_bulk+0x4c2/0x7c0
xlog_cil_committed+0xab6/0xfb0
xlog_cil_process_committed+0x117/0x1e0
xlog_state_shutdown_callbacks+0x208/0x440
xlog_force_shutdown+0x1b3/0x3a0
xlog_ioend_work+0xef/0x1d0
process_one_work+0x6f9/0xf70
worker_thread+0x578/0xf30
kthread+0x28c/0x330
ret_from_fork+0x1f/0x30
The buggy address belongs to the object at ffff88801800f388
which belongs to the cache xfs_buf_item of size 272
The buggy address is located 104 bytes inside of
272-byte region [ffff88801800f388, ffff88801800f498)
The buggy address belongs to the physical page:
page:ffffea0000600380 refcount:1 mapcount:0 mapping:0000000000000000
index:0xffff88801800f208 pfn:0x1800e
head:ffffea0000600380 order:1 compound_mapcount:0 compound_pincount:0
flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
raw: 001fffff80010200 ffffea0000699788 ffff88801319db50 ffff88800fb50640
raw: ffff88801800f208 000000000015000a 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88801800f280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88801800f300: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88801800f380: fc fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88801800f400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88801800f480: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Fix uaf in xfs_trans_ail_delete during xlog force shutdown.
In commit cd6f79d1fb ("xfs: run callbacks before waking waiters in
xlog_state_shutdown_callbacks") changed the order of running callbacks
and wait for iclog completion to avoid unmount path untimely destroy AIL.
But which seems not enough to ensue this, adding mdelay in
`xfs_buf_item_unpin` can prove that.
The reproduction is as follows. To ensure destroy AIL safely,
we should wait all xlog ioend workers done and sync the AIL.
==================================================================
BUG: KASAN: use-after-free in xfs_trans_ail_delete+0x240/0x2a0
Read of size 8 at addr ffff888023169400 by task kworker/1:1H/43
CPU: 1 PID: 43 Comm: kworker/1:1H Tainted: G W
6.1.0-rc1-00002-gc28266863c4a #137
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Workqueue: xfs-log/sda xlog_ioend_work
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x66
print_report+0x171/0x4a6
kasan_report+0xb3/0x130
xfs_trans_ail_delete+0x240/0x2a0
xfs_buf_item_done+0x7b/0xa0
xfs_buf_ioend+0x1e9/0x11f0
xfs_buf_item_unpin+0x4c8/0x860
xfs_trans_committed_bulk+0x4c2/0x7c0
xlog_cil_committed+0xab6/0xfb0
xlog_cil_process_committed+0x117/0x1e0
xlog_state_shutdown_callbacks+0x208/0x440
xlog_force_shutdown+0x1b3/0x3a0
xlog_ioend_work+0xef/0x1d0
process_one_work+0x6f9/0xf70
worker_thread+0x578/0xf30
kthread+0x28c/0x330
ret_from_fork+0x1f/0x30
</TASK>
Allocated by task 9606:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0x7a/0x90
__kmalloc+0x59/0x140
kmem_alloc+0xb2/0x2f0
xfs_trans_ail_init+0x20/0x320
xfs_log_mount+0x37e/0x690
xfs_mountfs+0xe36/0x1b40
xfs_fs_fill_super+0xc5c/0x1a70
get_tree_bdev+0x3c5/0x6c0
vfs_get_tree+0x85/0x250
path_mount+0xec3/0x1830
do_mount+0xef/0x110
__x64_sys_mount+0x150/0x1f0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 9662:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x40
__kasan_slab_free+0x105/0x1a0
__kmem_cache_free+0x99/0x2d0
kvfree+0x3a/0x40
xfs_log_unmount+0x60/0xf0
xfs_unmountfs+0xf3/0x1d0
xfs_fs_put_super+0x78/0x300
generic_shutdown_super+0x151/0x400
kill_block_super+0x9a/0xe0
deactivate_locked_super+0x82/0xe0
deactivate_super+0x91/0xb0
cleanup_mnt+0x32a/0x4a0
task_work_run+0x15f/0x240
exit_to_user_mode_prepare+0x188/0x190
syscall_exit_to_user_mode+0x12/0x30
do_syscall_64+0x42/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The buggy address belongs to the object at ffff888023169400
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 0 bytes inside of
128-byte region [ffff888023169400, ffff888023169480)
The buggy address belongs to the physical page:
page:ffffea00008c5a00 refcount:1 mapcount:0 mapping:0000000000000000
index:0xffff888023168f80 pfn:0x23168
head:ffffea00008c5a00 order:1 compound_mapcount:0 compound_pincount:0
flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
raw: 001fffff80010200 ffffea00006b3988 ffffea0000577a88 ffff88800f842ac0
raw: ffff888023168f80 0000000000150007 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888023169300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888023169380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888023169400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888023169480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888023169500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Fixes: cd6f79d1fb ("xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks")
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
I've been running near-continuous integration testing of online fsck,
and I've noticed that once a day, one of the ARM VMs will fail the test
with out of order records in the data fork.
xfs/804 races fsstress with online scrub (aka scan but do not change
anything), so I think this might be a bug in the core xfs code. This
also only seems to trigger if one runs the test for more than ~6 minutes
via TIME_FACTOR=13 or something.
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf
I added a debugging patch to the kernel to check the data fork extents
after taking the ILOCK, before dropping ILOCK, and before and after each
bmapping operation. So far I've narrowed it down to the delalloc code
inserting a record in the wrong place in the iext tree:
xfs_bmap_add_extent_hole_delay, near line 2691:
case 0:
/*
* New allocation is not contiguous with another
* delayed allocation.
* Insert a new entry.
*/
oldlen = newlen = 0;
xfs_iunlock_check_datafork(ip); <-- ok here
xfs_iext_insert(ip, icur, new, state);
xfs_iunlock_check_datafork(ip); <-- bad here
break;
}
I recorded the state of the data fork mappings and iext cursor state
when a corrupt data fork is detected immediately after the
xfs_bmap_add_extent_hole_delay call in xfs_bmapi_reserve_delalloc:
ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork:
ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1
ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0
ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0
Here we see that a delalloc extent was inserted into the wrong position
in the iext leaf, same as all the other times. The extra trace data I
collected are as follows:
ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000
ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0
ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0
ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0
ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0
ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00
ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00
ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00
ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00
The first line shows that xfs_bmapi_reserve_delalloc was called with
whichfork=XFS_DATA_FORK, off=0xe6, len=0x4, prealloc=6.
The second line ("oldgot") shows the contents of @got at the beginning
of the call, which are the results of the first iext lookup in
xfs_buffered_write_iomap_begin.
Line 3 ("crapgot") is the result of duplicating the cursor at the start
of the body of xfs_bmapi_reserve_delalloc and performing a fresh lookup
at @off.
Line 4 ("freshgot") is the result of a new xfs_iext_get_extent right
before the call to xfs_bmap_add_extent_hole_delay. Totally garbage.
Line 5 ("nowgot") is contents of @got after the
xfs_bmap_add_extent_hole_delay call.
Line 6 is the contents of @icur at the beginning fo the call. Lines 7-9
are the contents of the iext cursors at the point where the block
mappings were sampled.
I think @oldgot is a HOLESTARTBLOCK extent because the first lookup
didn't find anything, so we filled in imap with "fake hole until the
end". At the time of the first lookup, I suspect that there's only one
32-block unwritten extent in the mapping (hence oldicurpos==1) but by
the time we get to recording crapgot, crapicurpos==2.
Dave then added:
Ok, that's much simpler to reason about, and implies the smoke is
coming from xfs_buffered_write_iomap_begin() or
xfs_bmapi_reserve_delalloc(). I suspect the former - it does a lot
of stuff with the ILOCK_EXCL held.....
.... including calling xfs_qm_dqattach_locked().
xfs_buffered_write_iomap_begin
ILOCK_EXCL
look up icur
xfs_qm_dqattach_locked
xfs_qm_dqattach_one
xfs_qm_dqget_inode
dquot cache miss
xfs_iunlock(ip, XFS_ILOCK_EXCL);
error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp);
xfs_ilock(ip, XFS_ILOCK_EXCL);
....
xfs_bmapi_reserve_delalloc(icur)
Yup, that's what is letting the magic smoke out -
xfs_qm_dqattach_locked() can cycle the ILOCK. If that happens, we
can pass a stale icur to xfs_bmapi_reserve_delalloc() and it all
goes downhill from there.
Back to Darrick now:
So. Fix this by moving the dqattach_locked call up before we take the
ILOCK, like all the other callers in that file.
Fixes: a526c85c22 ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
-Wuninitialized complains about @target in xfsaild_push being
uninitialized in the case where the waitqueue is active but there is no
last item in the AIL to wait for. I /think/ it should never be the case
that the subsequent xfs_trans_ail_cursor_first returns a log item and
hence we'll never end up at XFS_LSN_CMP, but let's make this explicit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When -Wstringop-truncation is enabled, the compiler complains about
truncation of the null byte at the end of the xattr name prefix. This
is intentional, since we're concatenating the two strings together and
do _not_ want a null byte in the middle of the name.
We've already ensured that the name buffer is long enough to handle
prefix and name, and the prefix_len is supposed to be the length of the
prefix string without the null byte, so use memcpy here instead.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Every now and then I see fstests failures on aarch64 (64k pages) that
trigger on the following sequence:
mkfs.xfs $dev
mount $dev $mnt
touch $mnt/a
umount $mnt
xfs_db -c 'path /a' -c 'print' $dev
99% of the time this succeeds, but every now and then xfs_db cannot find
/a and fails. This turns out to be a race involving udev/blkid, the
page cache for the block device, and the xfs_db process.
udev is triggered whenever anyone closes a block device or unmounts it.
The default udev rules invoke blkid to read the fs super and create
symlinks to the bdev under /dev/disk. For this, it uses buffered reads
through the page cache.
xfs_db also uses buffered reads to examine metadata. There is no
coordination between xfs_db and udev, which means that they can run
concurrently. Note there is no coordination between the kernel and
blkid either.
On a system with 64k pages, the page cache can cache the superblock and
the root inode (and hence the root dir) with the same 64k page. If
udev spawns blkid after the mkfs and the system is busy enough that it
is still running when xfs_db starts up, they'll both read from the same
page in the pagecache.
The unmount writes updated inode metadata to disk directly. The XFS
buffer cache does not use the bdev pagecache, nor does it invalidate the
pagecache on umount. If the above scenario occurs, the pagecache no
longer reflects what's on disk, xfs_db reads the stale metadata, and
fails to find /a. Most of the time this succeeds because closing a bdev
invalidates the page cache, but when processes race, everyone loses.
Fix the problem by invalidating the bdev pagecache after flushing the
bdev, so that xfs_db will see up to date metadata.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
generic_remap_checks() can reduce the effective request length (i.e.,
after the reflink extend to EOF case is handled) down to zero. If this
occurs, __generic_remap_file_range_prep() proceeds through dio
serialization, file mapping flush calls, and may invoke file_modified()
before returning back to the filesystem caller, all of which immediately
check for len == 0 and return.
While this is mostly harmless, it is spurious and not completely
without side effect. A filemap write call can submit I/O (but not
wait on it) when the specified end byte precedes the start but
happens to land on the same aligned page boundary, which can occur
from __generic_remap_file_range_prep() when len is 0.
The dedupe path already has a len == 0 check to break out before
doing range comparisons. Lift this check a bit earlier in the
function to cover the general case of len == 0 and avoid the
unnecessary work. While here, account for the case where
generic_remap_check_len() may also reduce length to zero.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
It's fairly useless to complain about using an obsolete feature without
telling the user which process used it. My Fedora desktop randomly drops
this message, but I would really need this patch to figure out what
triggers is.
[ jlayton: print pid as well as process name ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
nfsd currently doesn't access i_flctx safely everywhere. This requires a
smp_load_acquire, as the pointer is set via cmpxchg (a release
operation).
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
nfs currently doesn't access i_flctx safely. This requires a
smp_load_acquire, as the pointer is set via cmpxchg (a release
operation).
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
lockd currently doesn't access i_flctx safely. This requires a
smp_load_acquire, as the pointer is set via cmpxchg (a release
operation).
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
ksmbd currently doesn't access i_flctx safely. This requires a
smp_load_acquire, as the pointer is set via cmpxchg (a release
operation).
Cc: Steve French <sfrench@samba.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
cifs currently doesn't access i_flctx safely. This requires a
smp_load_acquire, as the pointer is set via cmpxchg (a release
operation).
Cc: Steve French <smfrench@samba.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
ceph currently doesn't access i_flctx safely. This requires a
smp_load_acquire, as the pointer is set via cmpxchg (a release
operation).
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
There are a number of places in the kernel that are accessing the
inode->i_flctx field without smp_load_acquire. This is required to
ensure that the caller doesn't see a partially-initialized structure.
Add a new accessor function for it to make this clear and convert all of
the relevant accesses in locks.c to use it. Also, convert
locks_free_lock_context to use the helper as well instead of just doing
a "bare" assignment.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Ceph has a need to know whether a particular inode has any locks set on
it. It's currently tracking that by a num_locks field in its
filp->private_data, but that's problematic as it tries to decrement this
field when releasing locks and that can race with the file being torn
down.
Add a new vfs_inode_has_locks helper that just returns whether any locks
are currently held on the inode.
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Instead of looking up the algorithm by name in hash_algo_name[] to get
its hash_algo ID, just store the hash_algo ID in the fsverity_hash_alg
struct. Verify at boot time that every fsverity_hash_alg has a valid
hash_algo ID with matching digest size.
Remove an unnecessary memset() of the whole digest array to 0 before the
digest is copied into it.
Finally, remove the pr_debug statement. There is already a pr_debug for
the fsverity digest when the file is opened.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Link: https://lore.kernel.org/r/20221129045139.69803-1-ebiggers@kernel.org
Fix the following coccicheck warning:
fs/ext4/inline.c:183: WARNING opportunity for min().
fs/ext4/extents.c:2631: WARNING opportunity for max().
fs/ext4/extents.c:2632: WARNING opportunity for min().
fs/ext4/extents.c:5559: WARNING opportunity for max().
fs/ext4/super.c:6908: WARNING opportunity for min().
min()/max() and min_t() macro is defined in include/linux/minmax.h.
It avoids multiple evaluations of the arguments when non-constant and
performs strict type-checking.
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220817025928.612851-1-13667453960@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In ext4_evict_inode(), if we evicting an inode in the 'no_delete' path,
it cannot be raced by another mark_inode_dirty(). If it happens,
someone else may accidentally dirty it without holding inode refcount
and probably cause use-after-free issues in the writeback procedure.
It's indiscoverable and hard to debug, so add an WARN_ON_ONCE() to
check and detect this issue in advance.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220629112647.4141034-2-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
As with PG_arch_2, this flag is only allowed on 64-bit architectures due
to the shortage of bits available. It will be used by the arm64 MTE code
in subsequent patches.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Steven Price <steven.price@arm.com>
[catalin.marinas@arm.com: added flag preserving in __split_huge_page_tail()]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221104011041.290951-5-pcc@google.com
Commit 4beba9486a ("mm: Add PG_arch_2 page flag") introduced a new
page flag for all 64-bit architectures. However, even if an architecture
is 64-bit, it may still have limited spare bits in the 'flags' member of
'struct page'. This may happen if an architecture enables SPARSEMEM
without SPARSEMEM_VMEMMAP as is the case with the newly added loongarch.
This architecture port needs 19 more bits for the sparsemem section
information and, while it is currently fine with PG_arch_2, adding any
more PG_arch_* flags will trigger build-time warnings.
Add a new CONFIG_ARCH_USES_PG_ARCH_X option which can be selected by
architectures that need more PG_arch_* flags beyond PG_arch_1. Select it
on arm64.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[pcc@google.com: fix build with CONFIG_ARM64_MTE disabled]
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Price <steven.price@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221104011041.290951-2-pcc@google.com
As a step towards freeing the PG_error flag for other uses, change ext4
and f2fs to stop using PG_error to track verity errors. Instead, if a
verity error occurs, just mark the whole bio as failed. The coarser
granularity isn't really a problem since it isn't any worse than what
the block layer provides, and errors from a multi-page readahead aren't
reported to applications unless a single-page read fails too.
f2fs supports compression, which makes the f2fs changes a bit more
complicated than desired, but the basic premise still works.
Note: there are still a few uses of PageError in f2fs, but they are on
the write path, so they are unrelated and this patch doesn't touch them.
Reviewed-by: Chao Yu <chao@kernel.org>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221129070401.156114-1-ebiggers@kernel.org
The fileserver probing code attempts to work out the best fileserver to
use for a volume by retrieving the RTT calculated by AF_RXRPC for the
probe call sent to each server and comparing them. Sometimes, however,
no RTT estimate is available and rxrpc_kernel_get_srtt() returns false,
leading good fileservers to be given an RTT of UINT_MAX and thus causing
the rotation algorithm to ignore them.
Fix afs_select_fileserver() to ignore rxrpc_kernel_get_srtt()'s return
value and just take the estimated RTT it provides - which will be capped
at 1 second.
Fixes: 1d4adfaf65 ("rxrpc: Make rxrpc_kernel_get_srtt() indicate validity")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/166965503999.3392585.13954054113218099395.stgit@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new error injection knob so that we can arbitrarily slow down
pagecache writes to test for race conditions and aberrant reclaim
behavior if the writeback mechanisms are slow to issue writeback. This
will enable functional testing for the ifork sequence counters
introduced in commit 304a68b9c6 ("xfs: use iomap_valid method to
detect stale cached iomaps") that fixes write racing with reclaim
writeback.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a new error injection knob so that we can arbitrarily slow down
writeback to test for race conditions and aberrant reclaim behavior if
the writeback mechanisms are slow to issue writeback. This will enable
functional testing for the ifork sequence counters introduced in commit
745b3f76d1 ("xfs: maintain a sequence count for inode fork
manipulations").
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
This patch series fixes a data corruption that occurs in a specific
multi-threaded write workload. The workload combined
racing unaligned adjacent buffered writes with low memory conditions
that caused both writeback and memory reclaim to race with the
writes.
The result of this was random partial blocks containing zeroes
instead of the correct data. The underlying problem is that iomap
caches the write iomap for the duration of the write() operation,
but it fails to take into account that the extent underlying the
iomap can change whilst the write is in progress.
The short story is that an iomap can span mutliple folios, and so
under low memory writeback can be cleaning folios the write()
overlaps. Whilst the overlapping data is cached in memory, this
isn't a problem, but because the folios are now clean they can be
reclaimed. Once reclaimed, the write() does the wrong thing when
re-instantiating partial folios because the iomap no longer reflects
the underlying state of the extent. e.g. it thinks the extent is
unwritten, so it zeroes the partial range, when in fact the
underlying extent is now written and so it should have read the data
from disk. This is how we get random zero ranges in the file
instead of the correct data.
The gory details of the race condition can be found here:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
Fixing the problem has two aspects. The first aspect of the problem
is ensuring that iomap can detect a stale cached iomap during a
write in a race-free manner. We already do this stale iomap
detection in the writeback path, so we have a mechanism for
detecting that the iomap backing the data range may have changed
and needs to be remapped.
In the case of the write() path, we have to ensure that the iomap is
validated at a point in time when the page cache is stable and
cannot be reclaimed from under us. We also need to validate the
extent before we start performing any modifications to the folio
state or contents. Combine these two requirements together, and the
only "safe" place to validate the iomap is after we have looked up
and locked the folio we are going to copy the data into, but before
we've performed any initialisation operations on that folio.
If the iomap fails validation, we then mark it stale, unlock the
folio and end the write. This effectively means a stale iomap
results in a short write. Filesystems should already be able to
handle this, as write operations can end short for many reasons and
need to iterate through another mapping cycle to be completed. Hence
the iomap changes needed to detect and handle stale iomaps during
write() operations is relatively simple...
However, the assumption is that filesystems should already be able
to handle write failures safely, and that's where the second
(first?) part of the problem exists. That is, handling a partial
write is harder than just "punching out the unused delayed
allocation extent". This is because mmap() based faults can race
with writes, and if they land in the delalloc region that the write
allocated, then punching out the delalloc region can cause data
corruption.
This data corruption problem is exposed by generic/346 when iomap is
converted to detect stale iomaps during write() operations. Hence
write failure in the filesytems needs to handle the fact that the
write() in progress doesn't necessarily own the data in the page
cache over the range of the delalloc extent it just allocated.
As a result, we can't just truncate the page cache over the range
the write() didn't reach and punch all the delalloc extent. We have
to walk the page cache over the untouched range and skip over any
dirty data region in the cache in that range. Which is ....
non-trivial.
That is, iterating the page cache has to handle partially populated
folios (i.e. block size < page size) that contain data. The data
might be discontiguous within a folio. Indeed, there might be
*multiple* discontiguous data regions within a single folio. And to
make matters more complex, multi-page folios mean we just don't know
how many sub-folio regions we might have to iterate to find all
these regions. All the corner cases between the conversions and
rounding between filesystem block size, folio size and multi-page
folio size combined with unaligned write offsets kept breaking my
brain.
However, if we convert the code to track the processed
write regions by byte ranges instead of fileystem block or page
cache index, we could simply use mapping_seek_hole_data() to find
the start and end of each discrete data region within the range we
needed to scan. SEEK_DATA finds the start of the cached data region,
SEEK_HOLE finds the end of the region. These are byte based
interfaces that understand partially uptodate folio regions, and so
can iterate discrete sub-folio data regions directly. This largely
solved the problem of discovering the dirty regions we need to keep
the delalloc extent over.
However, to use mapping_seek_hole_data() without needing to export
it, we have to move all the delalloc extent cleanup to the iomap
core and so now the iomap core can clean up delayed allocation
extents in a safe, sane and filesystem neutral manner.
With all this done, the original data corruption never occurs
anymore, and we now have a generic mechanism for ensuring that page
cache writes do not do the wrong thing when writeback and reclaim
change the state of the physical extent and/or page cache contents
whilst the write is in progress.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmOFSzwUHGRhdmlkQGZy
b21vcmJpdC5jb20ACgkQregpR/R1+h3djhAAwOf9VeLO7TW/0B1XfE3ktWGiDmEG
ekB8mkB7CAHB9SBq7TZMHjktJIJxY81q5+Iq9qHGiW3asoVbmWvkeRSJgXljhTby
D2KsUIT1NK/X6DhC9FhNjv/Q2GJ0nY6s65RLudUEkelYBFhGMM0kdXX+fZmtZ4yT
T/lRYk/KBFpeQCaGRcFXK55TnB/B9muOI9FyKvh2DNWe6r0Xu3Obb3a9k+snZA9R
EeUpAosDSrXzP4c2w2ovpU2eutUdo4eYTHIzXKGkhktbRhmCRLn4NlxvFCanoe8h
eSS85sb8DHUh2iyaaB8yrJ6LL3MuBytOi24rNBeyd1KAyEtT21+cTUK/QAahzble
pL8l6TA7ZXbhYcbk5uQvFEIAInR+0ffjde61uE14N55awq0Vdrym7C7D2ri60iw6
ts45AVYKYeF61coAbwvmaJyvqvQ0tUlmVZXI4lQzN2O17Lr6004gJFMjDRsXXU7H
eHLUt496Geq39rglw85y8G0vmxxGZ9iIGkeC1kUSSCmlvx/JfuJlbWBgyMGtNRBI
qzv0jmk67Ft1seQSMWQJttxCZs4uOF2gwERYGAF7iUR8F4PGob/N1e2/hpg75G8q
0S8u1N1p8Cv5u/jwybqy8FnSC2MlUZl6SQURaVQDy2DLMKHb4T1diu0jrCbiSPiF
JKfQ7aNQxaEZIJw=
=cv9i
-----END PGP SIGNATURE-----
Merge tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB
xfs, iomap: fix data corruption due to stale cached iomaps
This patch series fixes a data corruption that occurs in a specific
multi-threaded write workload. The workload combined
racing unaligned adjacent buffered writes with low memory conditions
that caused both writeback and memory reclaim to race with the
writes.
The result of this was random partial blocks containing zeroes
instead of the correct data. The underlying problem is that iomap
caches the write iomap for the duration of the write() operation,
but it fails to take into account that the extent underlying the
iomap can change whilst the write is in progress.
The short story is that an iomap can span mutliple folios, and so
under low memory writeback can be cleaning folios the write()
overlaps. Whilst the overlapping data is cached in memory, this
isn't a problem, but because the folios are now clean they can be
reclaimed. Once reclaimed, the write() does the wrong thing when
re-instantiating partial folios because the iomap no longer reflects
the underlying state of the extent. e.g. it thinks the extent is
unwritten, so it zeroes the partial range, when in fact the
underlying extent is now written and so it should have read the data
from disk. This is how we get random zero ranges in the file
instead of the correct data.
The gory details of the race condition can be found here:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
Fixing the problem has two aspects. The first aspect of the problem
is ensuring that iomap can detect a stale cached iomap during a
write in a race-free manner. We already do this stale iomap
detection in the writeback path, so we have a mechanism for
detecting that the iomap backing the data range may have changed
and needs to be remapped.
In the case of the write() path, we have to ensure that the iomap is
validated at a point in time when the page cache is stable and
cannot be reclaimed from under us. We also need to validate the
extent before we start performing any modifications to the folio
state or contents. Combine these two requirements together, and the
only "safe" place to validate the iomap is after we have looked up
and locked the folio we are going to copy the data into, but before
we've performed any initialisation operations on that folio.
If the iomap fails validation, we then mark it stale, unlock the
folio and end the write. This effectively means a stale iomap
results in a short write. Filesystems should already be able to
handle this, as write operations can end short for many reasons and
need to iterate through another mapping cycle to be completed. Hence
the iomap changes needed to detect and handle stale iomaps during
write() operations is relatively simple...
However, the assumption is that filesystems should already be able
to handle write failures safely, and that's where the second
(first?) part of the problem exists. That is, handling a partial
write is harder than just "punching out the unused delayed
allocation extent". This is because mmap() based faults can race
with writes, and if they land in the delalloc region that the write
allocated, then punching out the delalloc region can cause data
corruption.
This data corruption problem is exposed by generic/346 when iomap is
converted to detect stale iomaps during write() operations. Hence
write failure in the filesytems needs to handle the fact that the
write() in progress doesn't necessarily own the data in the page
cache over the range of the delalloc extent it just allocated.
As a result, we can't just truncate the page cache over the range
the write() didn't reach and punch all the delalloc extent. We have
to walk the page cache over the untouched range and skip over any
dirty data region in the cache in that range. Which is ....
non-trivial.
That is, iterating the page cache has to handle partially populated
folios (i.e. block size < page size) that contain data. The data
might be discontiguous within a folio. Indeed, there might be
*multiple* discontiguous data regions within a single folio. And to
make matters more complex, multi-page folios mean we just don't know
how many sub-folio regions we might have to iterate to find all
these regions. All the corner cases between the conversions and
rounding between filesystem block size, folio size and multi-page
folio size combined with unaligned write offsets kept breaking my
brain.
However, if we convert the code to track the processed
write regions by byte ranges instead of fileystem block or page
cache index, we could simply use mapping_seek_hole_data() to find
the start and end of each discrete data region within the range we
needed to scan. SEEK_DATA finds the start of the cached data region,
SEEK_HOLE finds the end of the region. These are byte based
interfaces that understand partially uptodate folio regions, and so
can iterate discrete sub-folio data regions directly. This largely
solved the problem of discovering the dirty regions we need to keep
the delalloc extent over.
However, to use mapping_seek_hole_data() without needing to export
it, we have to move all the delalloc extent cleanup to the iomap
core and so now the iomap core can clean up delayed allocation
extents in a safe, sane and filesystem neutral manner.
With all this done, the original data corruption never occurs
anymore, and we now have a generic mechanism for ensuring that page
cache writes do not do the wrong thing when writeback and reclaim
change the state of the physical extent and/or page cache contents
whilst the write is in progress.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: drop write error injection is unfixable, remove it
xfs: use iomap_valid method to detect stale cached iomaps
iomap: write iomap validity checks
xfs: xfs_bmap_punch_delalloc_range() should take a byte range
iomap: buffered write failure should not truncate the page cache
xfs,iomap: move delalloc punching to iomap
xfs: use byte ranges for write cleanup ranges
xfs: punching delalloc extents on write failure is racy
xfs: write page faults in iomap are not buffered writes
With the changes to scan the page cache for dirty data to avoid data
corruptions from partial write cleanup racing with other page cache
operations, the drop writes error injection no longer works the same
way it used to and causes xfs/196 to fail. This is because xfs/196
writes to the file and populates the page cache before it turns on
the error injection and starts failing -overwrites-.
The result is that the original drop-writes code failed writes only
-after- overwriting the data in the cache, followed by invalidates
the cached data, then punching out the delalloc extent from under
that data.
On the surface, this looks fine. The problem is that page cache
invalidation *doesn't guarantee that it removes anything from the
page cache* and it doesn't change the dirty state of the folio. When
block size == page size and we do page aligned IO (as xfs/196 does)
everything happens to align perfectly and page cache invalidation
removes the single page folios that span the written data. Hence the
followup delalloc punch pass does not find cached data over that
range and it can punch the extent out.
IOWs, xfs/196 "works" for block size == page size with the new
code. I say "works", because it actually only works for the case
where IO is page aligned, and no data was read from disk before
writes occur. Because the moment we actually read data first, the
readahead code allocates multipage folios and suddenly the
invalidate code goes back to zeroing subfolio ranges without
changing dirty state.
Hence, with multipage folios in play, block size == page size is
functionally identical to block size < page size behaviour, and
drop-writes is manifestly broken w.r.t to this case. Invalidation of
a subfolio range doesn't result in the folio being removed from the
cache, just the range gets zeroed. Hence after we've sequentially
walked over a folio that we've dirtied (via write data) and then
invalidated, we end up with a dirty folio full of zeroed data.
And because the new code skips punching ranges that have dirty
folios covering them, we end up leaving the delalloc range intact
after failing all the writes. Hence failed writes now end up
writing zeroes to disk in the cases where invalidation zeroes folios
rather than removing them from cache.
This is a fundamental change of behaviour that is needed to avoid
the data corruption vectors that exist in the old write fail path,
and it renders the drop-writes injection non-functional and
unworkable as it stands.
As it is, I think the error injection is also now unnecessary, as
partial writes that need delalloc extent are going to be a lot more
common with stale iomap detection in place. Hence this patch removes
the drop-writes error injection completely. xfs/196 can remain for
testing kernels that don't have this data corruption fix, but those
that do will report:
xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Now that iomap supports a mechanism to validate cached iomaps for
buffered write operations, hook it up to the XFS buffered write ops
so that we can avoid data corruptions that result from stale cached
iomaps. See:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
or the ->iomap_valid() introduction commit for exact details of the
corruption vector.
The validity cookie we store in the iomap is based on the type of
iomap we return. It is expected that the iomap->flags we set in
xfs_bmbt_to_iomap() is not perturbed by the iomap core and are
returned to us in the iomap passed via the .iomap_valid() callback.
This ensures that the validity cookie is always checking the correct
inode fork sequence numbers to detect potential changes that affect
the extent cached by the iomap.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
A recent multithreaded write data corruption has been uncovered in
the iomap write code. The core of the problem is partial folio
writes can be flushed to disk while a new racing write can map it
and fill the rest of the page:
writeback new write
allocate blocks
blocks are unwritten
submit IO
.....
map blocks
iomap indicates UNWRITTEN range
loop {
lock folio
copyin data
.....
IO completes
runs unwritten extent conv
blocks are marked written
<iomap now stale>
get next folio
}
Now add memory pressure such that memory reclaim evicts the
partially written folio that has already been written to disk.
When the new write finally gets to the last partial page of the new
write, it does not find it in cache, so it instantiates a new page,
sees the iomap is unwritten, and zeros the part of the page that
it does not have data from. This overwrites the data on disk that
was originally written.
The full description of the corruption mechanism can be found here:
https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
To solve this problem, we need to check whether the iomap is still
valid after we lock each folio during the write. We have to do it
after we lock the page so that we don't end up with state changes
occurring while we wait for the folio to be locked.
Hence we need a mechanism to be able to check that the cached iomap
is still valid (similar to what we already do in buffered
writeback), and we need a way for ->begin_write to back out and
tell the high level iomap iterator that we need to remap the
remaining write range.
The iomap needs to grow some storage for the validity cookie that
the filesystem provides to travel with the iomap. XFS, in
particular, also needs to know some more information about what the
iomap maps (attribute extents rather than file data extents) to for
the validity cookie to cover all the types of iomaps we might need
to validate.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
All the callers of xfs_bmap_punch_delalloc_range() jump through
hoops to convert a byte range to filesystem blocks before calling
xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
xfs_bmap_punch_delalloc_range() and have it do the conversion to
filesystem blocks internally.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
iomap_file_buffered_write_punch_delalloc() currently invalidates the
page cache over the unused range of the delalloc extent that was
allocated. While the write allocated the delalloc extent, it does
not own it exclusively as the write does not hold any locks that
prevent either writeback or mmap page faults from changing the state
of either the page cache or the extent state backing this range.
Whilst xfs_bmap_punch_delalloc_range() already handles races in
extent conversion - it will only punch out delalloc extents and it
ignores any other type of extent - the page cache truncate does not
discriminate between data written by this write or some other task.
As a result, truncating the page cache can result in data corruption
if the write races with mmap modifications to the file over the same
range.
generic/346 exercises this workload, and if we randomly fail writes
(as will happen when iomap gets stale iomap detection later in the
patchset), it will randomly corrupt the file data because it removes
data written by mmap() in the same page as the write() that failed.
Hence we do not want to punch out the page cache over the range of
the extent we failed to write to - what we actually need to do is
detect the ranges that have dirty data in cache over them and *not
punch them out*.
To do this, we have to walk the page cache over the range of the
delalloc extent we want to remove. This is made complex by the fact
we have to handle partially up-to-date folios correctly and this can
happen even when the FSB size == PAGE_SIZE because we now support
multi-page folios in the page cache.
Because we are only interested in discovering the edges of data
ranges in the page cache (i.e. hole-data boundaries) we can make use
of mapping_seek_hole_data() to find those transitions in the page
cache. As we hold the invalidate_lock, we know that the boundaries
are not going to change while we walk the range. This interface is
also byte-based and is sub-page block aware, so we can find the data
ranges in the cache based on byte offsets rather than page, folio or
fs block sized chunks. This greatly simplifies the logic of finding
dirty cached ranges in the page cache.
Once we've identified a range that contains cached data, we can then
iterate the range folio by folio. This allows us to determine if the
data is dirty and hence perform the correct delalloc extent punching
operations. The seek interface we use to iterate data ranges will
give us sub-folio start/end granularity, so we may end up looking up
the same folio multiple times as the seek interface iterates across
each discontiguous data region in the folio.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCY4SQxQAKCRDh3BK/laaZ
PAWSAP9WHL4ejQtVu2NMaEhZyIxs3weXLrFMPcQqOJ5JZhrgGAD/d6JufR/4jKWK
Sf/VLPsDlXsvPyCMJOSZAsQ5Bt1reA4=
=57DJ
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
"Fix a regression introduced in -rc4"
* tag 'fuse-fixes-6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: lock inode unconditionally in fuse_fallocate()
Through this node, you can control the background discard
to run more aggressively or not aggressively when reach the
utilization rate of the space.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Do cleanup in f2fs_tuning_parameters() and __init_discard_policy(),
let's use macro instead of number.
Suggested-by: Chao Yu <chao@kernel.org>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Under the current logic, after the discard thread wakes up, it will not
run according to the expected policy, but will use the expected policy
before sleep. Move the strategy selection to after the thread wakes up,
so that the running state of the thread meets expectations.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When f2fs chooses GC victim in large section & LFS mode,
next_victim_seg[gc_type] is referenced first. After segment is freed,
next_victim_seg[gc_type] has the next segment number.
However, next_victim_seg[gc_type] still has the last segment number
even after the last segment of section is freed. In this case, when f2fs
chooses a victim for the next GC round, the last segment of previous victim
section is chosen as a victim.
Initialize next_victim_seg[gc_type] to NULL_SEGNO for the last segment in
large section.
Fixes: e3080b0120 ("f2fs: support subsectional garbage collection")
Signed-off-by: Yonggil Song <yonggil.song@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use f2fs_do_truncate_blocks() to truncate all blocks in-batch in
__complete_revoke_list().
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since __queue_discard_cmd() never returns an error,
let's make it return void.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since the file name has already passed to f2fs_new_inode(), let's
move set_file_temperature() into f2fs_new_inode().
Signed-off-by: Sheng Yong <shengyong@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If compress_extension is set, and a newly created file matches the
extension, the file could be marked as compression file. However,
if inline_data is also enabled, there is no chance to check its
extension since f2fs_should_compress() always returns false.
This patch moves set_compress_inode(), which do extension check, in
f2fs_should_compress() to check extensions before setting inline
data flag.
Fixes: 7165841d57 ("f2fs: fix to check inline_data during compressed inode conversion")
Signed-off-by: Sheng Yong <shengyong@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When evicting an inode with default dioread_nolock, it could be raced by
the unwritten extents converting kworker after writeback some new
allocated dirty blocks. It convert unwritten extents to written, the
extents could be merged to upper level and free extent blocks, so it
could mark the inode dirty again even this inode has been marked
I_FREEING. But the inode->i_io_list check and warning in
ext4_evict_inode() missing this corner case. Fortunately,
ext4_evict_inode() will wait all extents converting finished before this
check, so it will not lead to inode use-after-free problem, every thing
is OK besides this warning. The WARN_ON_ONCE was originally designed
for finding inode use-after-free issues in advance, but if we add
current dioread_nolock case in, it will become not quite useful, so fix
this warning by just remove this check.
======
WARNING: CPU: 7 PID: 1092 at fs/ext4/inode.c:227
ext4_evict_inode+0x875/0xc60
...
RIP: 0010:ext4_evict_inode+0x875/0xc60
...
Call Trace:
<TASK>
evict+0x11c/0x2b0
iput+0x236/0x3a0
do_unlinkat+0x1b4/0x490
__x64_sys_unlinkat+0x4c/0xb0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fa933c1115b
======
rm kworker
ext4_end_io_end()
vfs_unlink()
ext4_unlink()
ext4_convert_unwritten_io_end_vec()
ext4_convert_unwritten_extents()
ext4_map_blocks()
ext4_ext_map_blocks()
ext4_ext_try_to_merge_up()
__mark_inode_dirty()
check !I_FREEING
locked_inode_to_wb_and_lock_list()
iput()
iput_final()
evict()
ext4_evict_inode()
truncate_inode_pages_final() //wait release io_end
inode_io_list_move_locked()
ext4_release_io_end()
trigger WARN_ON_ONCE()
Cc: stable@kernel.org
Fixes: ceff86fdda ("ext4: Avoid freeing inodes on dirty list")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220629112647.4141034-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Before this patch, the varibale 'readdir_ra' takes effect if it's equal
to '1' or not, so we can change type for it from 'int' to 'bool'.
Signed-off-by: Yuwei Guan <Yuwei.Guan@zeekrlife.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The commit 84b89e5d94 ("f2fs: add auto tuning for small devices") add
tuning for small volume device, now support to tune alloce_mode to 'reuse'
if it's small size. But the alloc_mode will change to 'default' when do
remount on this small size dievce. This patch fo fix alloc_mode changed
when do remount for a small volume device.
Signed-off-by: Yuwei Guan <Yuwei.Guan@zeekrlife.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Complaint from Matthew Wilcox in another similar place:
"submit? You don't submit anything at the 'submit' label.
it should be called 'skip' or something. But I think this
is just badly written and you don't need a goto at all."
Let's remove submit label for readability.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
introduce a new ioctl to replace the whole content of a file atomically,
which means it induces truncate and content update at the same time.
We can start it with F2FS_IOC_START_ATOMIC_REPLACE and complete it with
F2FS_IOC_COMMIT_ATOMIC_WRITE. Or abort it with
F2FS_IOC_ABORT_ATOMIC_WRITE.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In a coming patch, we're going to rework how the filecache refcounting
works. Move some code around in the function to reduce the churn in the
later patches, and rename some of the functions with (hopefully) clearer
names: nfsd_file_flush becomes nfsd_file_fsync, and
nfsd_file_unhash_and_dispose is renamed to nfsd_file_unhash_and_queue.
Also, the nfsd_file_put_final tracepoint is renamed to nfsd_file_free,
to better match the name of the function from which it's called.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We're counting mapping->nrpages, but not all of those are necessarily
dirty. We don't really have a simple way to count just the dirty pages,
so just remove this stat since it's not accurate.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
fh_match() is costly, especially when filehandles are large (as is
the case for NFSv4). It needs to be used sparingly when searching
data structures. Unfortunately, with common workloads, I see
multiple thousands of objects stored in file_hashtbl[], which has
just 256 buckets, making its bucket hash chains quite lengthy.
Walking long hash chains with the state_lock held blocks other
activity that needs that lock. Sizable hash chains are a common
occurrance once the server has handed out some delegations, for
example -- IIUC, each delegated file is held open on the server by
an nfs4_file object.
To help mitigate the cost of searching with fh_match(), replace the
nfs4_file hash table with an rhashtable, which can dynamically
resize its bucket array to minimize hash chain length.
The result of this modification is an improvement in the latency of
NFSv4 operations, and the reduction of nfsd CPU utilization due to
eliminating the cost of multiple calls to fh_match() and reducing
the CPU cache misses incurred while walking long hash chains in the
nfs4_file hash table.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
find_file() is now the only caller of find_file_locked(), so just
fold these two together.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Remove the call to find_file_locked() in insert_nfs4_file(). Tracing
shows that over 99% of these calls return NULL. Thus it is not worth
the expense of the extra bucket list traversal. insert_file() already
deals correctly with the case where the item is already in the hash
bucket.
Since nfsd4_file_hash_insert() is now just a wrapper around
insert_file(), move the meat of insert_file() into
nfsd4_file_hash_insert() and get rid of it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Refactor to relocate hash deletion operation to a helper function
that is close to most other nfs4_file data structure operations.
The "noinline" annotation will become useful in a moment when the
hlist_del_rcu() is replaced with a more complex rhash remove
operation. It also guarantees that hash remove operations can be
traced with "-p function -l remove_nfs4_file_locked".
This also simplifies the organization of forward declarations: the
to-be-added rhashtable and its param structure will be defined
/after/ put_nfs4_file().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Name this function more consistently. I'm going to use nfsd4_file_
and nfsd4_file_hash_ for these helpers.
Change the @fh parameter to be const pointer for better type safety.
Finally, move the hash insertion operation to the caller. This is
typical for most other "init_object" type helpers, and it is where
most of the other nfs4_file hash table operations are located.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Enable callers to use const pointers for type safety.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Enable callers to use const pointers where they are able to.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Delegation revocation is an exceptional event that is not otherwise
visible externally (eg, no network traffic is emitted). Generate a
trace record when it occurs so that revocation can be observed or
other activity can be triggered. Example:
nfsd-1104 [005] 1912.002544: nfsd_stid_revoke: client 633c9343:4e82788d stateid 00000003:00000001 ref=2 type=DELEG
Trace infrastructure is provided for subsequent additional tracing
related to nfs4_stid activity.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Handing out a delegation stateid is recorded with the
nfsd_deleg_read tracepoint, but there isn't a matching tracepoint
for recording when the stateid is returned.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Remove the lame-duck dprintk()s around nfs4_preprocess_stateid_op()
call sites.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Record what we've learned recently about the NFSD filecache in a
documenting comment so our future selves don't forget what all this
is for.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
NFSv4 operations manage the lifetime of nfsd_file items they use by
means of NFSv4 OPEN and CLOSE. Hence there's no need for them to be
garbage collected.
Introduce a mechanism to enable garbage collection for nfsd_file
items used only by NFSv2/3 callers.
Note that the change in nfsd_file_put() ensures that both CLOSE and
DELEGRETURN will actually close out and free an nfsd_file on last
reference of a non-garbage-collected file.
Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=394
Suggested-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
This reverts commit 5e138c4a75.
That commit attempted to make files available to other users as soon
as all NFSv4 clients were done with them, rather than waiting until
the filecache LRU had garbage collected them.
It gets the reference counting wrong, for one thing.
But it also misses that DELEGRETURN should release a file in the
same fashion. In fact, any nfsd_file_put() on an file held open
by an NFSv4 client needs potentially to release the file
immediately...
Clear the way for implementing that idea.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
In a moment I'm going to introduce separate nfsd_file types, one of
which is garbage-collected; the other, not. The garbage-collected
variety is to be used by NFSv2 and v3, and the non-garbage-collected
variety is to be used by NFSv4.
nfsd_commit() is invoked by both NFSv3 and NFSv4 consumers. We want
nfsd_commit() to find and use the correct variety of cached
nfsd_file object for the NFS version that is in use.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
We had a report of this:
BUG: sleeping function called from invalid context at fs/nfsd/filecache.c:440
...with a stack trace showing nfsd_file_put being called from
nfs4_show_open. This code has always tried to call fput while holding a
spinlock, but we recently changed this to use the filecache, and that
started triggering the might_sleep() in nfsd_file_put.
states_start takes and holds the cl_lock while iterating over the
client's states, and we can't sleep with that held.
Have the various nfs4_show_* functions instead hold the fi_lock instead
of taking a nfsd_file reference.
Fixes: 78599c42ae ("nfsd4: add file to display list of client's opens")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2138357
Reported-by: Zhi Li <yieli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
expfs.c has a bunch of dprintk statements which are unusable due to:
#define dprintk(fmt, args...) do{}while(0)
Use pr_debug so that they can be enabled dynamically.
Also make some minor changes to the debug statements to fix some
incorrect types, and remove __func__ which can be handled by dynamic
debug separately.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
rpc.nfsd stopped supporting NFSv2 a year ago. Take the next logical
step toward deprecating it and allow NFSv2 support to be compiled out.
Add a new CONFIG_NFSD_V2 option that can be turned off and rework the
CONFIG_NFSD_V?_ACL option dependencies. Add a description that
discourages enabling it.
Also, change the description of CONFIG_NFSD to state that the always-on
version is now 3 instead of 2.
Finally, add an #ifdef around "case 2:" in __write_versions. When NFSv2
is disabled at compile time, this should make the kernel ignore attempts
to disable it at runtime, but still error out when trying to enable it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfserrno() is common to all nfs versions, but nfsproc.c is specifically
for NFSv2. Move it to vfs.c, and the prototype to vfs.h.
While we're in here, remove the #ifdef EDQUOT check in this function.
It's apparently a holdover from the initial merge of the nfsd code in
1997. No other place in the kernel checks that that symbol is defined
before using it, so I think we can dispense with it here.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The kernel currently errors out if you attempt to enable or disable a
version that it doesn't recognize. Change it to ignore attempts to
disable an unrecognized version. If we don't support it, then there is
no harm in doing so.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
For some reason, the NFSv2 GETACL result encoder was fully converted
to use the new nfs_stream_encode_acl(), but the NFSv3 equivalent was
not similarly converted.
Fixes: 20798dfe24 ("NFSD: Update the NFSv3 GETACL result encoder to use struct xdr_stream")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The xdr_stream conversion inadvertently left some code that set the
page_len of the send buffer. The XDR stream encoders should handle
this automatically now.
This oversight adds garbage past the end of the Reply message.
Clients typically ignore the garbage, but NFSD does not need to send
it, as it leaks stale memory contents onto the wire.
Fixes: f8cba47344 ("NFSD: Update the NFSv2 GETACL result encoder to use struct xdr_stream")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Variable host_err is assigned a value that is never read, it is being
re-assigned a value in every different execution path in the following
switch statement. The assignment is redundant and can be removed.
Cleans up clang-scan warning:
warning: Value stored to 'host_err' is never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck had suggested reverting READ_PLUS so it returns a single DATA
segment covering the requested read range. This prepares the server for
a future "sparse read" function so support can easily be added without
needing to rip out the old READ_PLUS code at the same time.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
ts=4 can cause misunderstanding in code reading. It is better to replace
8 spaces with one tab.
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Jan Kara <jack@suse.cz>
ovl_change_flags() is an open-coded variant of fs/fcntl.c:setfl() and it
got missed by commit 164f4064ca ("keep iocb_flags() result cached in
struct file"); the same change applies there.
Reported-by: Pierre Labastie <pierre.labastie@neuf.fr>
Fixes: 164f4064ca ("keep iocb_flags() result cached in struct file")
Cc: <stable@vger.kernel.org> # v6.0
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216738
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In 27cfa25895 "ext2: fix fs corruption when trying to remove
a non-empty directory with IO error" a funny thing has happened:
- page = ext2_get_page(inode, i, dir_has_error, &page_addr);
+ page = ext2_get_page(inode, i, 0, &page_addr);
- if (IS_ERR(page)) {
- dir_has_error = 1;
- continue;
- }
+ if (IS_ERR(page))
+ goto not_empty;
And at not_empty: we hit ext2_put_page(page, page_addr), which does
put_page(page). Which, unless I'm very mistaken, should oops
immediately when given ERR_PTR(-E...) as page.
OK, shit happens, insufficiently tested patches included. But when
commit in question describes the fault-injection test that exercised
that particular failure exit...
Ow.
CC: stable@vger.kernel.org
Fixes: 27cfa25895 ("ext2: fix fs corruption when trying to remove a non-empty directory with IO error")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
ovl_dentry_revalidate_common() can be called in rcu-walk mode. As document
said, "in rcu-walk mode, d_parent and d_inode should not be used without
care".
Check inode here to protect access under rcu-walk mode.
Fixes: bccece1ead ("ovl: allow remote upper")
Reported-and-tested-by: syzbot+a4055c78774bbf3498bb@syzkaller.appspotmail.com
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Cc: <stable@vger.kernel.org> # v5.7
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We should check that the filehandles match before transferring the
sillyrename data to the newly looked-up dentry in case the name was
reused on the server.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When mounting from a NFSv4 referral, path->dentry can end up being a
negative dentry, so derive the struct nfs_server from the dentry
itself instead.
Fixes: 2b0143b5c9 ("VFS: normal filesystems (and lustre): d_inode() annotations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we're asked to recover open state while a delegation return is
outstanding, then the state manager thread cannot use a cached open, so
if the server returns a delegation, we can end up deadlocked behind the
pending delegreturn.
To avoid this problem, let's just ask the server not to give us a
delegation unless we're explicitly reclaiming one.
Fixes: be36e185bd ("NFSv4: nfs4_open_recover_helper() must set share access")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Since commit 1a34c8c9a4 ("NFS: Support larger readdir buffers") has
updated dtsize, and with recent improvements to the READDIRPLUS helper
heuristic, the heuristic may not trigger until many dentries are emitted
to userspace. This will cause many thousands of GETATTR calls for "ls
-l" when the directory's pagecache has already been populated. This
manifests as poor performance for long directory listings after an
initially fast "ls -l".
Fix this by emitting only 17 entries for any first pass through the NFS
directory's ->iterate_shared(), which allows userpace to prime the
counters for the heuristic.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The call to nfs4_label_init_security() should return a fully initialised
label.
Fixes: aa9c266962 ("NFS: Client implementation of Labeled-NFS")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We must not change the value of label->len if it is zero, since that
indicates we stored a label.
Fixes: b4487b9354 ("nfs: Fix getxattr kernel panic and memory overflow")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the server returns a reply that includes a security label, then we
must decode it whether or not we can store the results.
Fixes: 1e2f67da89 ("NFS: Remove the nfs4_label argument from decode_getattr_*() functions")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We need to clear the FATTR4_WORD2_SECURITY_LABEL bitmap flag
irrespective of whether or not the label is too long.
Fixes: aa9c266962 ("NFS: Client implementation of Labeled-NFS")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
POSIX typically only refreshes the user's supplementary group
information upon login. Since NFS servers may often refresh their
concept of the user supplementary group membership at their own cadence,
it is possible for the NFS client's access cache to become stale due to
the user's group membership changing on the server after the user has
already logged in on the client.
While it is reasonable to expect that such group membership changes are
rare, and that we do not want to optimise the cache to accommodate them,
it is also not unreasonable for the user to expect that if they log out
and log back in again, that the staleness would clear up.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmOC6QYACgkQiiy9cAdy
T1EaUgv/eEFpIzkRXuJHuPBlxWM7Gz6DiFkWd83ADcB15+4ADzPVdc1HqbnSw0/0
yPsxfhn95Ydo+zORPJ5puGUpveqNIen8xPeGaORGCCEMCYw5A58K2I8bkSYcMJUd
63hiLmw9ZpaXxkLXX/Y04hSYBwPIhwLSrT1kfPy7jru78DjvmOya1WHaBzQZQ0ej
374cDrG9yhWl+udFWJi1HxQt+fnJecaMNJy51BDeOeN3mM4j/kqHgYw2q5GfcbhL
nyr56UTxHvBpyciguKzqNIUVcFiANr7Il5Yz++NSjRcFUnhgoxPZ+eHMU1X6fpFB
fgLhcDsbBeWhPZnOAsb4X3HEhb41f61i5s2pYzwU7bhXYla0UrqEIS9RNjh3dgPx
WnRErVPiOwPP6mRl+064KITWcv3riDb8QbwEtqJQQT1TUPXM0b190wq59lHqIpJk
rV41bV7s3z839CulF+2Y9fuQVe5SL90kDSxF3gOue4SCzBfbl0Tww5gn3IIHvo+5
cyZZ3wQU
=89M+
-----END PGP SIGNATURE-----
Merge tag '6.1-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Two small cifs/smb3 client fixes:
- an unlock missing in an error path in copychunk_range found by
xfstest 476
- a fix for a use after free in a debug code path"
* tag '6.1-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix missing unlock in cifs_file_copychunk_range()
cifs: Use after free in debug code
- Fix rare data corruption on READ operations
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmOCPhcACgkQM2qzM29m
f5ewTRAAuAjJvtnsRxZBILs76U3KpAnn3ghtYKtzCmanq1UEJAlTz5vLLCqJRTXw
OajN4f/T/kD3bY5XxYcsbSMhVJhgd7JMx/DyPjp3rY+nS1cS/CB+TxN3VqhOZJDl
MfEJTN0MxWrq0pXKYxuPECmwk/C7jAU4vjWU+sDWhu4Anig564fdPxGGWw9vRfh5
YP8U8cuM5LIFCM2ZSjPsqWEXgd7pbYpYdOx6JPA74ftia8RBU1YVbIqZ6uFXiTnV
tpKAuQYem2ixV6UfsrVbeHDfsSkQcX2iaTGlcLB5y12nprV0wTOQM6lONTD878vh
37oc8zG9cdpWfcRzkyB8a58jHfSk74pMkV300C38CeV1KtajFVoRkoTvLYJdxbf8
WPcRcbsXbotb4+A1D2H6QvPKUbrlK+HIHe8POU67tKq5BI8xRw+ojTsJ2ANYr9Jt
OviLUraUnKFvCK2LN03+Op4l+UBTIdOGzGMqveILfxPD6zlTYJbQQ64Nih4JLfJG
uDN2892H4He3he5oD9Dzoh2xSWfLV5PJEAaN9wW7pyv8D5HfYBXWPan6JeaMHnfW
+NcoFQ3cSewyMZ1Rw0aCsKfXRGwikJrbAsrlvNe+uFRD6ueeWULp5fnTlOn+0tW1
tp+EzLOMoNkL9NW/rLeCIPN1MNXzgabgd4Diq8U3iTDRxf0FahU=
=zxjk
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.1-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Fix rare data corruption on READ operations
* tag 'nfsd-6.1-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix reads with a non-zero offset that don't end on a page boundary
- Fix a race between zonefs module initialization of sysfs attribute
directory and mounting a drive (from Xiaoxu).
- Fix active zone accounting in the rare case of an IO error due to a
zone transition to offline or read-only state (from me).
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCY4FHbQAKCRDdoc3SxdoY
dtVJAQDQPKQW7GlxgmfsxNZz4bR9CT2DHqoprN1g1bbKM9hYKwEApq0xtpBjyqX2
EaePIFvlYa67f5kWnW4XEd5vetbh0AI=
=imdG
-----END PGP SIGNATURE-----
Merge tag 'zonefs-6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fixes from Damien Le Moal:
- Fix a race between zonefs module initialization of sysfs attribute
directory and mounting a drive (from Xiaoxu).
- Fix active zone accounting in the rare case of an IO error due to a
zone transition to offline or read-only state (from me).
* tag 'zonefs-6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Fix active zone accounting
zonefs: Fix race between modprobe and mount
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmOBCJEACgkQxWXV+ddt
WDu5Nw/+P59ARfAm/4HRId4iL6UKozSMc+blWLeP9KkjcytdAfek0oGe3gZ7NJVK
8VYa93yNneCTkNFLIEpqEduGQjN04dr0odRUXD/kIR8EEtjbgDrH9ZmL47An5wVH
qE8ILlh2+DXk/QLTpjo8n4mm+MJDJYzfz/jVV9vl8ehMahjj1M0/KmO/vNvDbP2s
owWU1FBjX7TV6kHa+SQGqd1HfXS1YUx203I4SDmPj8vSXtysvSOWClT3HO6i6O5S
MSS3Me+rx9eMFMISNghL8I466+lPlGxK14DmLUE4l0kfoKyd4eHQw+ft76D6Twuz
JqjegAGA1nzqDO0XDXb4WPjrPKG8r8Ven2eInF3kncku9GyeEafL+L+nmj7PHsE7
dixWo2TQ9z1Wm/n1NWlU02ZSLdbetUtYTvZczUhevtNzuYUtILihcFZO3+Cp7V4p
R2WwJ5XXdfS8g8Q9kJCOuVd9fZ+3hQvEF1IwWCP9ZZfmIC6/4/uGGFB6TJu7HmZC
trpQYn9l5aP9L9Uq8t+9j+XoDEzQW0tZGpiYI9ypAa5Q5xbw3Ez2JNTbF7YVqQE2
iFDwuuy/X1iNvifniQgdodKVQLK/PcNrlcNb/gPG6cGCWjlTj3SKT9SlrwAgSDZW
pFWFb9NtN3ORjLeCiONo/ZGpZzM9/XQplub+4WuXQXGNJasRIoE=
=Q4JA
-----END PGP SIGNATURE-----
Merge tag 'for-6.1-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix a regression in nowait + buffered write
- in zoned mode fix endianness when comparing super block generation
- locking and lockdep fixes:
- fix potential sleeping under spinlock when setting qgroup limit
- lockdep warning fixes when btrfs_path is freed after copy_to_user
- do not modify log tree while holding a leaf from fs tree locked
- fix freeing of sysfs files of static features on error
- use kv.alloc for zone map allocation as a fallback to avoid warnings
due to high order allocation
- send, avoid unaligned encoded writes when attempting to clone range
* tag 'for-6.1-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs: normalize the error handling branch in btrfs_init_sysfs()
btrfs: do not modify log tree while holding a leaf from fs tree locked
btrfs: use kvcalloc in btrfs_get_dev_zone_info
btrfs: qgroup: fix sleep from invalid context bug in btrfs_qgroup_inherit()
btrfs: send: avoid unaligned encoded writes when attempting to clone range
btrfs: zoned: fix missing endianness conversion in sb_write_pointer
btrfs: free btrfs_path before copying subvol info to userspace
btrfs: free btrfs_path before copying fspath to userspace
btrfs: free btrfs_path before copying inodes to userspace
btrfs: free btrfs_path before copying root refs to userspace
btrfs: fix assertion failure and blocking during nowait buffered write
OFFSET_MAX is self-annotated and more readable.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There have been a lot of hotfixes this cycle, and this is quite a large
batch given how far we are into the -rc cycle. Presumably a reflection of
the unusually large amount of MM material which went into 6.1-rc1.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY4Bd6gAKCRDdBJ7gKXxA
jvX6AQCsG1ld24kMpdD+70XXUyC29g/6/jribgtZApHyDYjxSwD/WmLNpPlUPRax
WB071Y5w65vjSTUKvwU0OLGbHwyxgAw=
=swD5
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"24 MM and non-MM hotfixes. 8 marked cc:stable and 16 for post-6.0
issues.
There have been a lot of hotfixes this cycle, and this is quite a
large batch given how far we are into the -rc cycle. Presumably a
reflection of the unusually large amount of MM material which went
into 6.1-rc1"
* tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
test_kprobes: fix implicit declaration error of test_kprobes
nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty
mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1
mm: fix unexpected changes to {failslab|fail_page_alloc}.attr
swapfile: fix soft lockup in scan_swap_map_slots
hugetlb: fix __prep_compound_gigantic_page page flag setting
kfence: fix stack trace pruning
proc/meminfo: fix spacing in SecPageTables
mm: multi-gen LRU: retry folios written back while isolated
mailmap: update email address for Satya Priya
mm/migrate_device: return number of migrating pages in args->cpages
kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible
MAINTAINERS: update Alex Hung's email address
mailmap: update Alex Hung's email address
mm: mmap: fix documentation for vma_mas_szero
mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed
mm/memory: return vm_fault_t result from migrate_to_ram() callback
mm: correctly charge compressed memory to its memcg
ipc/shm: call underlying open/close vm_ops
gcov: clang: fix the buffer overflow issue
...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY4A7qAAKCRBZ7Krx/gZQ
61SoAP9bWYuW44RNEkitOIPCi/bmWy5D9alVEjWinmxzsT94NgEA5rfc4uuSCLp1
se+Qgcu5iNKpHTXIOAOX7ozCCQJzQgo=
=GK/t
-----END PGP SIGNATURE-----
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"A couple of fixes, one of them for this cycle regression..."
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: vfs_tmpfile: ensure O_EXCL flag is enforced
fs: use acquire ordering in __fget_light()
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If a file zone transitions to the offline or readonly state from an
active state, we must clear the zone active flag and decrement the
active seq file counter. Do so in zonefs_account_active() using the new
zonefs inode flags ZONEFS_ZONE_OFFLINE and ZONEFS_ZONE_READONLY. These
flags are set if necessary in zonefs_check_zone_condition() based on the
result of report zones operation after an IO error.
Fixes: 87c9ce3ffe ("zonefs: Add active seq file accounting")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Commit 868f9f2f8e ("vfs: fix copy_file_range() regression in cross-fs
copies") removed fallback to generic_copy_file_range() for cross-fs
cases inside vfs_copy_file_range().
To preserve behavior of nfsd and ksmbd server-side-copy, the fallback to
generic_copy_file_range() was added in nfsd and ksmbd code, but that
call is missing sb_start_write(), fsnotify hooks and more.
Ideally, nfsd and ksmbd would pass a flag to vfs_copy_file_range() that
will take care of the fallback, but that code would be subtle and we got
vfs_copy_file_range() logic wrong too many times already.
Instead, add a flag to explicitly request vfs_copy_file_range() to
perform only generic_copy_file_range() and let nfsd and ksmbd use this
flag only in the fallback path.
This choise keeps the logic changes to minimum in the non-nfsd/ksmbd code
paths to reduce the risk of further regressions.
Fixes: 868f9f2f8e ("vfs: fix copy_file_range() regression in cross-fs copies")
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Tested-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the pointless keying argument and associated enum and pass the
fill_super callback and a "bool reconf" instead. Also mark the function
static given that there are no users outside of super.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The only real difference is in filling per-thread notes - getting
the values of registers. And this is the only part that is worth
an ifdef - we don't need to duplicate the logics regarding gathering
threads, filling other notes, etc.
It would've been hard to do back when regset-based variant had been
introduced, mostly due to sharing bits and pieces of helpers with
aout coredumps. As the result, too much had been duplicated and
the copies had drifted away since then. Now it can be done cleanly...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't bother with pointless macros - we are not sharing it with aout coredumps
anymore. Just convert the underlying functions to the same arguments (nobody
uses regs, actually) and call them elf_core_copy_task_fpregs(). And unexport
the entire bunch, while we are at it.
[added missing includes in arch/{csky,m68k,um}/kernel/process.c to avoid extra
warnings about the lack of externs getting added to huge piles for those
files. Pointless, but...]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
copy_mnt_ns() has the old tree copied, with mntns binding *and* anything
bound on top of them skipped. Then it proceeds to walk both trees in
parallel. Unfortunately, it doesn't get the "skip the stuff we'd skipped
when copying" quite right. Consequences are minor (the ->mnt_root
comparison will return the situation to sanity pretty soon and the worst
we get is the unexpected subset of opened non-directories being switched
to new namespace), but it's confusing enough and it's not hard to get
the expected behaviour...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
and a use-after-free that can be triggered by a maliciously corrupted
file system.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmN+1NoACgkQ8vlZVpUN
gaNY8Qf/ZJyI8JZTEyrd7KxM1S6eDwUth1kYsQQtqQojd5tTqYOsslSiTp2Yw+LI
4Gw75tDiMA6CYt+BuvbbgOk36YXaPW69sB+uParL/hgK005R50raQZ7oMdjQba4Z
ODa+x27r5SVZIrcEAcHX15+BcjeCDZ/e5RMsV37ww+LsKNlnWNYn4o3S6eIv1ERo
0iqgasbaaATCy37gStRvtnbsyPWDdwL5XlJg0XnqFLzo6Yz1NMcXaxEZehyCaCSc
SixkjSR1gafQu/0ZTdVaH1vXQPmFwCEMOGONdbb+FrQGnIBFv/kvgXeTKTkjOP1+
2GV7UgdPXeUNECTrMBieEpxsLqpcZw==
=8ya6
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix a regression in the lazytime code that was introduced in v6.1-rc1,
and a use-after-free that can be triggered by a maliciously corrupted
file system"
* tag 'ext4_for_linus_stable2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
fs: do not update freeing inode i_io_list
ext4: fix use-after-free in ext4_ext_shift_extents
The devnode() in struct class should not be modifying the device that is
passed into it, so mark it as a const * and propagate the function
signature changes out into all relevant subsystems that use this
callback.
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Justin Sanders <justin@coraid.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Cc: Liam Mark <lmark@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Brian Starkey <Brian.Starkey@arm.com>
Cc: John Stultz <jstultz@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sean Young <sean@mess.org>
Cc: Frank Haverkamp <haver@linux.ibm.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Xie Yongji <xieyongji@bytedance.com>
Cc: Gautam Dawar <gautam.dawar@xilinx.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Eli Cohen <elic@nvidia.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: alsa-devel@alsa-project.org
Cc: dri-devel@lists.freedesktop.org
Cc: kvm@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-block@vger.kernel.org
Cc: linux-input@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Link: https://lore.kernel.org/r/20221123122523.1332370-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This was found when virtual machines with nfs-mounted qcow2 disks
failed to boot properly.
Reported-by: Anders Blomdell <anders.blomdell@control.lth.se>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2142132
Fixes: bfbfb6182a ("nfsd_splice_actor(): handle compound pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The type of a->key[0] is char in fscache_volume_same(). If the length
of cache volume key is greater than 127, the value of a->key[0] is less
than 0. In this case, klen becomes much larger than 255 after type
conversion, because the type of klen is size_t. As a result, memcmp()
is read out of bounds.
This causes a slab-out-of-bounds Read in __fscache_acquire_volume(), as
reported by Syzbot.
Fix this by changing the type of the stored key to "u8 *" rather than
"char *" (it isn't a simple string anyway). Also put in a check that
the volume name doesn't exceed NAME_MAX.
BUG: KASAN: slab-out-of-bounds in memcmp+0x16f/0x1c0 lib/string.c:757
Read of size 8 at addr ffff888016f3aa90 by task syz-executor344/3613
Call Trace:
memcmp+0x16f/0x1c0 lib/string.c:757
memcmp include/linux/fortify-string.h:420 [inline]
fscache_volume_same fs/fscache/volume.c:133 [inline]
fscache_hash_volume fs/fscache/volume.c:171 [inline]
__fscache_acquire_volume+0x76c/0x1080 fs/fscache/volume.c:328
fscache_acquire_volume include/linux/fscache.h:204 [inline]
v9fs_cache_session_get_cookie+0x143/0x240 fs/9p/cache.c:34
v9fs_session_init+0x1166/0x1810 fs/9p/v9fs.c:473
v9fs_mount+0xba/0xc90 fs/9p/vfs_super.c:126
legacy_get_tree+0x105/0x220 fs/fs_context.c:610
vfs_get_tree+0x89/0x2f0 fs/super.c:1530
do_new_mount fs/namespace.c:3040 [inline]
path_mount+0x1326/0x1e20 fs/namespace.c:3370
do_mount fs/namespace.c:3383 [inline]
__do_sys_mount fs/namespace.c:3591 [inline]
__se_sys_mount fs/namespace.c:3568 [inline]
__x64_sys_mount+0x27f/0x300 fs/namespace.c:3568
Fixes: 62ab633523 ("fscache: Implement volume registration")
Reported-by: syzbot+a76f6a6e524cf2080aa3@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Zhang Peng <zhangpeng362@huawei.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/Y3OH+Dmi0QIOK18n@codewreck.org/ # Zhang Peng's v1 fix
Link: https://lore.kernel.org/r/20221115140447.2971680-1-zhangpeng362@huawei.com/ # Zhang Peng's v2 fix
Link: https://lore.kernel.org/r/166869954095.3793579.8500020902371015443.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warnings. Many of these are about a function's
return value, so use the kernel-doc Return: format to fix those
Use % prefix on numeric constant values.
dir.c: fix typos/spellos
file.c fix typo: s/taret/target/
Fix all of these kernel-doc warnings:
dir.c:305: warning: missing initial short description on line:
* kernfs_name_hash
dir.c:137: warning: No description found for return value of 'kernfs_path_from_node_locked'
dir.c:196: warning: No description found for return value of 'kernfs_name'
dir.c:224: warning: No description found for return value of 'kernfs_path_from_node'
dir.c:292: warning: No description found for return value of 'kernfs_get_parent'
dir.c:312: warning: No description found for return value of 'kernfs_name_hash'
dir.c:404: warning: No description found for return value of 'kernfs_unlink_sibling'
dir.c:588: warning: No description found for return value of 'kernfs_node_from_dentry'
dir.c:806: warning: No description found for return value of 'kernfs_find_ns'
dir.c:879: warning: No description found for return value of 'kernfs_find_and_get_ns'
dir.c:904: warning: No description found for return value of 'kernfs_walk_and_get_ns'
dir.c:927: warning: No description found for return value of 'kernfs_create_root'
dir.c:996: warning: No description found for return value of 'kernfs_root_to_node'
dir.c:1016: warning: No description found for return value of 'kernfs_create_dir_ns'
dir.c:1048: warning: No description found for return value of 'kernfs_create_empty_dir'
dir.c:1306: warning: No description found for return value of 'kernfs_next_descendant_post'
dir.c:1568: warning: No description found for return value of 'kernfs_remove_self'
dir.c:1630: warning: No description found for return value of 'kernfs_remove_by_name_ns'
dir.c:1667: warning: No description found for return value of 'kernfs_rename_ns'
file.c:66: warning: No description found for return value of 'of_on'
file.c:88: warning: No description found for return value of 'kernfs_deref_open_node_locked'
file.c:1036: warning: No description found for return value of '__kernfs_create_file'
inode.c💯 warning: No description found for return value of 'kernfs_setattr'
mount.c:160: warning: No description found for return value of 'kernfs_root_from_sb'
mount.c:198: warning: No description found for return value of 'kernfs_node_dentry'
mount.c:302: warning: No description found for return value of 'kernfs_super_ns'
mount.c:318: warning: No description found for return value of 'kernfs_get_tree'
symlink.c:28: warning: No description found for return value of 'kernfs_create_link'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20221112031456.22980-1-rdunlap@infradead.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Although kset_unregister() can eventually remove all attribute files,
explicitly rolling back with the matching function makes the code logic
look clearer.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode in full mode, or when logging xattrs or when logging
the dir index items of a directory, we are modifying the log tree while
holding a read lock on a leaf from the fs/subvolume tree. This can lead to
a deadlock in rare circumstances, but it is a real possibility, and it was
recently reported by syzbot with the following trace from lockdep:
WARNING: possible circular locking dependency detected
6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.1/16154 is trying to acquire lock:
ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
but task is already holding lock:
ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-log-00){++++}-{3:3}:
down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634
__btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135
btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline]
btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280
btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline]
btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998
btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209
btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021
log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258
copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403
copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873
btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495
btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982
btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083
btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921
vfs_fsync_range+0x13e/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2856 [inline]
iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
btrfs_direct_write fs/btrfs/file.c:1536 [inline]
btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
call_write_iter include/linux/fs.h:2160 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_iter_write+0x74/0xa0 fs/read_write.c:902
iter_file_splice_write+0x745/0xc90 fs/splice.c:686
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x114/0x180 fs/splice.c:931
splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
do_splice_direct+0x1ab/0x280 fs/splice.c:974
do_sendfile+0xb19/0x1270 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #1 (btrfs-tree-00){++++}-{3:3}:
__lock_release kernel/locking/lockdep.c:5382 [inline]
lock_release+0x371/0x810 kernel/locking/lockdep.c:5688
up_write+0x2a/0x520 kernel/locking/rwsem.c:1614
btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238
search_leaf fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074
btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133
btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746
btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
__btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline]
__btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153
flush_space+0x147/0xe90 fs/btrfs/space-info.c:728
btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086
process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
worker_thread+0x669/0x1090 kernel/workqueue.c:2436
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3097 [inline]
check_prevs_add kernel/locking/lockdep.c:3216 [inline]
validate_chain kernel/locking/lockdep.c:3831 [inline]
__lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055
lock_acquire kernel/locking/lockdep.c:5668 [inline]
lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
__mutex_lock_common kernel/locking/mutex.c:603 [inline]
__mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747
__btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
__btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline]
btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285
btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554
evict+0x2ed/0x6b0 fs/inode.c:664
dispose_list+0x117/0x1e0 fs/inode.c:697
prune_icache_sb+0xeb/0x150 fs/inode.c:896
super_cache_scan+0x391/0x590 fs/super.c:106
do_shrink_slab+0x464/0xce0 mm/vmscan.c:843
shrink_slab_memcg mm/vmscan.c:912 [inline]
shrink_slab+0x388/0x660 mm/vmscan.c:991
shrink_node_memcgs mm/vmscan.c:6088 [inline]
shrink_node+0x93d/0x1f30 mm/vmscan.c:6117
shrink_zones mm/vmscan.c:6355 [inline]
do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417
try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732
reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393
mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578
try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816
try_charge mm/memcontrol.c:2827 [inline]
charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889
__mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910
mem_cgroup_charge include/linux/memcontrol.h:667 [inline]
__filemap_add_folio+0x615/0xf80 mm/filemap.c:852
filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934
__filemap_get_folio+0x389/0xd80 mm/filemap.c:1976
pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104
find_or_create_page include/linux/pagemap.h:612 [inline]
alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588
btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline]
btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988
__btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440
btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595
btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038
btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137
update_log_root fs/btrfs/tree-log.c:2841 [inline]
btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064
btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947
vfs_fsync_range+0x13e/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2856 [inline]
iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
btrfs_direct_write fs/btrfs/file.c:1536 [inline]
btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
call_write_iter include/linux/fs.h:2160 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_iter_write+0x74/0xa0 fs/read_write.c:902
iter_file_splice_write+0x745/0xc90 fs/splice.c:686
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x114/0x180 fs/splice.c:931
splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
do_splice_direct+0x1ab/0x280 fs/splice.c:974
do_sendfile+0xb19/0x1270 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-log-00);
lock(btrfs-tree-00);
lock(btrfs-log-00);
lock(&delayed_node->mutex);
Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
lock dependency when we are COWing extent buffers for the log tree and we
have two tasks modifying the log tree, with each one in one of the
following 2 scenarios:
1) Modifying the log tree triggers an extent buffer allocation while
holding a write lock on a parent extent buffer from the log tree.
Allocating the pages for an extent buffer, or the extent buffer
struct, can trigger inode eviction and finally the inode eviction
will trigger a release/remove of a delayed node, which requires
taking the delayed node's mutex;
2) Allocating a metadata extent for a log tree can trigger the async
reclaim thread and make us wait for it to release enough space and
unblock our reservation ticket. The reclaim thread can start flushing
delayed items, and that in turn results in the need to lock delayed
node mutexes and in the need to write lock extent buffers of a
subvolume tree - all this while holding a write lock on the parent
extent buffer in the log tree.
So one task in scenario 1) running in parallel with another task in
scenario 2) could lead to a deadlock, one wanting to lock a delayed node
mutex while having a read lock on a leaf from the subvolume, while the
other is holding the delayed node's mutex and wants to write lock the same
subvolume leaf for flushing delayed items.
Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the
fs/subvolume leaf and use the clone leaf instead.
Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a followup to a previous commit of mine [0], which added the
allow_sys_admin_access && capable(CAP_SYS_ADMIN) check. This patch
rearranges the order of checks in fuse_allow_current_process without
changing functionality.
Commit 9ccf47b26b ("fuse: Add module param for CAP_SYS_ADMIN access
bypassing allow_other") added allow_sys_admin_access &&
capable(CAP_SYS_ADMIN) check to the beginning of the function, with the
reasoning that allow_sys_admin_access should be an 'escape hatch' for users
with CAP_SYS_ADMIN, allowing them to skip any subsequent checks.
However, placing this new check first results in many capable() calls when
allow_sys_admin_access is set, where another check would've also returned
1. This can be problematic when a BPF program is tracing capable() calls.
At Meta we ran into such a scenario recently. On a host where
allow_sys_admin_access is set but most of the FUSE access is from processes
which would pass other checks - i.e. they don't need CAP_SYS_ADMIN 'escape
hatch' - this results in an unnecessary capable() call for each fs op. We
also have a daemon tracing capable() with BPF and doing some data
collection, so tracing these extraneous capable() calls has the potential
to regress performance for an application doing many FUSE ops.
So rearrange the order of these checks such that CAP_SYS_ADMIN 'escape
hatch' is checked last. Add a small helper, fuse_permissible_uidgid, to
make the logic easier to understand. Previously, if allow_other is set on
the fuse_conn, uid/git checking doesn't happen as current_in_userns result
is returned. These semantics are maintained here: fuse_permissible_uidgid
check only happens if allow_other is not set.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In general, as of now, in FUSE, direct writes on the same file are
serialized over inode lock i.e we hold inode lock for the full duration of
the write request. I could not find in fuse code and git history a comment
which clearly explains why this exclusive lock is taken for direct writes.
Following might be the reasons for acquiring an exclusive lock but not be
limited to
1) Our guess is some USER space fuse implementations might be relying on
this lock for serialization.
2) The lock protects against file read/write size races.
3) Ruling out any issues arising from partial write failures.
This patch relaxes the exclusive lock for direct non-extending writes only.
File size extending writes might not need the lock either, but we are not
entirely sure if there is a risk to introduce any kind of regression.
Furthermore, benchmarking with fio does not show a difference between patch
versions that take on file size extension a) an exclusive lock and b) a
shared lock.
A possible example of an issue with i_size extending writes are write error
cases. Some writes might succeed and others might fail for file system
internal reasons - for example ENOSPACE. With parallel file size extending
writes it _might_ be difficult to revert the action of the failing write,
especially to restore the right i_size.
With these changes, we allow non-extending parallel direct writes on the
same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If
this flag is set on the file (flag is passed from libfuse to fuse kernel as
part of file open/create), we do not take exclusive lock anymore, but
instead use a shared lock that allows non-extending writes to run in
parallel. FUSE implementations which rely on this inode lock for
serialization can continue to do so and serialized direct writes are still
the default. Implementations that do not do write serialization need to be
updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file
open/create reply.
On patch review there were concerns that network file systems (or vfs
multiple mounts of the same file system) might have issues with parallel
writes. We believe this is not the case, as this is just a local lock,
which network file systems could not rely on anyway. I.e. this lock is
just for local consistency.
Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Return the value fuse_dev_release() directly instead of storing it in
another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
A while ago we introduced a dedicated vfs{g,u}id_t type in commit
1e5267cd08 ("mnt_idmapping: add vfs{g,u}id_t"). We already switched over
a good part of the VFS. Ultimately we will remove all legacy idmapped
mount helpers that operate only on k{g,u}id_t in favor of the new type safe
helpers that operate on vfs{g,u}id_t.
Cc: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 8ed1f0e22f ("fs/fuse: fix ioctl type confusion") fixed a type
confusion bug by adding an ->f_op comparison.
Based on some off-list discussion back then, another check was added to
compare the f_cred->user_ns. This is not for security reasons, but was
based on the idea that a FUSE device FD should be using the UID/GID
mappings of its f_cred->user_ns, and those translations are done using
fc->user_ns, which matches the f_cred->user_ns of the initial FUSE device
FD thanks to the check in fuse_fill_super(). See also commit 8cb08329b0
("fuse: Support fuse filesystems outside of init_user_ns").
But FUSE_DEV_IOC_CLONE is, at a higher level, a *cloning* operation that
copies an existing context (with a weird API that involves first opening
/dev/fuse, then tying the resulting new FUSE device FD to an existing FUSE
instance). So if an application is already passing FUSE FDs across userns
boundaries and dealing with the resulting ID mapping complications somehow,
it doesn't make much sense to block this cloning operation.
I've heard that this check is an obstacle for some folks, and I don't see a
good reason to keep it, so remove it.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The previous commit df8629af29 ("fuse: always revalidate if exclusive
create") ensures that the dentries are revalidated on O_EXCL creates. This
commit complements it by also performing revalidation for rename target
dentries. Otherwise, a rename target file that only exists in kernel
dentry cache but not in the filesystem will result in EEXIST if
RENAME_NOREPLACE flag is used.
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add a flag to entry expiration that lets the filesystem expire a dentry
without kicking it out from the cache immediately.
This makes a difference for overmounted dentries, where plain invalidation
would detach all submounts before dropping the dentry from the cache. If
only expiry is set on the dentry, then any overmounts are left alone and
until ->d_revalidate() is called.
Note: ->d_revalidate() is not called for the case of following a submount,
so invalidation will only be triggered for the non-overmounted case. The
dentry could also be mounted in a different mount instance, in which case
any submounts will still be detached.
Suggested-by: Jakob Blomer <jblomer@cern.ch>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The use of kmap() is being deprecated in favor of kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
the mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and still valid.
Therefore, replace kmap() with kmap_local_page() in fuse_readdir_cached(),
it being the only call site of kmap() currently left in fs/fuse.
Cc: "Venkataramanan, Anirudh" <anirudh.venkataramanan@intel.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
file_modified() must be called with inode lock held. fuse_fallocate()
didn't lock the inode in case of just FALLOC_KEEP_SIZE flags value, which
resulted in a kernel Warning in notify_change().
Lock the inode unconditionally, like all other fallocate implementations
do.
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Reported-and-tested-by: syzbot+462da39f0667b357c4b6@syzkaller.appspotmail.com
Fixes: 4a6f278d48 ("fuse: add file_modified() to fallocate")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When extending segments, nilfs_sufile_alloc() is called to get an
unassigned segment, then mark it as dirty to avoid accidentally allocating
the same segment in the future.
But for some special cases such as a corrupted image it can be unreliable.
If such corruption of the dirty state of the segment occurs, nilfs2 may
reallocate a segment that is in use and pick the same segment for writing
twice at the same time.
This will cause the problem reported by syzkaller:
https://syzkaller.appspot.com/bug?id=c7c4748e11ffcc367cef04f76e02e931833cbd24
This case started with segbuf1.segnum = 3, nextnum = 4 when constructed.
It supposed segment 4 has already been allocated and marked as dirty.
However the dirty state was corrupted and segment 4 usage was not dirty.
For the first time nilfs_segctor_extend_segments() segment 4 was allocated
again, which made segbuf2 and next segbuf3 had same segment 4.
sb_getblk() will get same bh for segbuf2 and segbuf3, and this bh is added
to both buffer lists of two segbuf. It makes the lists broken which
causes NULL pointer dereference.
Fix the problem by setting usage as dirty every time in
nilfs_sufile_mark_dirty(), which is called during constructing current
segment to be written out and before allocating next segment.
[chenzhongjin@huawei.com: add lock protection per Ryusuke]
Link: https://lkml.kernel.org/r/20221121091141.214703-1-chenzhongjin@huawei.com
Link: https://lkml.kernel.org/r/20221118063304.140187-1-chenzhongjin@huawei.com
Fixes: 9ff05123e3 ("nilfs2: segment constructor")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: <syzbot+77e4f0...@syzkaller.appspotmail.com>
Reported-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SecPageTables has a tab after it instead of a space, this can break
fragile parsers that depend on spaces after the stat names.
Link: https://lkml.kernel.org/r/20221117043247.133294-1-yosryahmed@google.com
Fixes: ebc97a52b5 ("mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Because that's what Christoph wants for this error handling path
only XFS uses.
It requires a new iomap export for handling errors over delalloc
ranges. This is basically the XFS code as is stands, but even though
Christoph wants this as iomap funcitonality, we still have
to call it from the filesystem specific ->iomap_end callback, and
call into the iomap code with yet another filesystem specific
callback to punch the delalloc extent within the defined ranges.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_buffered_write_iomap_end() currently converts the byte ranges
passed to it to filesystem blocks to pass them to the bmap code to
punch out delalloc blocks, but then has to convert filesytem
blocks back to byte ranges for page cache truncate.
We're about to make the page cache truncate go away and replace it
with a page cache walk, so having to convert everything to/from/to
filesystem blocks is messy and error-prone. It is much easier to
pass around byte ranges and convert to page indexes and/or
filesystem blocks only where those units are needed.
In preparation for the page cache walk being added, add a helper
that converts byte ranges to filesystem blocks and calls
xfs_bmap_punch_delalloc_range() and convert
xfs_buffered_write_iomap_end() to calculate limits in byte ranges.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
xfs_buffered_write_iomap_end() has a comment about the safety of
punching delalloc extents based holding the IOLOCK_EXCL. This
comment is wrong, and punching delalloc extents is not race free.
When we punch out a delalloc extent after a write failure in
xfs_buffered_write_iomap_end(), we punch out the page cache with
truncate_pagecache_range() before we punch out the delalloc extents.
At this point, we only hold the IOLOCK_EXCL, so there is nothing
stopping mmap() write faults racing with this cleanup operation,
reinstantiating a folio over the range we are about to punch and
hence requiring the delalloc extent to be kept.
If this race condition is hit, we can end up with a dirty page in
the page cache that has no delalloc extent or space reservation
backing it. This leads to bad things happening at writeback time.
To avoid this race condition, we need the page cache truncation to
be atomic w.r.t. the extent manipulation. We can do this by holding
the mapping->invalidate_lock exclusively across this operation -
this will prevent new pages from being inserted into the page cache
whilst we are removing the pages and the backing extent and space
reservation.
Taking the mapping->invalidate_lock exclusively in the buffered
write IO path is safe - it naturally nests inside the IOLOCK (see
truncate and fallocate paths). iomap_zero_range() can be called from
under the mapping->invalidate_lock (from the truncate path via
either xfs_zero_eof() or xfs_truncate_page(), but iomap_zero_iter()
will not instantiate new delalloc pages (because it skips holes) and
hence will not ever need to punch out delalloc extents on failure.
Fix the locking issue, and clean up the code logic a little to avoid
unnecessary work if we didn't allocate the delalloc extent or wrote
the entire region we allocated.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Commit 57fe60df62 ("reiserfs: add atomic addition of selinux attributes
during inode creation") defined reiserfs_security_free() to free the name
and value of a security xattr allocated by the active LSM through
security_old_inode_init_security(). However, this function is not called
in the reiserfs code.
Thus, add a call to reiserfs_security_free() whenever
reiserfs_security_init() is called, and initialize value to NULL, to avoid
to call kfree() on an uninitialized pointer.
Finally, remove the kfree() for the xattr name, as it is not allocated
anymore.
Fixes: 57fe60df62 ("reiserfs: add atomic addition of selinux attributes during inode creation")
Cc: stable@vger.kernel.org
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Mimi Zohar <zohar@linux.ibm.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
After commit cbfecb927f ("fs: record I_DIRTY_TIME even if inode
already has I_DIRTY_INODE") writeback_single_inode can push inode with
I_DIRTY_TIME set to b_dirty_time list. In case of freeing inode with
I_DIRTY_TIME set this can happen after deletion of inode from i_io_list
at evict. Stack trace is following.
evict
fat_evict_inode
fat_truncate_blocks
fat_flush_inodes
writeback_inode
sync_inode_metadata(inode, sync=0)
writeback_single_inode(inode, wbc) <- wbc->sync_mode == WB_SYNC_NONE
This will lead to use after free in flusher thread.
Similar issue can be triggered if writeback_single_inode in the
stack trace update inode->i_io_list. Add explicit check to avoid it.
Fixes: cbfecb927f ("fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE")
Reported-by: syzbot+6ba92bd00d5093f7e371@syzkaller.appspotmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Svyatoslav Feldsherov <feldsherov@google.com>
Link: https://lore.kernel.org/r/20221115202001.324188-1-feldsherov@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The call, kobject_get_ownership(), does not modify the kobject passed
into it, so make it const. This propagates down into the kobj_type
function callbacks so make the kobject passed into them also const,
ensuring that nothing in the kobject is being changed here.
This helps make it more obvious what calls and callbacks do, and do not,
modify structures passed to them.
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: linux-nfs@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: netdev@vger.kernel.org
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20221121094649.1556002-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch uses assert_spin_locked() instead of lockdep_is_held()
where it's available to use because lockdep_is_held() is only available
if CONFIG_LOCKDEP is set.
In other cases like lockdep_sock_is_held() we surround it by a
CONFIG_LOCKDEP idef.
Fixes: dbb751ffab ("fs: dlm: parallelize lowcomms socket handling")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This is identical to eventfd_signal(), but it allows the caller to pass
in a mask to be used for the poll wakeup key. The use case is avoiding
repeated multishot triggers if we have a dependency between eventfd and
io_uring.
If we setup an eventfd context and register that as the io_uring eventfd,
and at the same time queue a multishot poll request for the eventfd
context, then any CQE posted will repeatedly trigger the multishot request
until it terminates when the CQ ring overflows.
In preparation for io_uring detecting this circular dependency, add the
mentioned helper so that io_uring can pass in EPOLL_URING as part of the
poll wakeup key.
Cc: stable@vger.kernel.org # 6.0
[axboe: fold in !CONFIG_EVENTFD fix from Zhang Qilong]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a race between modprobe and mount as below:
modprobe zonefs | mount -t zonefs
--------------------------------|-------------------------
zonefs_init |
register_filesystem [1] |
| zonefs_fill_super [2]
zonefs_sysfs_init [3] |
1. register zonefs suceess, then
2. user can mount the zonefs
3. if sysfs initialize failed, the module initialize failed.
Then the mount process maybe some error happened since the module
initialize failed.
Let's register zonefs after all dependency resource ready. And
reorder the dependency resource release in module exit.
Fixes: 9277a6d4fb ("zonefs: Export open zone resource information through sysfs")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Add a blk_crypto_config_supported_natively helper that wraps
__blk_crypto_cfg_supported to retrieve the crypto_profile from the
request queue. With this fscrypt can stop including
blk-crypto-profile.h and rely on the public consumer interface in
blk-crypto.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221114042944.1009870-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch all public blk-crypto interfaces to use struct block_device
arguments to specify the device they operate on instead of th
request_queue, which is a block layer implementation detail.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221114042944.1009870-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The following error occurred during the fsstress test:
XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452
The problem was that inode race condition causes incorrect i_nlink to be
written to disk, and then it is read into memory. Consider the following
call graph, inodes that are marked as both XFS_IFLUSHING and
XFS_IRECLAIMABLE, i_nlink will be reset to 1 and then restored to original
value in xfs_reinit_inode(). Therefore, the i_nlink of directory on disk
may be set to 1.
xfsaild
xfs_inode_item_push
xfs_iflush_cluster
xfs_iflush
xfs_inode_to_disk
xfs_iget
xfs_iget_cache_hit
xfs_iget_recycle
xfs_reinit_inode
inode_init_always
xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal
inode state and can race with other RCU protected inode lookups. On the
read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu +
ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from
racing inode updates (during transactions) by that lock.
Fixes: ff7bebeb91 ("xfs: refactor the inode recycling code") # goes further back than this
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
xfstests generic/013 and generic/476 reported WARNING as follows:
WARNING: lock held when returning to user space!
6.1.0-rc5+ #4 Not tainted
------------------------------------------------
fsstress/504233 is leaving the kernel with locks still held!
2 locks held by fsstress/504233:
#0: ffff888054c38850 (&sb->s_type->i_mutex_key#21){+.+.}-{3:3}, at:
lock_two_nondirectories+0xcf/0xf0
#1: ffff8880b8fec750 (&sb->s_type->i_mutex_key#21/4){+.+.}-{3:3}, at:
lock_two_nondirectories+0xb7/0xf0
This will lead to deadlock and hungtask.
Fix this by releasing locks when failed to write out on a file range in
cifs_file_copychunk_range().
Fixes: 3e3761f1ec ("smb3: use filemap_write_and_wait_range instead of filemap_write_and_wait")
Cc: stable@vger.kernel.org # 6.0
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This patch is rework of lowcomms handling, the main goal was here to
handle recvmsg() and sendpage() to run parallel. Parallel in two senses:
1. per connection and 2. that recvmsg()/sendpage() doesn't block each
other.
Currently recvmsg()/sendpage() cannot run parallel because two
workqueues "dlm_recv" and "dlm_send" are ordered workqueues. That means
only one work item can be executed. The amount of queue items will be
increased about the amount of nodes being inside the cluster. The current
two workqueues for sending and receiving can also block each other if the
same connection is executed at the same time in dlm_recv and dlm_send
workqueue because a per connection mutex for the socket handling.
To make it more parallel we introduce one "dlm_io" workqueue which is
not an ordered workqueue, the amount of workers are not limited. Due
per connection flags SEND/RECV pending we schedule workers ordered per
connection and per send and receive task. To get rid of the mutex
blocking same workers to do socket handling we switched to a semaphore
which handles socket operations as read lock and sock releases as write
operations, to prevent sock_release() being called while the socket is
being used.
There might be more optimization removing the semaphore and replacing it
with other synchronization mechanism, however due other circumstances
e.g. othercon behaviour it seems complicated to doing this change. I
added comments to remove the othercon handling and moving to a different
synchronization mechanism as this is done. We need to do that to the next
dlm major version upgrade because it is not backwards compatible with the
current connect mechanism.
The processing of dlm messages need to be still handled by a ordered
workqueue. An dlm_process ordered workqueue was introduced which gets
filled by the receive worker. This is probably the next bottleneck of
DLM but the application can't currently parse dlm messages parallel. A
comment was introduced to lift the workqueue context of dlm processing
in a non-sleepable softirq to get messages processing done fast.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch removes a init of an error value to -EINVAL which is not
necessary.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch changes the handling of calling the original
sk_error_report() by not putting it on the stack and calling it later.
If the listen_sock.sk_error_report() is NULL in this moment it indicates
a bug in our implementation.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch removes null checks on private data for sockets. If we have a
null dereference there we having a bug in our implementation that such
callback occurs in this state.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch merges the dlm_node_addrs lookup list to the connection
structure. It is a per node mapping to some configuration setup by
configfs. We don't need two lookup structures. The connection hash has
now a lifetime like the dlm_node_addrs entries. Means we add only new
entries when configure cluster and not while new connections are coming
in, remove connection when a node got fenced and cleanup all connection
when the dlm exits. It should work the same and even will show more
issues because we don't try to somehow keep those two data structures in
sync with the current cluster configuration.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch removes to allocate the dlm_local_addr[] pointers on the
heap. Instead we directly store the type of "struct sockaddr_storage".
This removes function deinit_local() because it was freeing memory only.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch removes save_listen_callbacks() and add_listen_sock() as they
are only used once in lowcomms functionality. For shutdown lowcomms it's
not necessary to whole flush the workqueues to synchronize with
restoring the old sk_data_ready() callback. Only the listen con receive
work need to be cancelled. For each individual node shutdown we should be
sure that last ack was been transmitted which is done by flushing per
connection swork worker.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Since commit 489d8e559c ("fs: dlm: add reliable connection if
reconnect") we have functionality like TCP offers for half-closed
sockets on dlm application protocol layer. This feature is required
because the cluster manager events about leaving resource memberships
can be locally already occurred but other cluster nodes having a pending
leaving membership over the cluster manager protocol happening. In this
time the local dlm node already shutdown it's connection and don't
transmit anymore any new dlm messages, but however it still needs to be
able to accept dlm messages because the pending leave membership request
of the cluster manager protocol which the dlm kernel implementation has
no control about it.
We have this functionality on the application for two reasons, the main
reason is that SCTP does not support such functionality on socket
layer. But we can do it inside application layer.
Another small issue is that this feature is broken in the TCP world
because some NAT devices does not implement such functionality
correctly. This is the same reason why the reliable connection session
layer in DLM exists. We give up on middle devices in the networking
which sends e.g. TCP resets out. In DLM we cannot have any message
dropping and we ensure it over a session layer that it can't happen.
Back to the half-closed grace shutdown handling. It's not necessary
anymore to do it on socket layer (which is only support for TCP sockets)
because we do it on application layer. This patch removes this handling,
if there are still issues then we have a problem on the application
layer for such handling.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch will switch from dlm_allow_conn to check if dlm lowcomms is
running or not to if we actually have a listen socket set or not. The
list socket will be set and unset in lowcomms start and shutdown
functionality. To synchronize with data_ready() callback we will set the
socket callback to NULL while socket lock is held.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Instead of check on list_empty() we can do the same with
list_first_entry_or_null() and return NULL if the returned value is NULL.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch removed a twice INIT_WORK() functionality. We already doing
this inside of dlm_lowcomms_init() functionality which is called only
once dlm is loaded.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch introduces leftovers of init, start, stop and exit
functionality. The dlm application layer should always call the midcomms
layer which getting aware of such event and redirect it to the lowcomms
layer. Some functionality which is currently handled inside the start
functionality of midcomms and lowcomms should be handled in the init
functionality as it only need to be initialized once when dlm is loaded.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
In DLM when we send a dlm message it is easy to add the lock resource
name, but additional lookup is required when to trace the receive
message side. The idea here is to move the lookup work to the user by
using a lookup to find the right send message with recv message. As note
DLM can't drop any message which is guaranteed by a special session
layer.
For doing the lookup a 3 tupel is required as an unique identification
which is dst nodeid, src nodeid and sequence number. This patch adds the
destination nodeid to the dlm message trace points. The source nodeid is
given by the h_nodeid field inside the header.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch renames DLM_IFL_NEED_SCHED to DLM_IFL_CB_PENDING because
CB_PENDING is a proper name to describe this flag. This flag is set when
callback enqueue will return DLM_ENQUEUE_CALLBACK_NEED_SCHED because the
callback worker need to be queued. The flag tells that callbacks are
currently pending to be called and will be unset if the callback work
for the specific lkb is done. The term need schedule is part of this
time but a proper name is to say that there are some callbacks pending
to being called.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch changes the ast hotpath functionality in very unlikely cases
that we do WARN_ON_ONCE() instead of WARN_ON() to not spamming the
console output if we run into states that it would occur over and over
again.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch will drop the lkb reference in an very unlikely case which
should in practice not happened. However if it happens we cleanup the
reference just in case.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
We can have dependencies between epoll and io_uring. Consider an epoll
context, identified by the epfd file descriptor, and an io_uring file
descriptor identified by iofd. If we add iofd to the epfd context, and
arm a multishot poll request for epfd with iofd, then the multishot
poll request will repeatedly trigger and generate events until terminated
by CQ ring overflow. This isn't a desired behavior.
Add EPOLL_URING so that io_uring can pass it in as part of the poll wakeup
key, and io_uring can check for that to detect a potential recursive
invocation.
Cc: stable@vger.kernel.org # 6.0
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Syzkaller reported BUG as follows:
BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:274
Call Trace:
<TASK>
dump_stack_lvl+0xcd/0x134
__might_resched.cold+0x222/0x26b
kmem_cache_alloc+0x2e7/0x3c0
update_qgroup_limit_item+0xe1/0x390
btrfs_qgroup_inherit+0x147b/0x1ee0
create_subvol+0x4eb/0x1710
btrfs_mksubvol+0xfe5/0x13f0
__btrfs_ioctl_snap_create+0x2b0/0x430
btrfs_ioctl_snap_create_v2+0x25a/0x520
btrfs_ioctl+0x2a1c/0x5ce0
__x64_sys_ioctl+0x193/0x200
do_syscall_64+0x35/0x80
Fix this by calling qgroup_dirty() on @dstqgroup, and update limit item in
btrfs_run_qgroups() later outside of the spinlock context.
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When trying to see if we can clone a file range, there are cases where we
end up sending two write operations in case the inode from the source root
has an i_size that is not sector size aligned and the length from the
current offset to its i_size is less than the remaining length we are
trying to clone.
Issuing two write operations when we could instead issue a single write
operation is not incorrect. However it is not optimal, specially if the
extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed
to the send ioctl. In that case we can end up sending an encoded write
with an offset that is not sector size aligned, which makes the receiver
fallback to decompressing the data and writing it using regular buffered
IO (so re-compressing the data in case the fs is mounted with compression
enabled), because encoded writes fail with -EINVAL when an offset is not
sector size aligned.
The following example, which triggered a bug in the receiver code for the
fallback logic of decompressing + regular buffer IO and is fixed by the
patchset referred in a Link at the bottom of this changelog, is an example
where we have the non-optimal behaviour due to an unaligned encoded write:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV > /dev/null
mount -o compress $DEV $MNT
# File foo has a size of 33K, not aligned to the sector size.
xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo
xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar
# Now clone the first 32K of file bar into foo at offset 0.
xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo
# Snapshot the default subvolume and create a full send stream (v2).
btrfs subvolume snapshot -r $MNT $MNT/snap
btrfs send --compressed-data -f /tmp/test.send $MNT/snap
echo -e "\nFile bar in the original filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
echo -e "\nReceiving stream in a new filesystem..."
btrfs receive -f /tmp/test.send $MNT
echo -e "\nFile bar in the new filesystem:"
od -A d -t x1 $MNT/snap/bar
umount $MNT
Before this patch, the send stream included one regular write and one
encoded write for file 'bar', with the later being not sector size aligned
and causing the receiver to fallback to decompression + buffered writes.
The output of the btrfs receive command in verbose mode (-vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
write bar - offset=32768 length=1024
encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0
encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument")
(...)
This patch avoids the regular write followed by an unaligned encoded write
so that we end up sending a single encoded write that is aligned. So after
this patch the stream content is (output of btrfs receive -vvv):
(...)
mkfile o258-7-0
rename o258-7-0 -> bar
utimes
clone bar - source=foo source offset=0 offset=0 length=32768
encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0
(...)
So we get more optimal behaviour and avoid the silent data loss bug in
versions of btrfs-progs affected by the bug referred by the Link tag
below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1).
Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
generation is an on-disk __le64 value, so use btrfs_super_generation to
convert it to host endian before comparing it.
Fixes: 12659251ca ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
->writepage is a very inefficient method to write back data, and only
used through write_cache_pages or as a fallback when no ->migrate_folio
method is present.
Set ->migrate_folio to the generic buffer_head based helper, and remove
the ->writepage implementation in extfat.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
->writepage is a very inefficient method to write back data, and only
used through write_cache_pages or as a fallback when no ->migrate_folio
method is present.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
We do not need to writeout modified directory blocks immediately when
modifying them while the page is locked. It is enough to do the flush
somewhat later which has the added benefit that inode times can be
flushed as well. It also allows us to stop depending on
write_one_page() function.
Signed-off-by: Jan Kara <jack@suse.cz>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmN6wAgeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG0EYH/3/RO90NbrFItraN
Lzr+d3VdbGjTu8xd1M+PRTmwh3zxLpB+Jwqr0T0A2gzL9B/D+AUPUJdrCVbv9DqS
FLJAVqoeV20dNBAHSffOOLPsgCZ+Eu+LzlNN7Iqde0e8cyZICFMNktitui84Xm/i
1NgFVgz9OZ6+aieYvUj3FrFq0p8GTIaC/oybDZrxYKcO8ZzKVMJ11swRw10wwq0g
qOOECvV3w7wlQ8upQZkzFxItKFc7EexZI6R4elXeGSJJ9Hlc092dv/zsKB9dwV+k
WcwkJrZRoezYXzgGBFxUcQtzi+ethjrPjuJuM1rYLUSIcfIW/0lkaSLgRoBu8D+I
1GfXkXs=
=gt6P
-----END PGP SIGNATURE-----
Merge 6.1-rc6 into char-misc-next
We need the char/misc fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This debug code dereferences "old_iface" after it was already freed by
the call to release_iface(). Re-order the debugging to avoid this
issue.
Fixes: b54034a73b ("cifs: during reconnect, update interface if necessary")
Cc: stable@vger.kernel.org # 5.19+
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmN4VVMACgkQiiy9cAdy
T1F3AQv+N3jzbBdJobg19mr0tVrpz0Itw8BaPTAi2SmY5UFCXg8AEhT9bVj1XoSC
lfTZJ7qhV6v1qu8HPyHY8sfW6Pna20FLsL8GyMD2dOU9NwfscecWJSeZX36OkaQX
8Q12cz+J6DgM2NJw0VW6NUCeY6DilbdYG23ZXO0i1NSL8rCMpr7mhK8j5SYgL3Bn
yTpd0DZ30fHdcAzt6bAvsekFPaqqbaBtSJl2Z7ibg5wiq/gt0WZ9eqUco9aMdfqW
HJ9xeoTDq/836ZtJiHNJ5li7/jEO6AYoImkhS3Cr5TXSqtgDdPWVKM7P7CTL42yV
xVcp3OQSiCV9tbGoqJkLMlOqpuM4f2oCkNHxj/bmblCV0QGty7zNet+s4yks2B0d
fnmzIwYRbHCV1+rPOiSTA7zhR9qQKIoPEw2tsuWm+Pw/LnraTvnc6Mf0TefK68O/
y/fEq2/zuvt4uHdBjDTUYqwzi9qYyglU/zga3FIRCG87WbOSUXSwVWIDaOgSrg9O
FX+KKdA3
=/bd9
-----END PGP SIGNATURE-----
Merge tag '6.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
- two missing and one incorrect return value checks
- fix leak on tlink mount failure
* tag '6.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: add check for returning value of SMB2_set_info_init
cifs: Fix wrong return value checking when GETFLAGS
cifs: add check for returning value of SMB2_close_init
cifs: Fix connections leak when tlink setup failed
If O_EXCL is *not* specified, then linkat() can be
used to link the temporary file into the filesystem.
If O_EXCL is specified then linkat() should fail (-1).
After commit 863f144f12 ("vfs: open inside ->tmpfile()")
the O_EXCL flag is no longer honored by the vfs layer for
tmpfile, which means the file can be linked even if O_EXCL
flag is specified, which is a change in behaviour for
userspace!
The open flags was previously passed as a parameter, so it
was uneffected by the changes to file->f_flags caused by
finish_open(). This patch fixes the issue by storing
file->f_flags in a local variable so the O_EXCL test
logic is restored.
This regression was detected by Android CTS Bionic fcntl()
tests running on android-mainline [1].
[1] https://android.googlesource.com/platform/bionic/+/
refs/heads/master/tests/fcntl_test.cpp#352
Fixes: 863f144f12 ("vfs: open inside ->tmpfile()")
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Tested-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Peter Griffin <peter.griffin@linaro.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Fix the IO error recovery path for failures happening in the last
zone of device, and that zone is a "runt" zone (smaller than the
other zone). The current code was failing to properly obtain a zone
report in that case.
- Remove the unused to_attr() function as it is unused, causing
compilation warnings with clang.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCY3gofwAKCRDdoc3SxdoY
dpUlAQDa1mHGQzT0Lrg7y+4HILKsX1dmzJ6wIWTKaFg1f68GNgD/cjZB/jp3UKEi
nmXAv/dGVx6ZBtIEybLDb522i1xccAg=
=gAqT
-----END PGP SIGNATURE-----
Merge tag 'zonefs-6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fixes from Damien Le Moal:
- Fix the IO error recovery path for failures happening in the last
zone of device, and that zone is a "runt" zone (smaller than the
other zone). The current code was failing to properly obtain a zone
report in that case.
- Remove the unused to_attr() function as it is unused, causing
compilation warnings with clang.
* tag 'zonefs-6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Remove to_attr() helper function
zonefs: fix zone report size in __zonefs_io_error()
The vfs_getxattr_alloc() function currently returns a ssize_t value
despite the fact that it only uses int values internally for return
values. Fix this by converting vfs_getxattr_alloc() to return an
int type and adjust the callers as necessary. As part of these
caller modifications, some of the callers are fixed to properly free
the xattr value buffer on both success and failure to ensure that
memory is not leaked in the failure case.
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
When squashfs_read_table() returns an error or `sb->s_magic !=
SQUASHFS_MAGIC`, enters the error branch and calls
msblk->thread_ops->destroy(msblk) to destroy msblk. However,
msblk->thread_ops has not been initialized. Therefore, the following
problem is triggered:
==================================================================
BUG: KASAN: null-ptr-deref in squashfs_fill_super+0xe7a/0x13b0
Read of size 8 at addr 0000000000000008 by task swapper/0/1
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3-next-20221031 #367
Call Trace:
<TASK>
dump_stack_lvl+0x73/0x9f
print_report+0x743/0x759
kasan_report+0xc0/0x120
__asan_load8+0xd3/0x140
squashfs_fill_super+0xe7a/0x13b0
get_tree_bdev+0x27b/0x450
squashfs_get_tree+0x19/0x30
vfs_get_tree+0x49/0x150
path_mount+0xaae/0x1350
init_mount+0xad/0x100
do_mount_root+0xbc/0x1d0
mount_block_root+0x173/0x316
mount_root+0x223/0x236
prepare_namespace+0x1eb/0x237
kernel_init_freeable+0x528/0x576
kernel_init+0x29/0x250
ret_from_fork+0x1f/0x30
</TASK>
==================================================================
To solve this issue, msblk->thread_ops is initialized immediately after
msblk is assigned a value.
Link: https://lkml.kernel.org/r/20221101073343.3961562-1-libaokun1@huawei.com
Fixes: b0645770d3c7 ("squashfs: add the mount parameter theads=<single|multi|percpu>")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Xiaoming Ni <nixiaoming@huawei.com>
Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The maximum number of threads in the decompressor_multi.c file is fixed
and cannot be adjusted according to user needs. Therefore, the mount
parameter needs to be added to allow users to configure the number of
threads as required. The upper limit is num_online_cpus() * 2.
Link: https://lkml.kernel.org/r/20221019030930.130456-3-nixiaoming@huawei.com
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Jianguo Chen <chenjianguo3@huawei.com>
Cc: Jubin Zhong <zhongjubin@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series 'squashfs: Add the mount parameter "threads="'.
Currently, Squashfs supports multiple decompressor parallel modes.
However, this mode can be configured only during kernel building and does
not support flexible selection during runtime.
In the current patch set, the mount parameter "threads=" is added to allow
users to select the parallel decompressor mode and configure the number of
decompressors when mounting a file system.
"threads=<single|multi|percpu|1|2|3|...>"
The upper limit is num_online_cpus() * 2.
This patch (of 2):
Squashfs supports three decompression concurrency modes:
Single-thread mode: concurrent reads are blocked and the memory
overhead is small.
Multi-thread mode/percpu mode: reduces concurrent read blocking but
increases memory overhead.
The corresponding schema must be fixed at compile time. During mounting,
the concurrent decompression mode cannot be adjusted based on file read
blocking.
The mount parameter theads=<single|multi|percpu> is added to select
the concurrent decompression mode of a single SquashFS file system
image.
Link: https://lkml.kernel.org/r/20221019030930.130456-1-nixiaoming@huawei.com
Link: https://lkml.kernel.org/r/20221019030930.130456-2-nixiaoming@huawei.com
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Jianguo Chen <chenjianguo3@huawei.com>
Cc: Jubin Zhong <zhongjubin@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If field s_log_block_size of superblock data is corrupted and too large,
init_nilfs() and load_nilfs() still can trigger a shift-out-of-bounds
warning followed by a kernel panic (if panic_on_warn is set):
shift exponent 38973 is too large for 32-bit type 'int'
Call Trace:
<TASK>
dump_stack_lvl+0xcd/0x134
ubsan_epilogue+0xb/0x50
__ubsan_handle_shift_out_of_bounds.cold.12+0x17b/0x1f5
init_nilfs.cold.11+0x18/0x1d [nilfs2]
nilfs_mount+0x9b5/0x12b0 [nilfs2]
...
This fixes the issue by adding and using a new helper function for getting
block size with sanity check.
Link: https://lkml.kernel.org/r/20221027044306.42774-3-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "nilfs2: fix UBSAN shift-out-of-bounds warnings on mount
time".
The first patch fixes a bug reported by syzbot, and the second one fixes
the remaining bug of the same kind. Although they are triggered by the
same super block data anomaly, I divided it into the above two because the
details of the issues and how to fix it are different.
Both are required to eliminate the shift-out-of-bounds issues at mount
time.
This patch (of 2):
If the block size exponent information written in an on-disk superblock is
corrupted, nilfs_sb2_bad_offset helper function can trigger
shift-out-of-bounds warning followed by a kernel panic (if panic_on_warn
is set):
shift exponent 38983 is too large for 64-bit type 'unsigned long long'
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106
ubsan_epilogue lib/ubsan.c:151 [inline]
__ubsan_handle_shift_out_of_bounds+0x33d/0x3b0 lib/ubsan.c:322
nilfs_sb2_bad_offset fs/nilfs2/the_nilfs.c:449 [inline]
nilfs_load_super_block+0xdf5/0xe00 fs/nilfs2/the_nilfs.c:523
init_nilfs+0xb7/0x7d0 fs/nilfs2/the_nilfs.c:577
nilfs_fill_super+0xb1/0x5d0 fs/nilfs2/super.c:1047
nilfs_mount+0x613/0x9b0 fs/nilfs2/super.c:1317
...
In addition, since nilfs_sb2_bad_offset() performs multiplication without
considering the upper bound, the computation may overflow if the disk
layout parameters are not normal.
This fixes these issues by inserting preliminary sanity checks for those
parameters and by converting the comparison from one involving
multiplication and left bit-shifting to one using division and right
bit-shifting.
Link: https://lkml.kernel.org/r/20221027044306.42774-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20221027044306.42774-2-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+e91619dd4c11c4960706@syzkaller.appspotmail.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Most /proc files don't have length (in fstat sense). This leads to
inefficiencies when reading such files with APIs commonly found in modern
programming languages. They open file, then fstat descriptor, get st_size
== 0 and either assume file is empty or start reading without knowing
target size.
cat(1) does OK because it uses large enough buffer by default. But naive
programs copy-pasted from SO aren't:
let mut f = std::fs::File::open("/proc/cmdline").unwrap();
let mut buf: Vec<u8> = Vec::new();
f.read_to_end(&mut buf).unwrap();
will result in
openat(AT_FDCWD, "/proc/cmdline", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0444, stx_size=0, ...}) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, "BOOT_IMAGE=(hd3,gpt2)/vmlinuz-5.", 32) = 32
read(3, "19.6-100.fc35.x86_64 root=/dev/m", 32) = 32
read(3, "apper/fedora_localhost--live-roo"..., 64) = 64
read(3, "ocalhost--live-swap rd.lvm.lv=fe"..., 128) = 116
read(3, "", 12)
open/stat is OK, lseek looks silly but there are 3 unnecessary reads
because Rust starts with 32 bytes per Vec<u8> and grows from there.
In case of /proc/cmdline, the length is known precisely.
Make variables readonly while I'm at it.
P.S.: I tried to scp /proc/cpuinfo today and got empty file
but this is separate story.
Link: https://lkml.kernel.org/r/YxoywlbM73JJN3r+@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Many monitoring tools include open file count as a metric. Currently the
only way to get this number is to enumerate the files in /proc/pid/fd.
The problem with the current approach is that it does many things people
generally don't care about when they need one number for a metric. In our
tests for cadvisor, which reports open file counts per cgroup, we observed
that reading the number of open files is slow. Out of 35.23% of CPU time
spent in `proc_readfd_common`, we see 29.43% spent in `proc_fill_cache`,
which is responsible for filling dentry info. Some of this extra time is
spinlock contention, but it's a contention for the lock we don't want to
take to begin with.
We considered putting the number of open files in /proc/pid/status.
Unfortunately, counting the number of fds involves iterating the
open_files bitmap, which has a linear complexity in proportion with the
number of open files (bitmap slots really, but it's close). We don't want
to make /proc/pid/status any slower, so instead we put this info in
/proc/pid/fd as a size member of the stat syscall result. Previously the
reported number was zero, so there's very little risk of breaking
anything, while still providing a somewhat logical way to count the open
files with a fallback if it's zero.
RFC for this patch included iterating open fds under RCU. Thanks to Frank
Hofmann for the suggestion to use the bitmap instead.
Previously:
```
$ sudo stat /proc/1/fd | head -n2
File: /proc/1/fd
Size: 0 Blocks: 0 IO Block: 1024 directory
```
With this patch:
```
$ sudo stat /proc/1/fd | head -n2
File: /proc/1/fd
Size: 65 Blocks: 0 IO Block: 1024 directory
```
Correctness check:
```
$ sudo ls /proc/1/fd | wc -l
65
```
I added the docs for /proc/<pid>/fd while I'm at it.
[ivan@cloudflare.com: use bitmap_weight() to count the bits]
Link: https://lkml.kernel.org/r/20221018045844.37697-1-ivan@cloudflare.com
[akpm@linux-foundation.org: include linux/bitmap.h for bitmap_weight()]
[ivan@cloudflare.com: return errno from proc_fd_getattr() instead of setting negative size]
Link: https://lkml.kernel.org/r/20221024173140.30673-1-ivan@cloudflare.com
Link: https://lkml.kernel.org/r/20220922224027.59266-1-ivan@cloudflare.com
Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Some minor cleanup patches resent".
The first three patches trivial clean up patches.
And for the patch "kexec: replace crash_mem_range with range", I got a
ibm-p9wr ppc64le system to test, it works well.
This patch (of 4):
elfcorehdr_alloc() allocates a memory chunk for elfcorehdr_addr with
kzalloc(). If is_vmcore_usable() returns false, elfcorehdr_addr is a
predefined value. If parse_crash_elf_headers() gets some error and
returns a negetive value, the elfcorehdr_addr should be released with
elfcorehdr_free().
Fix it by calling elfcorehdr_free() when parse_crash_elf_headers() fails.
Link: https://lkml.kernel.org/r/20220929042936.22012-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20220929042936.22012-2-bhe@redhat.com
Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chen Lifu <chenlifu@huawei.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Li Chen <lchen@ambarella.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: ye xingchen <ye.xingchen@zte.com.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use bitmap_zero/bitmap_copy/bitmap_qeual directly for bitmap operations.
Link: https://lkml.kernel.org/r/20221007124846.186453-3-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pass bits directly into fill_node_map helper and use bitmap API directly
to simplify code.
Link: https://lkml.kernel.org/r/20221007124846.186453-2-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use bitmap_zero/bitmap_copy/bitmap_equal directly for bitmap operations.
Link: https://lkml.kernel.org/r/20221007124846.186453-1-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Statistically, in a large deployment regular segfaults may indicate a CPU
issue.
Currently, it is not possible to find out what CPU the segfault happened
on. There are at least two attempts to improve segfault logging with this
regard, but they do not help in case the logs rotate.
Hence, lets make sure it is possible to permanently record a CPU the task
ran on using a new core_pattern specifier.
Link: https://lkml.kernel.org/r/20220903064330.20772-1-oleksandr@redhat.com
Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
Suggested-by: Renaud Métrich <rmetrich@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Grzegorz Halat <ghalat@redhat.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Joel Savitz <jsavitz@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Will Deacon <will@kernel.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Here are 2 small driver core fixes for 6.1-rc6:
- utsname fix, this one should already be in your tree as it
came from a different tree earlier.
- kernfs bugfix for a much reported syzbot report that seems to
keep getting triggered.
Both of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY3dPug8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynaxgCfbhh4z3ARWRl1L5H+vtnH+MeaLsoAnRBKQ2yv
GUXBXpiyz5TSzfGiP/Im
=80uW
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are two small driver core fixes for 6.1-rc6:
- utsname fix, this one should already be in your tree as it came
from a different tree earlier.
- kernfs bugfix for a much reported syzbot report that seems to keep
getting triggered.
Both of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: Fix spurious lockdep warning in kernfs_find_and_get_node_by_id()
kernel/utsname_sysctl.c: Add missing enum uts_proc value
These cases were done with this Coccinelle:
@@
expression H;
expression L;
@@
- (get_random_u32_below(H) + L)
+ get_random_u32_inclusive(L, H + L - 1)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- + E
- - E
)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- - E
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- - E
+ F
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- + E
+ F
- - E
)
And then subsequently cleaned up by hand, with several automatic cases
rejected if it didn't make sense contextually.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
These cases were done with this Coccinelle:
@@
expression E;
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (I > E);
+ I = get_random_u32_below(E + 1);
@@
expression E;
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (I >= E);
+ I = get_random_u32_below(E);
@@
expression E;
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (I < E);
+ I = get_random_u32_above(E - 1);
@@
expression E;
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (I <= E);
+ I = get_random_u32_above(E);
@@
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (!I);
+ I = get_random_u32_above(0);
@@
identifier I;
@@
- do {
... when != I
- I = get_random_u32();
... when != I
- } while (I == 0);
+ I = get_random_u32_above(0);
@@
expression E;
@@
- E + 1 + get_random_u32_below(U32_MAX - E)
+ get_random_u32_above(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
It does not appear that FOLL_FORCE should be needed for setting up the
stack pages. They are allocated using the nascent brpm->vma, which was
newly created with VM_STACK_FLAGS, which an arch can override, but they
all appear to include VM_WRITE | VM_MAYWRITE. Remove FOLL_FORCE.
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/lkml/202211171439.CDE720EAD@keescook/
Signed-off-by: Kees Cook <keescook@chromium.org>
vfs_lock_file, vfs_test_lock and vfs_cancel_lock all take both a struct
file argument and a file_lock. The file_lock has a fl_file field in it
howevever and it _must_ match the file passed in.
While most of the locks.c routines use the separately-passed file
argument, some filesystems rely on fl_file being filled out correctly.
I'm working on a patch series to remove the redundant argument from
these routines, but for now, let's ensure that the callers always set
this properly by issuing a WARN_ON_ONCE if they ever don't match.
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
As of now only device names are printed out over __xfs_printk().
The device names are not persistent across reboots which in case
of searching for origin of corruption brings another task to properly
identify the devices. This patch add XFS UUID upon every mount/umount
event which will make the identification much easier.
Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
[sandeen: rebase onto current upstream kernel]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When lazysbcount is enabled, fsstress and loop mount/unmount test report
the following problems:
XFS (loop0): SB summary counter sanity check failed
XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
xfs_sb block 0x0
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(..
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z
00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... ..........
00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................
00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................
00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................
XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
XFS (loop0): log mount/recovery failed: error -117
XFS (loop0): log mount failed
This corruption will shutdown the file system and the file system will
no longer be mountable. The following script can reproduce the problem,
but it may take a long time.
#!/bin/bash
device=/dev/sda
testdir=/mnt/test
round=0
function fail()
{
echo "$*"
exit 1
}
mkdir -p $testdir
while [ $round -lt 10000 ]
do
echo "******* round $round ********"
mkfs.xfs -f $device
mount $device $testdir || fail "mount failed!"
fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
sleep 4
killall -w fsstress
umount $testdir
xfs_repair -e $device > /dev/null
if [ $? -eq 2 ];then
echo "ERR CODE 2: Dirty log exception during repair."
exit 1
fi
round=$(($round+1))
done
With lazysbcount is enabled, There is no additional lock protection for
reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the
m_ifree, this will make the m_ifree greater than m_icount. For example,
consider the following sequence and ifreedelta is postive:
CPU0 CPU1
xfs_log_sb xfs_trans_unreserve_and_mod_sb
---------- ------------------------------
percpu_counter_sum(&mp->m_icount)
percpu_counter_add_batch(&mp->m_icount,
idelta, XFS_ICOUNT_BATCH)
percpu_counter_add(&mp->m_ifree, ifreedelta);
percpu_counter_sum(&mp->m_ifree)
After this, incorrect inode count (sb_ifree > sb_icount) will be writen to
the log. In the subsequent writing of sb, incorrect inode count (sb_ifree >
sb_icount) will fail to pass the boundary check in xfs_validate_sb_write()
that cause the file system shutdown.
When lazysbcount is enabled, we don't need to guarantee that Lazy sb
counters are completely correct, but we do need to guarantee that sb_ifree
<= sb_icount. On the other hand, the constraint that m_ifree <= m_icount
must be satisfied any time that there /cannot/ be other threads allocating
or freeing inode chunks. If the constraint is violated under these
circumstances, sb_i{count,free} (the ondisk superblock inode counters)
maybe incorrect and need to be marked sick at unmount, the count will
be rebuilt on the next mount.
Fixes: 8756a5af18 ("libxfs: add more bounds checking to sb sanity checks")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Clean up resources if resetting the dotdot entry doesn't succeed.
Observed through code inspection.
Fixes: 5838d0356b ("xfs: reset child dir '..' entry when unlinking child")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Metadata files (e.g. realtime bitmaps and quota files) do not show up in
the bulkstat output, which means that scrub-by-handle does not work;
they can only be checked through a specific scrub type. Therefore, each
scrub type calls xchk_metadata_inode_forks to check the metadata for
whatever's in the file.
Unfortunately, that function doesn't actually check the inode record
itself. Refactor the function a bit to make that happen.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
We can handle files that are exactly s_maxbytes bytes long; we just
can't handle anything larger than that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
CoW forks only exist in memory, which means that they can only ever have
an incore extent tree. Hence they must always be FMT_EXTENTS, so check
this when we're scrubbing them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Ensure that extents in an inode's CoW fork are not marked as shared in
the refcount btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Teach scrub to flag quota files containing unwritten extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Enhance the block map scrubber to check delayed allocation reservations.
Though there are no physical space allocations to check, we do need to
make sure that the range of file offsets being mapped are correct, and
to bump the lastoff cursor so that key order checking works correctly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When scrub is checking file fork mappings against rmap records and
the rmap record starts before or ends after the bmap record, check the
adjacent bmap records to make sure that they're adjacent to the one
we're checking. This helps us to detect cases where the rmaps cover
territory that the bmaps do not.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
sparse complains that we can return an uninitialized error from this
function and that pag could be uninitialized. We know that there are no
zero-AG filesystems and hence we had to call xchk_bmap_check_ag_rmaps at
least once, so this is not actually possible, but I'm too worn out from
automated complaints from unsophisticated AIs so let's just fix this and
move on to more interesting problems, eh?
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Teach the summary count checker to count the number of free realtime
extents and compare that to the superblock copy.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If any part of the per-AG summary counter scan loop aborts without
collecting all of the data we need, the scrubber's observation data will
be invalid. Set the incomplete flag so that we abort the scrub without
reporting false corruptions. Document the data dependency here too.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
xfs_rtalloc_query_range scans the realtime bitmap file in order of
increasing file offset, so this caller can take ILOCK_SHARED on the rt
bitmap inode instead of ILOCK_EXCL. This isn't going to yield any
practical benefits at mount time, but we'd like to make the locking
usage consistent around xfs_rtalloc_query_all calls. Make all the
places we do this use the same xfs_ilock lockflags for consistency.
Fixes: 4c934c7dd6 ("xfs: report realtime space information via the rtbitmap")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
It turns out that GETFSMAP and online fsck have had a bug for years due
to their use of ILOCK_SHARED to coordinate their linear scans of the
realtime bitmap. If the bitmap file's data fork happens to be in BTREE
format and the scan occurs immediately after mounting, the incore bmbt
will not be populated, leading to ASSERTs tripping over the incorrect
inode state. Because the bitmap scans always lock bitmap buffers in
increasing order of file offset, it is appropriate for these two callers
to take a shared ILOCK to improve scalability.
To fix this problem, load both data and attr fork state into memory when
mounting the realtime inodes. Realtime metadata files aren't supposed
to have an attr fork so the second step is likely a nop.
On most filesystems this is unlikely since the rtbitmap data fork is
usually in extents format, but it's possible to craft a filesystem that
will by fragmenting the free space in the data section and growfsing the
rt section.
Fixes: 4c934c7dd6 ("xfs: report realtime space information via the rtbitmap")
Also-Fixes: 46d9bfb5e7 ("xfs: cross-reference the realtime bitmap")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If we tried to repair something but the repair failed with -EDEADLOCK,
that means that the repair function couldn't grab some resource it
needed and wants us to try again. If we try again (with TRY_HARDER) but
still can't get all the resources we need, the repair fails and errors
remain on the filesystem.
Right now, repair returns the -EDEADLOCK to the caller as -EFSCORRUPTED,
which results in XFS_SCRUB_OFLAG_CORRUPT being passed out to userspace.
This is not correct because repair has not determined that anything is
corrupt. If the repair had been invoked on an object that could be
optimized but wasn't corrupt (OFLAG_PREEN), the inability to grab
resources will be reported to userspace as corrupt metadata, and users
will be unnecessarily alarmed that their suboptimal metadata turned into
a corruption.
Fix this by returning zero so that the results of the actual scrub will
be copied back out to userspace.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Repair functions will not return EAGAIN -- if they were not able to
obtain resources, they should return EDEADLOCK (like the rest of online
fsck) to signal that we need to grab all the resources and try again.
Hence we don't need to deal with this case except as a debugging
assertion.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If the scrub process is sent a fatal signal while we're checking dquots,
the predicate for this will set the error code to -EINTR. Don't then
squash that into -ECANCELED, because the wrong errno turns up in the
trace output.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If the program calling online fsck is terminated with a fatal signal,
bail out to userspace by returning EINTR, not EAGAIN. EAGAIN is used by
scrubbers to indicate that we should try again with more resources
locked, and not to indicate that the operation was cancelled. The
miswiring is mostly harmless, but it shows up in the trace data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Convert all the online scrub code to use the Linux slab allocator
functions directly instead of going through the kmem wrappers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Initialize the check_owner list head so that we don't corrupt the list.
Reduce the scope of the object pointer.
Fixes: 858333dcf0 ("xfs: check btree block ownership with bnobt/rmapbt when scrubbing btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Memory allocation usage is the same throughout online fsck -- we want
kernel memory, we have to be able to back out if we can't allocate
memory, and we don't want to spray dmesg with memory allocation failure
reports. Standardize the GFP flag usage and document these requirements.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Teach the AGFL repair function to check each block of the proposed AGFL
against the rmap btree. If the rmapbt finds any mappings that are not
OWN_AG, strike that block from the list.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Currently, the only way to lock an allocation group is to hold the AGI
and AGF buffers. If a repair needs to roll the transaction while
repairing some AG metadata, it maintains that lock by holding the two
buffers across the transaction roll and joins them afterwards.
However, repair is not like other parts of XFS that employ the bhold -
roll - bjoin sequence because it's possible that the AGI or AGF buffers
are not actually dirty before the roll. This presents two problems --
First, we need to redirty those buffers to keep them moving along in the
log to avoid pinning the log tail. Second, a clean buffer log item can
detach from the buffer. If this happens, the buffer type state is
discarded along with the bli and must be reattached before the next time
the buffer is logged. If it is not, the logging code will complain and
log recovery will not work properly.
An earlier version of this patch tried to fix the second problem by
re-setting the buffer type in the bli after joining the buffer to the
new transaction, but that looked weird and didn't solve the first
problem. Instead, solve both problems by logging the buffer before
rolling the transaction.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
While scrubbing an allocation group, we don't need to hold the AGFL
buffer as part of the scrub context. All that is necessary to lock an
AG is to hold the AGI and AGF buffers, so fix all the existing users of
the AGFL buffer to grab them only when necessary.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
While running the online fsck test suite, I noticed the following
assertion in the kernel log (edited for brevity):
XFS: Assertion failed: 0, file: fs/xfs/xfs_health.c, line: 571
------------[ cut here ]------------
WARNING: CPU: 3 PID: 11667 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
CPU: 3 PID: 11667 Comm: xfs_scrub Tainted: G W 5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:assfail+0x46/0x4a [xfs]
Call Trace:
<TASK>
xfs_dir2_isblock+0xcc/0xe0
xchk_directory_blocks+0xc7/0x420
xchk_directory+0x53/0xb0
xfs_scrub_metadata+0x2b6/0x6b0
xfs_scrubv_metadata+0x35e/0x4d0
xfs_ioc_scrubv_metadata+0x111/0x160
xfs_file_ioctl+0x4ec/0xef0
__x64_sys_ioctl+0x82/0xa0
do_syscall_64+0x2b/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This assertion triggers in xfs_dirattr_mark_sick when the caller passes
in a whichfork value that is neither of XFS_{DATA,ATTR}_FORK. The cause
of this is that xchk_directory_blocks only partially initializes the
xfs_da_args structure that is passed to xfs_dir2_isblock. If the data
fork is not correct, the XFS_IS_CORRUPT clause will trigger. My
development branch reports this failure to the health monitoring
subsystem, which accesses the uninitialized args->whichfork field,
leading the the assertion tripping. We really shouldn't be passing
random stack contents around, so the solution here is to force the
compiler to zero-initialize the struct.
Found by fuzzing u3.bmx[0].blockcount = middlebit on xfs/1554.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If the returning value of SMB2_set_info_init is an error-value,
exit the function.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 0967e54579 ("cifs: use a compound for setting an xattr")
Signed-off-by: Anastasia Belova <abelova@astralinux.ru>
Signed-off-by: Steve French <stfrench@microsoft.com>
to_attr() in zonefs sysfs code is unused, which it causes a warning when
compiling with clang and W=1. Delete it to prevent the warning.
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
When an IO error occurs, the function __zonefs_io_error() is used to
issue a zone report to obtain the latest zone information from the
device. This function gets a zone report for all zones used as storage
for a file, which is always 1 zone except for files representing
aggregated conventional zones.
The number of zones of a zone report for a file is calculated in
__zonefs_io_error() by doing a bit-shift of the inode i_zone_size field,
which is equal to or larger than the device zone size. However, this
calculation does not take into account that the last zone of a zoned
device may be smaller than the zone size reported by bdev_zone_sectors()
(which is used to set the bit shift size). As a result, if an error
occurs for an IO targetting such last smaller zone, the zone report will
ask for 0 zones, leading to an invalid zone report.
Fix this by using the fact that all files require a 1 zone report,
except if the inode i_zone_size field indicates a zone size larger than
the device zone size. This exception case corresponds to a mount with
aggregated conventional zones.
A check for this exception is added to the file inode initialization
during mount. If an invalid setup is detected, emit an error and fail
the mount (check contributed by Johannes Thumshirn).
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
The return value of CIFSGetExtAttr is negative, should be checked
with -EOPNOTSUPP rather than EOPNOTSUPP.
Fixes: 64a5cfa6db ("Allow setting per-file compression via SMB2/3")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmNzxK8ACgkQ+7dXa6fL
C2uN9g//bKgnlQpAbhEV7OQc4TqGatxHWzzOne77ceGjedyT1/IAFUJWIP8bAHSa
e8UlJ/4rLBraoD1JaaFBARprJcy6vv8iX7hFWGgBWWcM23V5FQqSwogzLBqKI3LI
Ig3w8159CHSnG0r9LImB7cPuZtE/3W+GxapnaUITeSSiirocW0q3/KafGn8QvFy3
/jYbRNKr6q7Ya6QBx8xKF6iI5jN0WmsqroUdebGOZSoLSq/45OaZoksFeVJT8FLR
wwlpdcQu3Ag/3wELIZNUSu39UMQE6ttcXehTybTYiAMSeZq71G8TooHGAxFzqdRK
peYoneBcG2l4GtZm6/SXRjBdtju3ZaL8PU9JPK2q3tUx6sQY8HG6w4bbDohA2BhM
upG1rC2bclZ6FJ+Q0FRI5y/YoOc+FsVydTZIaW/yd8DUKLYP0+2e7tM5xVTnqqBj
UXiTe8J+kSHRSmZPX+DEcdVVYiR770iyEYXNHBGIlFvcYh2bi76pW5dVmn5bEMSU
AV1R0rAnB9u6wTdVwDJUrDIVMOii2qNn6+XU908fida/T6P4d+9cTC8R8WAR454i
7mHY3saavl8arSasXtZyG3kffm3UNilFdrNxdITlk+K6lcjF4la9cFEhM7T3w73M
nJuaTLFI0iiT/321mSnTsCHB+HXN1OiV8H0v/Eani41HecCplsQ=
=VxQ6
-----END PGP SIGNATURE-----
AMerge tag 'netfs-fixes-20221115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfx fixes from David Howells:
"Two fixes, affecting the functions that iterates over the pagecache
unmarking or unlocking pages after an op is complete:
- xas_for_each() loops must call xas_retry() first thing and
immediately do a "continue" in the case that the extracted value is
a special value that indicates that the walk raced with a
modification. Fix the unlock and unmark loops to do this.
- The maths in the unlock loop is dodgy as it could, theoretically,
at some point in the future end up with a starting file pointer
that is in the middle of a folio. This will cause a subtraction to
go negative - but the number is unsigned. Fix the maths to use
absolute file positions instead of relative page indices"
* tag 'netfs-fixes-20221115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
netfs: Fix dodgy maths
netfs: Fix missing xas_retry() calls in xarray iteration
If the returning value of SMB2_close_init is an error-value,
exit the function.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 352d96f3ac ("cifs: multichannel: move channel selection above transport layer")
Signed-off-by: Anastasia Belova <abelova@astralinux.ru>
Signed-off-by: Steve French <stfrench@microsoft.com>
- Fix packed_inode invalid access when reading fragments on crafted
images;
- Add a missing erofs_put_metabuf() in an error path in fscache mode;
- Fix incorrect `count' for unmapped extents in fscache mode;
- Fix use-after-free of fsid and domain_id string when remounting;
- Fix missing xas_retry() in fscache mode.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCY3OcchEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBCzgAP92t7Lfu7gBuyhXfCJwJVFK0Iku8j9mhOiT
+C/RVB+9zQEAk/2vy3ULcGN5k6k2q7OgEzNxQ/jM3hVQnoK+sQzwVwA=
=XHmr
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"Most patches randomly fix error paths or corner cases in fscache mode
reported recently. One fixes an invalid access relating to fragments
on crafted images.
Summary:
- Fix packed_inode invalid access when reading fragments on crafted
images
- Add a missing erofs_put_metabuf() in an error path in fscache mode
- Fix incorrect `count' for unmapped extents in fscache mode
- Fix use-after-free of fsid and domain_id string when remounting
- Fix missing xas_retry() in fscache mode"
* tag 'erofs-for-6.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix missing xas_retry() in fscache mode
erofs: fix use-after-free of fsid and domain_id string
erofs: get correct count for unmapped range in fscache mode
erofs: put metabuf in error path in fscache mode
erofs: fix general protection fault when reading fragment
Fix the dodgy maths in netfs_rreq_unlock_folios(). start_page could be
inside the folio, in which case the calculation of pgpos will be come up
with a negative number (though for the moment rreq->start is rounded down
earlier and folios would have to get merged whilst locked)
Alter how this works to just frame the tracking in terms of absolute file
positions, rather than offsets from the start of the I/O request. This
simplifies the maths and makes it easier to follow.
Fix the issue by using folio_pos() and folio_size() to calculate the end
position of the page.
Fixes: 3d3c950467 ("netfs: Provide readahead and readpage netfs helpers")
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/Y2SJw7w1IsIik3nb@casper.infradead.org/
Link: https://lore.kernel.org/r/166757988611.950645.7626959069846893164.stgit@warthog.procyon.org.uk/ # v2
netfslib has a number of places in which it performs iteration of an xarray
whilst being under the RCU read lock. It *should* call xas_retry() as the
first thing inside of the loop and do "continue" if it returns true in case
the xarray walker passed out a special value indicating that the walk needs
to be redone from the root[*].
Fix this by adding the missing retry checks.
[*] I wonder if this should be done inside xas_find(), xas_next_node() and
suchlike, but I'm told that's not an simple change to effect.
This can cause an oops like that below. Note the faulting address - this
is an internal value (|0x2) returned from xarray.
BUG: kernel NULL pointer dereference, address: 0000000000000402
...
RIP: 0010:netfs_rreq_unlock+0xef/0x380 [netfs]
...
Call Trace:
netfs_rreq_assess+0xa6/0x240 [netfs]
netfs_readpage+0x173/0x3b0 [netfs]
? init_wait_var_entry+0x50/0x50
filemap_read_page+0x33/0xf0
filemap_get_pages+0x2f2/0x3f0
filemap_read+0xaa/0x320
? do_filp_open+0xb2/0x150
? rmqueue+0x3be/0xe10
ceph_read_iter+0x1fe/0x680 [ceph]
? new_sync_read+0x115/0x1a0
new_sync_read+0x115/0x1a0
vfs_read+0xf3/0x180
ksys_read+0x5f/0xe0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Changes:
========
ver #2)
- Changed an unsigned int to a size_t to reduce the likelihood of an
overflow as per Willy's suggestion.
- Added an additional patch to fix the maths.
Fixes: 3d3c950467 ("netfs: Provide readahead and readpage netfs helpers")
Reported-by: George Law <glaw@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/166749229733.107206.17482609105741691452.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166757987929.950645.12595273010425381286.stgit@warthog.procyon.org.uk/ # v2
btrfs_ioctl_get_subvol_info() frees the search path after the userspace
copy from the temp buffer @subvol_info. This can lead to a lock splat
warning.
Fix this by freeing the path before we copy it to userspace.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ioctl_ino_to_path() frees the search path after the userspace copy
from the temp buffer @ipath->fspath. Which potentially can lead to a lock
splat warning.
Fix this by freeing the path before we copy it to userspace.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ioctl_logical_to_ino() frees the search path after the userspace
copy from the temp buffer @inodes. Which potentially can lead to a lock
splat.
Fix this by freeing the path before we copy @inodes to userspace.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a nowait buffered write we can trigger the following assertion:
[11138.437027] assertion failed: !path->nowait, in fs/btrfs/ctree.c:4658
[11138.438251] ------------[ cut here ]------------
[11138.438254] kernel BUG at fs/btrfs/messages.c:259!
[11138.438762] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[11138.439450] CPU: 4 PID: 1091021 Comm: fsstress Not tainted 6.1.0-rc4-btrfs-next-128 #1
[11138.440611] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[11138.442553] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs]
[11138.443583] Code: 5b 41 5a 41 (...)
[11138.446437] RSP: 0018:ffffbaf0cf05b840 EFLAGS: 00010246
[11138.447235] RAX: 0000000000000039 RBX: ffffbaf0cf05b938 RCX: 0000000000000000
[11138.448303] RDX: 0000000000000000 RSI: ffffffffb2ef59f6 RDI: 00000000ffffffff
[11138.449370] RBP: ffff9165f581eb68 R08: 00000000ffffffff R09: 0000000000000001
[11138.450493] R10: ffff9167a88421f8 R11: 0000000000000000 R12: ffff9164981b1000
[11138.451661] R13: 000000008c8f1000 R14: ffff9164991d4000 R15: ffff9164981b1000
[11138.452225] FS: 00007f1438a66440(0000) GS:ffff9167ad600000(0000) knlGS:0000000000000000
[11138.452949] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11138.453394] CR2: 00007f1438a64000 CR3: 0000000100c36002 CR4: 0000000000370ee0
[11138.454057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11138.454879] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[11138.455779] Call Trace:
[11138.456211] <TASK>
[11138.456598] btrfs_next_old_leaf.cold+0x18/0x1d [btrfs]
[11138.457827] ? kmem_cache_alloc+0x18d/0x2a0
[11138.458516] btrfs_lookup_csums_range+0x149/0x4d0 [btrfs]
[11138.459407] csum_exist_in_range+0x56/0x110 [btrfs]
[11138.460271] can_nocow_file_extent+0x27c/0x310 [btrfs]
[11138.461155] can_nocow_extent+0x1ec/0x2e0 [btrfs]
[11138.461672] btrfs_check_nocow_lock+0x114/0x1c0 [btrfs]
[11138.462951] btrfs_buffered_write+0x44c/0x8e0 [btrfs]
[11138.463482] btrfs_do_write_iter+0x42b/0x5f0 [btrfs]
[11138.463982] ? lock_release+0x153/0x4a0
[11138.464347] io_write+0x11b/0x570
[11138.464660] ? lock_release+0x153/0x4a0
[11138.465213] ? lock_is_held_type+0xe8/0x140
[11138.466003] io_issue_sqe+0x63/0x4a0
[11138.466339] io_submit_sqes+0x238/0x770
[11138.466741] __do_sys_io_uring_enter+0x37b/0xb10
[11138.467206] ? lock_is_held_type+0xe8/0x140
[11138.467879] ? syscall_enter_from_user_mode+0x1d/0x50
[11138.468688] do_syscall_64+0x38/0x90
[11138.469265] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[11138.470017] RIP: 0033:0x7f1438c539e6
This is because to check if we can NOCOW, we check that if we can NOCOW
into an extent (it's prealloc extent or the inode has NOCOW attribute),
and then check if there are csums for the extent's range in the csum tree.
The search may leave us beyond the last slot of a leaf, and then when
we call btrfs_next_leaf() we end up at btrfs_next_old_leaf() with a
time_seq of 0.
This triggers a failure of the first assertion at btrfs_next_old_leaf(),
since we have a nowait path. With assertions disabled, we simply don't
respect the NOWAIT semantics, allowing the write to block on locks or
blocking on IO for reading an extent buffer from disk.
Fix this by:
1) Triggering the assertion only if time_seq is not 0, which means that
search is being done by a tree mod log user, and in the buffered and
direct IO write paths we don't use the tree mod log;
2) Implementing NOWAIT semantics at btrfs_next_old_leaf(). Any failure to
lock an extent buffer should return immediately and not retry the
search, as well as if we need to do IO to read an extent buffer from
disk.
Fixes: c922b016f3 ("btrfs: assert nowait mode is not used for some btree search functions")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't transform the logical block size to a bit shift only to shift it
back to the original block size. Just use the size.
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Added new function wnd_set_used_safe.
Load $BadClus before $AttrDef instead of before $Bitmap.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Added new functions index_hdr_check and index_buf_check.
Now we check all stuff for correctness while reading from disk.
Also fixed bug with stale nfs data.
Reported-by: van fantasy <g1042620637@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Added new function ntfs_check_for_free_space.
Added undo mechanism in attr_data_get_block.
Fixes xfstest generic/083
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
In some cases we can be in deadlock
because we tried to lock the same dir.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
There were 2 problems:
- in some cases we lost dirty flag;
- cluster allocation can be called even when it wasn't needed.
Fixes xfstest generic/465
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Remove ntfs_sparse_cluster.
Zero clusters in attr_allocate_clusters.
Fixes xfstest generic/263
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Simplify logic in ntfs_extend_initialized_size, ntfs_sparse_cluster
and ntfs_fallocate.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
The functions from bitops.h already have _le variants so use them to
prevent invalid reads/writes of the bitmap on big endian systems.
Signed-off-by: Thomas Kühnel <thomas.kuehnel@avm.de>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
__bitmap_set/__bitmap_clear only works with bitmaps in CPU order.
Define a variant of these functions in ntfs3 to handle modifying bitmaps
read from the filesystem.
Signed-off-by: Thomas Kühnel <thomas.kuehnel@avm.de>
Reviewed-by: Nicolas Schier <n.schier@avm.de>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
ni_fname_name called ntfs_cmp_names_cpu which assumes that the first
string is in CPU byte order and the second one in little endian.
In this case both strings are little endian so ntfs_cmp_names is the
correct function to call.
Signed-off-by: Thomas Kühnel <thomas.kuehnel@avm.de>
Reviewed-by: Nicolas Schier <n.schier@avm.de>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
When PAGE_SIZE is 64K, if read_log_page is called by log_read_rst for
the first time, the size of *buffer would be equal to
DefaultLogPageSize(4K).But for *buffer operations like memcpy,
if the memory area size(n) which being assigned to buffer is larger
than 4K (log->page_size(64K) or bytes(64K-page_off)), it will cause
an out of boundary error.
Call trace:
[...]
kasan_report+0x44/0x130
check_memory_region+0xf8/0x1a0
memcpy+0xc8/0x100
ntfs_read_run_nb+0x20c/0x460
read_log_page+0xd0/0x1f4
log_read_rst+0x110/0x75c
log_replay+0x1e8/0x4aa0
ntfs_loadlog_and_replay+0x290/0x2d0
ntfs_fill_super+0x508/0xec0
get_tree_bdev+0x1fc/0x34c
[...]
Fix this by setting variable r_page to NULL in log_read_rst.
Signed-off-by: Yin Xiujiang <yinxiujiang@kylinos.cn>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
The xarray iteration only holds the RCU read lock and thus may encounter
XA_RETRY_ENTRY if there's process modifying the xarray concurrently.
This will cause oops when referring to the invalid entry.
Fix this by adding the missing xas_retry(), which will make the
iteration wind back to the root node if XA_RETRY_ENTRY is encountered.
Fixes: d435d53228 ("erofs: change to use asynchronous io for fscache readpage/readahead")
Suggested-by: David Howells <dhowells@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Jia Zhu <zhujia.zj@bytedance.com>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221114121943.29987-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Now that the nfsd_fh_verify_err() tracepoint is always called on
error, it needs to handle cases where the filehandle is not yet
fully formed.
Fixes: 93c128e709 ("nfsd: ensure we always call fh_verify_error tracepoint")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
The request's r_session maybe changed when it was forwarded or
resent. Both the forwarding and resending cases the requests will
be protected by the mdsc->mutex.
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2137955
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When decoding the snaps fails it maybe leaving the 'first_realm'
and 'realm' pointing to the same snaprealm memory. And then it'll
put it twice and could cause random use-after-free, BUG_ON, etc
issues.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/57686
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The ceph_lookup_inode() function returns error pointers. It never
returns NULL.
Fixes: aa87052dd9 ("ceph: fix incorrectly showing the .snap size for stat")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In 4053d2500b ("orangefs: rework posix acl handling when creating new
filesystem objects") we tried to precalculate the correct mode when
creating a new inode. However, this leads to regressions when creating new
filesystem objects.
Even if we precalculate the mode we still need to call __orangefs_setattr()
to perform additional checks and we also need to update the mode of
ACL_TYPE_ACCESS acls set on the inode. The patch referenced above regressed
that. Restore that part of the old behavior and remove the mode
precalculation as it doesn't get us anything anymore.
Fixes: 4053d2500b ("orangefs: rework posix acl handling when creating new filesystem objects")
Reported-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
There were two patches which addressed the same bug and added the same
condition:
commit 6db620863f ("fs/ntfs3: Validate data run offset")
commit 887bfc5460 ("fs/ntfs3: Fix slab-out-of-bounds read in run_unpack")
Delete one condition.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
NTFS-3G provides the system.ntfs_attrib_be extended attribute, which
has the same value as system.ntfs_attrib but represented in big-endian.
Some utilities rely on the existence of this extended attribute.
Improves compatibility with NTFS-3G by adding the system.ntfs_attrib_be
extended attribute.
Signed-off-by: Daniel Pinto <danielpinto52@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
The hidedotfiles mount option provides the same functionality as
the NTFS-3G hide_dot_files mount option. As such, it should be
named the same for compatibility with NTGS-3G.
Rename the hidedotfiles to hide_dot_files for compatbility with
NTFS-3G.
Signed-off-by: Daniel Pinto <danielpinto52@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Currently, the ntfs3 driver does return the hidedotfiles mount
option in the list of enabled mount options. This can confuse
users who may doubt they enabled the option when not seeing in
the list provided by the mount command.
Add hidedotfiles mount option to the list of enabled options
provided by the mount command when it is enabled.
Signed-off-by: Daniel Pinto <danielpinto52@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Currently, the hidedotfiles mount option only has an effect when
creating new files. Removing or adding the starting dot when moving
or renaming files does not update the hidden attribute.
Make hidedotfiles also set or uset the hidden attribute when a file
gains or loses its starting dot by being moved or renamed.
Signed-off-by: Daniel Pinto <danielpinto52@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Currently, the hidedotfiles mount option is behaving in the reverse
way of what would be expected: enabling it disables setting the
hidden attribute on files or directories with names starting with a
dot and disabling it enables the setting.
Reverse the behaviour of the hidedotfiles mount option so it matches
what is expected.
Signed-off-by: Daniel Pinto <danielpinto52@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
When enabled, the windows_names mount option prevents the creation
of files or directories with names not allowed by Windows. Use
the same option name as NTFS-3G for compatibility.
Signed-off-by: Daniel Pinto <danielpinto52@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
'a == b ? 0 : 1' is logically equivalent to 'a != b'.
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>