The filestreams allocator stores an xfs_fstrm_item structure in the MRU to
cache inode number to agno mappings for a particular length of time. Each
xfs_fstrm_item contains the internal MRU structure, an inode pointer and
agno value.
The inode pointer stored in the xfs_fstrm_item is not referenced, however,
which means the inode itself can be removed and reclaimed before the MRU
item is freed. If this occurs, xfs_fstrm_free_func() can access freed or
unrelated memory through xfs_fstrm_item->ip and crash.
The obvious solution is to grab an inode reference for xfs_fstrm_item.
The filestream mechanism only actually uses the inode pointer as a means
to access the xfs_mount, however. Rather than add unnecessary
complexity, simplify the implementation to store an xfs_mount pointer in
struct xfs_mru_cache, and pass it to the free callback. This also
requires updates to the tracepoint class to provide the associated data
via parameters rather than the inode and a minor hack to peek at the MRU
key to establish the inode number at free time.
Based on debugging work and an earlier patch from Brian Foster, who
also wrote most of this changelog.
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When an intent is aborted during it's initial commit through
xfs_defer_trans_abort(), there is a use after free. The current
report is for a RUI through this path in generic/388:
Freed by task 6274:
__kasan_slab_free+0x136/0x180
kmem_cache_free+0xe7/0x4b0
xfs_trans_free_items+0x198/0x2e0
__xfs_trans_commit+0x27f/0xcc0
xfs_trans_roll+0x17b/0x2a0
xfs_defer_trans_roll+0x6ad/0xe60
xfs_defer_finish+0x2a6/0x2140
xfs_alloc_file_space+0x53a/0xf90
xfs_file_fallocate+0x5c6/0xac0
vfs_fallocate+0x2f5/0x930
ioctl_preallocate+0x1dc/0x320
do_vfs_ioctl+0xfe4/0x1690
The problem is that the RUI has two active references - one in the
current transaction, and another held by the defer_ops structure
that is passed to the RUD (intent done) so that both the intent and
the intent done structures are freed on commit of the intent done.
Hence during abort, we need to release the intent item, because the
defer_ops reference is released separately via ->abort_intent
callback. Fix all the intent code to do this correctly.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_dir_ialloc() rolls the current transaction when allocation of a new
inode required the space manager to perform an allocation and replinish
the Inode btree.
None of the callers of xfs_dir_ialloc() need to know if the
transaction was committed. Hence this commit removes the "committed"
argument of xfs_dir_ialloc.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Today if we run xfs_fsr and crash[1], log replay can fail because
the recovery code tries to instantiate the donor inode from
disk to replay the swapext, but it's been deleted and we get
verifier failures when we try to read the inode off disk with
i_mode == 0.
This fixes both sides: We don't log the swapext change if the
inode has been deleted, and we don't try to recover it either.
[1] or if systemd doesn't cleanly unmount root, as it is wont
to do ...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Most of the generic data structures embedded in xfs_mount are
dynamically initialized immediately after mp is allocated. A few
fields are left out and initialized during the xfs_mountfs()
sequence, after mp has been attached to the superblock.
To clean this up and help prevent premature access of associated
fields, refactor xfs_mount allocation and all dependent init calls
into a new helper. This self-documents that all low level data
structures (i.e., locks, trees, etc.) should be initialized before
xfs_mount is attached to the superblock.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We can only get into the branch if CRCs are enabled, so there's no
need to check inside the branch for CRCs being enabled....
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
We recently came across a V4 filesystem causing memory corruption
due to a newly allocated inode being setup twice and being added to
the superblock inode list twice. From code inspection, the only way
this could happen is if a newly allocated inode was not marked as
free on disk (i.e. di_mode wasn't zero).
Running the metadump on an upstream debug kernel fails during inode
allocation like so:
XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inod=
e.c, line: 838
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:114!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/0=
1/2014
RIP: 0010:assfail+0x28/0x30
RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
FS: 00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000=
000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
Call Trace:
xfs_ialloc+0x383/0x570
xfs_dir_ialloc+0x6a/0x2a0
xfs_create+0x412/0x670
xfs_generic_create+0x1f7/0x2c0
? capable_wrt_inode_uidgid+0x3f/0x50
vfs_mkdir+0xfb/0x1b0
SyS_mkdir+0xcf/0xf0
do_syscall_64+0x73/0x1a0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
Extracting the inode number we crashed on from an event trace and
looking at it with xfs_db:
xfs_db> inode 184452204
xfs_db> p
core.magic = 0x494e
core.mode = 0100644
core.version = 2
core.format = 2 (extents)
core.nlinkv2 = 1
core.onlink = 0
.....
Confirms that it is not a free inode on disk. xfs_repair
also trips over this inode:
.....
zero length extent (off = 0, fsbno = 0) in ino 184452204
correcting nextents for inode 184452204
bad attribute fork in inode 184452204, would clear attr fork
bad nblocks 1 for inode 184452204, would reset to 0
bad anextents 1 for inode 184452204, would reset to 0
imap claims in-use inode 184452204 is free, would correct imap
would have cleared inode 184452204
.....
disconnected inode 184452204, would move to lost+found
And so we have a situation where the directory structure and the
inobt thinks the inode is free, but the inode on disk thinks it is
still in use. Where this corruption came from is not possible to
diagnose, but we can detect it and prevent the kernel from oopsing
on lookup. The reproducer now results in:
$ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
mkdir: cannot create directory =E2=80=98/mnt/scratch/00=E2=80=99: File ex=
ists
mkdir: cannot create directory =E2=80=98/mnt/scratch/01=E2=80=99: File ex=
ists
mkdir: cannot create directory =E2=80=98/mnt/scratch/03=E2=80=99: Structu=
re needs cleaning
mkdir: cannot create directory =E2=80=98/mnt/scratch/04=E2=80=99: Input/o=
utput error
mkdir: cannot create directory =E2=80=98/mnt/scratch/05=E2=80=99: Input/o=
utput error
....
And this corruption shutdown:
[ 54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not=
marked free on disk
[ 54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 =
of file fs/xfs/xfs_trans.c. Caller xfs_create+0x425/0x670
[ 54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #=
443
[ 54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO=
S 1.10.2-1 04/01/2014
[ 54.852859] Call Trace:
[ 54.853531] dump_stack+0x85/0xc5
[ 54.854385] xfs_trans_cancel+0x197/0x1c0
[ 54.855421] xfs_create+0x425/0x670
[ 54.856314] xfs_generic_create+0x1f7/0x2c0
[ 54.857390] ? capable_wrt_inode_uidgid+0x3f/0x50
[ 54.858586] vfs_mkdir+0xfb/0x1b0
[ 54.859458] SyS_mkdir+0xcf/0xf0
[ 54.860254] do_syscall_64+0x73/0x1a0
[ 54.861193] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 54.862492] RIP: 0033:0x7fb73bddf547
[ 54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000=
000000000053
[ 54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73=
bddf547
[ 54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffda=
a55449a
[ 54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a=
8670dd0
[ 54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 000000000=
00001ff
[ 54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffda=
a553500
[ 54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1=
024 of file fs/xfs/xfs_trans.c. Return address = ffffffff814cd050
[ 54.882790] XFS (loop0): Corruption of in-memory data detected. Shutt=
ing down filesystem
[ 54.884597] XFS (loop0): Please umount the filesystem and rectify the =
problem(s)
Note that this crash is only possible on v4 filesystemsi or v5
filesystems mounted with the ikeep mount option. For all other V5
filesystems, this problem cannot occur because we don't read inodes
we are allocating from disk - we simply overwrite them with the new
inode information.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In xfs_scrub_iallocbt_xref_rmap_inodes we're checking inodes against
rmap records, so we should use xfs_scrub_btree_xref_set_corrupt if we
encounter discrepancies here so that we know that it's a cross
referencing error, not necessarily a corruption in the inobt itself.
The userspace xfs_scrub program will try to repair outright corruptions
in the agi/inobt prior to phase 3 so that the inode scan will proceed.
If only a cross-referencing error is noted, the repair program defers
the repair attempt until it can check the other space metadata at least
once.
It is therefore essential that the inobt scrubber can correctly
distinguish between corruptions and "unable to cross-reference something
else with this inobt". The same reasoning applies to "xfs: record inode
buf errors as a xref error in inobt scrubber".
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
If a directory's parent inode pointer doesn't point to an inode, the
directory should be flagged as corrupt. Enable IGET_UNTRUSTED here so
that _iget will return -EINVAL if the inobt does not confirm that the
inode is present and allocated and we can flag the directory corruption.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When we're verifying inode buffers, sanity-check the unlinked pointer.
We don't want to run the risk of trying to purge something that's
obviously broken.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Extent size hint validation is used by scrub to decide if there's an
error, and it will be used by repair to decide to remove the hint.
Since these use the same validation functions, move them to libxfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
During the inode btree scrubs we try to confirm the freemask bits
against the inode records. If the inode buffer read fails, this is a
cross-referencing error, not a corruption of the inode btree itself.
Use the xref_process_error call here. Found via core.version middlebit
fuzz in xfs/415.
The userspace xfs_scrub program will try to repair outright corruptions
in the agi/inobt prior to phase 3 so that the inode scan will proceed.
If only a cross-referencing error is noted, the repair program defers
the repair attempt until it can check the other space metadata at least
once.
It is therefore essential that the inobt scrubber can correctly
distinguish between corruptions and "unable to cross-reference something
else with this inobt".
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Now that we no longer do raw inode buffer scrubbing, the bp parameter is
no longer used anywhere we're dealing with an inode, so remove it and
all the useless NULL parameters that go with it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The inode scrubber tries to _iget the inode prior to running checks.
If that _iget call fails with corruption errors that's an automatic
fail, regardless of whether it was the inode buffer read verifier,
the ifork verifier, or the ifork formatter that errored out.
Therefore, get rid of the raw mode scrub code because it's not needed.
Found by trying to fix some test failures in xfs/379 and xfs/415.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When we're scanning an extent mapping inode fork, ensure that every rmap
record for this ifork has a corresponding bmbt record too. This
(mostly) provides the ability to cross-reference rmap records with bmap
data. The rmap scrubber cannot do the xref on its own because that
requires taking an ilock with the agf lock held, which violates our
locking order rules (inode, then agf).
Note that we only do this for forks that are in btree format due to the
increased complexity; or forks that should have data but suspiciously
have zero extents because the inode could have just had its iforks
zapped by the inode repair code and now we need to reclaim the old
extents.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When the inode buffer verifier encounters an error, it's much more
helpful to print a buffer from the offending inode instead of just the
start of the inode chunk buffer.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Refactor some of the inode verifier failure logging call sites to use
the new xfs_inode_verifier_error method which dumps the offending buffer
as well as the code location of the failed check. This trims the
output, makes it clearer to the admin that repair must be run, and gives
the developers more details to work from.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Refactor the bmap validator into a more complete helper that looks for
extents that run off the end of the device, overflow into the next AG,
or have invalid flag states.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
In xfs_dir2_data_use_free, we examine on-disk metadata and ASSERT if
it doesn't make sense. Since a carefully crafted fuzzed image can cause
the kernel to crash after blowing a bunch of assertions, let's move
those checks into a validator function and rig everything up to return
EFSCORRUPTED to userspace. Found by lastbit fuzzing ltail.bestcount via
xfs/391.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.
This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.
Instead, avoid unnecessarily complex detection logic and and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency. The
generic approach also means that this patch can be safely backported
to kernels with or without a packed struct xfs_agfl.
Check the AGF for an invalid freelist count on initial read from
disk. If detected, set a flag on the xfs_perag to indicate that a
reset is required before the AGFL can be used. In the first
transaction that attempts to use a flagged AGFL, reset it to empty,
warn the user about the inconsistency and allow the freelist fixup
code to repopulate the AGFL with new blocks. The xfs_perag flag is
cleared to eliminate the need for repeated checks on each block
allocation operation.
This allows kernels that include the packing fix commit 96f859d52b
("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
to handle older unpacked AGFL formats without a filesystem crash.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead split out a __xfs_log_fore_lsn helper that gets called again
with the already_slept flag set to true in case we had to sleep.
This prepares for aio_fsync support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the the smallest possible loop as preable to find the correct iclog
buffer, and then use gotos for unwinding to straighten the code.
Also fix the top of function comment while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use xfs_iext_prev_extent to skip to the previous extent instead of
opencoding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Simplify the control flow a bit in preparation for O_ATOMIC-related
changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This helper doesn't add any real value over just calling iomap_zero_range
directly, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that we convert COW preallocations from unwritten to real on every
call this function needs to be called with the ilock held exclusively.
Fortunately we already do that, but update the assert to match.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There is no reason to get a mapping bigger than what we were asked for.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
i_cnextents does not include delayed allocated extents, so switch
to the inode fork size check that we already use in other places
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Streamline the conditionals so that it is more obvious which specific case
form the top of the function comments is being handled. Use gotos only
for early returns.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Switch to a single interface for flushing the log to a specific LSN, which
gives consistent trace point coverage and a less confusing interface.
The was only a single user of the previous xfs_log_force_lsn function,
which now also passes a NULL log_flushed argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Switch to a single interface for flushing the whole log, which gives
consistent trace point coverage, and removes the unused log_flushed
argument for the previous _xfs_log_force callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The function now does something, and that something is central to our
inode logging scheme.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The rmapbt perag metadata reservation reserves blocks for the
reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
the agfl and perag accounting is updated as blocks are allocated
from the allocation btrees, the reservation actually accounts blocks
as they are allocated to (or freed from) the agfl rather than the
rmapbt itself.
While this works for blocks that are eventually used for the rmapbt,
not all agfl blocks are destined for the rmapbt. Blocks that are
allocated to the agfl (and thus "reserved" for the rmapbt) but then
used by another structure leads to a growing inconsistency over time
between the runtime tracking of rmapbt usage vs. actual rmapbt
usage. Since the runtime tracking thinks all agfl blocks are rmapbt
blocks, it essentially believes that less future reservation is
required to satisfy the rmapbt than what is actually necessary.
The inconsistency is rectified across mount cycles because the perag
reservation is initialized based on the actual rmapbt usage at mount
time. The problem, however, is that the excessive drain of the
reservation at runtime opens a window to allocate blocks for other
purposes that might be required for the rmapbt on a subsequent
mount. This problem can be demonstrated by a simple test that runs
an allocation workload to consume agfl blocks over time and then
observe the difference in the agfl reservation requirement across an
unmount/mount cycle:
mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
...
... : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
As the above tracepoints show, the reservation requirement reduces
from 3194 blocks to 2956 blocks as the workload runs. Without any
other changes in the filesystem, the same reservation requirement
jumps from 2956 to 3052 blocks over a umount/mount cycle.
To address this divergence, update the RMAPBT reservation to account
blocks used for the rmapbt only rather than all blocks filled into
the agfl. This patch makes several high-level changes toward that
end:
1.) Reintroduce an AGFL reservation type to serve as an accounting
no-op for blocks allocated to (or freed from) the AGFL.
2.) Invoke RMAPBT usage accounting from the actual rmapbt block
allocation path rather than the AGFL allocation path.
The first change is required because agfl blocks are considered free
blocks throughout their lifetime. The perag reservation subsystem is
invoked unconditionally by the allocation subsystem, so we need a
way to tell the perag subsystem (via the allocation subsystem) to
not make any accounting changes for blocks filled into the AGFL.
The second change causes the in-core RMAPBT reservation usage
accounting to remain consistent with the on-disk state at all times
and eliminates the risk of leaving the rmapbt reservation
underfilled.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The AGFL perag reservation type accounts all allocations that feed
into (or are released from) the allocation group free list (agfl).
The purpose of the reservation is to support worst case conditions
for the reverse mapping btree (rmapbt). As such, the agfl
reservation usage accounting only considers rmapbt usage when the
in-core counters are initialized at mount time.
This implementation inconsistency leads to divergence of the in-core
and on-disk usage accounting over time. In preparation to resolve
this inconsistency and adjust the AGFL reservation into an rmapbt
specific reservation, rename the AGFL reservation type and
associated accounting fields to something more rmapbt-specific. Also
fix up a couple tracepoints that incorrectly use the AGFL
reservation type to pass the agfl state of the associated extent
where the raw reservation type is expected.
Note that this patch does not change perag reservation behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The extent swap mechanism requires a unique implementation for
rmapbt enabled filesystems. Because the rmapbt tracks extent owner
information, extent swap must individually unmap and remap each
extent between the two inodes.
The rmapbt extent swap transaction block reservation currently
accounts for the worst case bmapbt block and rmapbt block
consumption based on the extent count of each inode. There is a
corner case that exists due to the extent swap implementation that
is not covered by this reservation, however.
If one of the associated inodes is just over the max extent count
used for extent format inodes (i.e., the inode is in btree format by
a single extent), the unmap/remap cycle of the extent swap can
bounce the inode between extent and btree format multiple times,
almost as many times as there are extents in the inode (if the
opposing inode happens to have one less, for example). Each back and
forth cycle involves a block free and allocation, which isn't a
problem except for that the initial transaction reservation must
account for the total number of block allocations performed by the
chain of deferred operations. If not, a block reservation overrun
occurs and the filesystem shuts down.
Update the rmapbt extent swap block reservation to check for this
situation and add some block reservation slop to ensure the entire
operation succeeds. We'd never likely require reservation for both
inodes as fsr wouldn't defrag the file in that case, but the
additional reservation is constrained by the data fork size so be
cautious and check for both.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The ->t_blk_res_used field tracks how many blocks have been used in
the current transaction. This should never exceed the block
reservation (->t_blk_res) for a particular transaction. We currently
assert this condition in the transaction block accounting code, but
otherwise take no additional action should this situation occur.
The overrun generally has no effect if space ends up being available
and the associated transaction commits. If the transaction is
duplicated, however, the current block usage is used to determine
the remaining block reservation to be transferred to the new
transaction. If usage exceeds reservation, this calculation
underflows and creates a transaction with an invalid and excessive
reservation. When the second transaction commits, the release of
unused blocks corrupts the in-core free space counters. With lazy
superblock accounting enabled, this inconsistency eventually
trickles to the on-disk superblock and corrupts the filesystem.
Replace the transaction block usage accounting assert with an
explicit overrun check. If the transaction overruns the reservation,
shutdown the filesystem immediately to prevent corruption. Add a new
assert to xfs_trans_dup() to catch any callers that might induce
this invalid state in the future.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This is a simple rename, except that xa_ail becomes ail_head.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The AGFL size calculation is about to get more complex, so lets turn
the macro into a function first and remove the macro.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: forward port to newer kernel, simplify the helper]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There's no point in allocating a transaction and locking the inode in
preparation to clear cow blocks if there actually are any cow fork
extents. Therefore, move the xfs_reflink_cancel_cow_range hunk to
xfs_inactive and check the cow ifp first. This makes inode reclamation
run faster.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Yet another round of playing whack-a-mole with directory code that
asserts on corrupt on-disk metadata when it really should be returning
-EFSCORRUPTED instead of ASSERTing. Found by a xfs/391 crash while
lastbit fuzzing of ltail.bestcount.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
In xfs_qm_dqalloc, we join the locked quota inode to the transaction we
use to allocate blocks. If the allocation or mapping fails, we're not
allowed to unlock the inode because the transaction code is in charge of
unlocking it for us. Therefore, remove the iunlock call to avoid
blowing asserts about unbalanced locking + mount hang.
Found by corrupting the AGF and allocating space in the filesystem
(quotacheck) immediately after mount. The upcoming agfl wrapping fixup
test will trigger this scenario.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Due to an inverted logic mistake in xfs_buftarg_isolate()
the xfs_buffers with zero b_lru_ref will take another trip
around LRU, while isolating buffers with non-zero b_lru_ref.
Additionally those isolated buffers end up right back on the LRU
once they are released, because b_lru_ref remains elevated.
Fix that circuitous route by leaving them on the LRU
as originally intended.
Signed-off-by: Vratislav Bendel <vbendel@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
while holding pages locked for writeback in the ->writepages path.
The memory allocation is allowed to wait on pages under writeback,
and so can wait on pages that are tagged as writeback by the
caller.
This affects both pre-IO submission and post-IO submission paths.
Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
xfs_iomap_write_unwritten() already does the right thing, but the
others don't. Fix them.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Fixes: 281627df3e ("xfs: log file size updates at I/O completion time")
Fixes: 43caeb187d ("xfs: move mappings from cow fork to data fork after copy-write)"
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Use the VFS dirty inode tracking for lazytime inodes only, and just
log them in ->dirty_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The memcpy is guarded by a check which is performed a right before we
call xfs_log_dinode_to_disk. At this point we are sure this check will
always be false otherwise we would have errored out. So let's remove
this dead weight.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove unused legacy btree traces from IRIX era.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dmevmask structure member is a dmapi leftover; it's
set here and there but never actually used. Remove it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
When using large directory blocks, we regularly see memory
allocations of >64k being made for the shadow log vector buffer.
When we are under memory pressure, kmalloc() may not be able to find
contiguous memory chunks large enough to satisfy these allocations
easily, and if memory is fragmented we can potentially stall here.
TO avoid this problem, switch the log vector buffer allocation to
use kmem_alloc_large(). This will allow failed allocations to fall
back to vmalloc and so remove the dependency on large contiguous
regions of memory being available. This should prevent slowdowns
and potential stalls when memory is low and/or fragmented.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>