Now that we have the ability to ask the log how far the tail needs to be
pushed to maintain its free space targets, augment the decision to relog
an intent item so that we only do it if the log has hit the 75% full
threshold. There's no point in relogging an intent into the same
checkpoint, and there's no need to relog if there's plenty of free space
in the log.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There's a subtle design flaw in the deferred log item code that can lead
to pinning the log tail. Taking up the defer ops chain examples from
the previous commit, we can get trapped in sequences like this:
Caller hands us a transaction t0 with D0-D3 attached. The defer ops
chain will look like the following if the transaction rolls succeed:
t1: D0(t0), D1(t0), D2(t0), D3(t0)
t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
t3: d5(t1), D1(t0), D2(t0), D3(t0)
...
t9: d9(t7), D3(t0)
t10: D3(t0)
t11: d10(t10), d11(t10)
t12: d11(t10)
In transaction 9, we finish d9 and try to roll to t10 while holding onto
an intent item for D3 that we logged in t0.
The previous commit changed the order in which we place new defer ops in
the defer ops processing chain to reduce the maximum chain length. Now
make xfs_defer_finish_noroll capable of relogging the entire chain
periodically so that we can always move the log tail forward. Most
chains will never get relogged, except for operations that generate very
long chains (large extents containing many blocks with different sharing
levels) or are on filesystems with small logs and a lot of ongoing
metadata updates.
Callers are now required to ensure that the transaction reservation is
large enough to handle logging done items and new intent items for the
maximum possible chain length. Most callers are careful to keep the
chain lengths low, so the overhead should be minimal.
The decision to relog an intent item is made based on whether the intent
was logged in a previous checkpoint, since there's no point in relogging
an intent into the same checkpoint.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
The defer ops code has been finishing items in the wrong order -- if a
top level defer op creates items A and B, and finishing item A creates
more defer ops A1 and A2, we'll put the new items on the end of the
chain and process them in the order A B A1 A2. This is kind of weird,
since it's convenient for programmers to be able to think of A and B as
an ordered sequence where all the sub-tasks for A must finish before we
move on to B, e.g. A A1 A2 D.
Right now, our log intent items are not so complex that this matters,
but this will become important for the atomic extent swapping patchset.
In order to maintain correct reference counting of extents, we have to
unmap and remap extents in that order, and we want to complete that work
before moving on to the next range that the user wants to swap. This
patch fixes defer ops to satsify that requirement.
The primary symptom of the incorrect order was noticed in an early
performance analysis of the atomic extent swap code. An astonishingly
large number of deferred work items accumulated when userspace requested
an atomic update of two very fragmented files. The cause of this was
traced to the same ordering bug in the inner loop of
xfs_defer_finish_noroll.
If the ->finish_item method of a deferred operation queues new deferred
operations, those new deferred ops are appended to the tail of the
pending work list. To illustrate, say that a caller creates a
transaction t0 with four deferred operations D0-D3. The first thing
defer ops does is roll the transaction to t1, leaving us with:
t1: D0(t0), D1(t0), D2(t0), D3(t0)
Let's say that finishing each of D0-D3 will create two new deferred ops.
After finish D0 and roll, we'll have the following chain:
t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)
d4 and d5 were logged to t1. Notice that while we're about to start
work on D1, we haven't actually completed all the work implied by D0
being finished. So far we've been careful (or lucky) to structure the
dfops callers such that D1 doesn't depend on d4 or d5 being finished,
but this is a potential logic bomb.
There's a second problem lurking. Let's see what happens as we finish
D1-D3:
t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
Let's say that d4-d11 are simple work items that don't queue any other
operations, which means that we can complete each d4 and roll to t6:
t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
...
t11: d10(t4), d11(t4)
t12: d11(t4)
<done>
When we try to roll to transaction #12, we're holding defer op d11,
which we logged way back in t4. This means that the tail of the log is
pinned at t4. If the log is very small or there are a lot of other
threads updating metadata, this means that we might have wrapped the log
and cannot get roll to t11 because there isn't enough space left before
we'd run into t4.
Let's shift back to the original failure. I mentioned before that I
discovered this flaw while developing the atomic file update code. In
that scenario, we have a defer op (D0) that finds a range of file blocks
to remap, creates a handful of new defer ops to do that, and then asks
to be continued with however much work remains.
So, D0 is the original swapext deferred op. The first thing defer ops
does is rolls to t1:
t1: D0(t0)
We try to finish D0, logging d1 and d2 in the process, but can't get all
the work done. We log a done item and a new intent item for the work
that D0 still has to do, and roll to t2:
t2: D0'(t1), d1(t1), d2(t1)
We roll and try to finish D0', but still can't get all the work done, so
we log a done item and a new intent item for it, requeue D0 a second
time, and roll to t3:
t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)
If it takes 48 more rolls to complete D0, then we'll finally dispense
with D0 in t50:
t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)
We then try to roll again to get a chain like this:
t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
...
t152: d102(t50)
<done>
Notice that in rolling to transaction #51, we're holding on to a log
intent item for d1 that was logged in transaction #1. This means that
the tail of the log is pinned at t1. If the log is very small or there
are a lot of other threads updating metadata, this means that we might
have wrapped the log and cannot roll to t51 because there isn't enough
space left before we'd run into t1. This is of course problem #2 again.
But notice the third problem with this scenario: we have 102 defer ops
tied to this transaction! Each of these items are backed by pinned
kernel memory, which means that we risk OOM if the chains get too long.
Yikes. Problem #1 is a subtle logic bomb that could hit someone in the
future; problem #2 applies (rarely) to the current upstream, and problem
#3 applies to work under development.
This is not how incremental deferred operations were supposed to work.
The dfops design of logging in the same transaction an intent-done item
and a new intent item for the work remaining was to make it so that we
only have to juggle enough deferred work items to finish that one small
piece of work. Deferred log item recovery will find that first
unfinished work item and restart it, no matter how many other intent
items might follow it in the log. Therefore, it's ok to put the new
intents at the start of the dfops chain.
For the first example, the chains look like this:
t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
t3: d5(t1), D1(t0), D2(t0), D3(t0)
...
t9: d9(t7), D3(t0)
t10: D3(t0)
t11: d10(t10), d11(t10)
t12: d11(t10)
For the second example, the chains look like this:
t1: D0(t0)
t2: d1(t1), d2(t1), D0'(t1)
t3: d2(t1), D0'(t1)
t4: D0'(t1)
t5: d1(t4), d2(t4), D0''(t4)
...
t148: D0<50 primes>(t147)
t149: d101(t148), d102(t148)
t150: d102(t148)
<done>
This actually sucks more for pinning the log tail (we try to roll to t10
while holding an intent item that was logged in t1) but we've solved
problem #1. We've also reduced the maximum chain length from:
sum(all the new items) + nr_original_items
to:
max(new items that each original item creates) + nr_original_items
This solves problem #3 by sharply reducing the number of defer ops that
can be attached to a transaction at any given time. The change makes
the problem of log tail pinning worse, but is improvement we need to
solve problem #2. Actually solving #2, however, is left to the next
patch.
Note that a subsequent analysis of some hard-to-trigger reflink and COW
livelocks on extremely fragmented filesystems (or systems running a lot
of IO threads) showed the same symptoms -- uncomfortably large numbers
of incore deferred work items and occasional stalls in the transaction
grant code while waiting for log reservations. I think this patch and
the next one will also solve these problems.
As originally written, the code used list_splice_tail_init instead of
list_splice_init, so change that, and leave a short comment explaining
our actions.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
In xfs_bui_item_recover, there exists a use-after-free bug with regards
to the inode that is involved in the bmap replay operation. If the
mapping operation does not complete, we call xfs_bmap_unmap_extent to
create a deferred op to finish the unmapping work, and we retain a
pointer to the incore inode.
Unfortunately, the very next thing we do is commit the transaction and
drop the inode. If reclaim tears down the inode before we try to finish
the defer ops, we dereference garbage and blow up. Therefore, create a
way to join inodes to the defer ops freezer so that we can maintain the
xfs_inode reference until we're done with the inode.
Note: This imposes the requirement that there be enough memory to keep
every incore inode in memory throughout recovery.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When xfs_defer_capture extracts the deferred ops and transaction state
from a transaction, it should record the transaction reservation type
from the old transaction so that when we continue the dfops chain, we
still use the same reservation parameters.
Doing this means that the log item recovery functions get to determine
the transaction reservation instead of abusing tr_itruncate in yet
another part of xfs.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
When xfs_defer_capture extracts the deferred ops and transaction state
from a transaction, it should record the remaining block reservations so
that when we continue the dfops chain, we can reserve the same number of
blocks to use. We capture the reservations for both data and realtime
volumes.
This adds the requirement that every log intent item recovery function
must be careful to reserve enough blocks to handle both itself and all
defer ops that it can queue. On the other hand, this enables us to do
away with the handwaving block estimation nonsense that was going on in
xlog_finish_defer_ops.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
When we replay unfinished intent items that have been recovered from the
log, it's possible that the replay will cause the creation of more
deferred work items. As outlined in commit 509955823c ("xfs: log
recovery should replay deferred ops in order"), later work items have an
implicit ordering dependency on earlier work items. Therefore, recovery
must replay the items (both recovered and created) in the same order
that they would have been during normal operation.
For log recovery, we enforce this ordering by using an empty transaction
to collect deferred ops that get created in the process of recovering a
log intent item to prevent them from being committed before the rest of
the recovered intent items. After we finish committing all the
recovered log items, we allocate a transaction with an enormous block
reservation, splice our huge list of created deferred ops into that
transaction, and commit it, thereby finishing all those ops.
This is /really/ hokey -- it's the one place in XFS where we allow
nested transactions; the splicing of the defer ops list is is inelegant
and has to be done twice per recovery function; and the broken way we
handle inode pointers and block reservations cause subtle use-after-free
and allocator problems that will be fixed by this patch and the two
patches after it.
Therefore, replace the hokey empty transaction with a structure designed
to capture each chain of deferred ops that are created as part of
recovering a single unfinished log intent. Finally, refactor the loop
that replays those chains to do so using one transaction per chain.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Remove this one-line helper since the assert is trivially true in one
call site and the rest obscures a bitmask operation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
During a code inspection, I found a serious bug in the log intent item
recovery code when an intent item cannot complete all the work and
decides to requeue itself to get that done. When this happens, the
item recovery creates a new incore deferred op representing the
remaining work and attaches it to the transaction that it allocated. At
the end of _item_recover, it moves the entire chain of deferred ops to
the dummy parent_tp that xlog_recover_process_intents passed to it, but
fail to log a new intent item for the remaining work before committing
the transaction for the single unit of work.
xlog_finish_defer_ops logs those new intent items once recovery has
finished dealing with the intent items that it recovered, but this isn't
sufficient. If the log is forced to disk after a recovered log item
decides to requeue itself and the system goes down before we call
xlog_finish_defer_ops, the second log recovery will never see the new
intent item and therefore has no idea that there was more work to do.
It will finish recovery leaving the filesystem in a corrupted state.
The same logic applies to /any/ deferred ops added during intent item
recovery, not just the one handling the remaining work.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
While QAing the new xfs_repair quotacheck code, I uncovered a quota
corruption bug resulting from a bad interaction between dquot buffer
initialization and quotacheck. The bug can be reproduced with the
following sequence:
# mkfs.xfs -f /dev/sdf
# mount /dev/sdf /opt -o usrquota
# su nobody -s /bin/bash -c 'touch /opt/barf'
# sync
# xfs_quota -x -c 'report -ahi' /opt
User quota on /opt (/dev/sdf)
Inodes
User ID Used Soft Hard Warn/Grace
---------- ---------------------------------
root 3 0 0 00 [------]
nobody 1 0 0 00 [------]
# xfs_io -x -c 'shutdown' /opt
# umount /opt
# mount /dev/sdf /opt -o usrquota
# touch /opt/man2
# xfs_quota -x -c 'report -ahi' /opt
User quota on /opt (/dev/sdf)
Inodes
User ID Used Soft Hard Warn/Grace
---------- ---------------------------------
root 1 0 0 00 [------]
nobody 1 0 0 00 [------]
# umount /opt
Notice how the initial quotacheck set the root dquot icount to 3
(rootino, rbmino, rsumino), but after shutdown -> remount -> recovery,
xfs_quota reports that the root dquot has only 1 icount. We haven't
deleted anything from the filesystem, which means that quota is now
under-counting. This behavior is not limited to icount or the root
dquot, but this is the shortest reproducer.
I traced the cause of this discrepancy to the way that we handle ondisk
dquot updates during quotacheck vs. regular fs activity. Normally, when
we allocate a disk block for a dquot, we log the buffer as a regular
(dquot) buffer. Subsequent updates to the dquots backed by that block
are done via separate dquot log item updates, which means that they
depend on the logged buffer update being written to disk before the
dquot items. Because individual dquots have their own LSN fields, that
initial dquot buffer must always be recovered.
However, the story changes for quotacheck, which can cause dquot block
allocations but persists the final dquot counter values via a delwri
list. Because recovery doesn't gate dquot buffer replay on an LSN, this
means that the initial dquot buffer can be replayed over the (newer)
contents that were delwritten at the end of quotacheck. In effect, this
re-initializes the dquot counters after they've been updated. If the
log does not contain any other dquot items to recover, the obsolete
dquot contents will not be corrected by log recovery.
Because quotacheck uses a transaction to log the setting of the CHKD
flags in the superblock, we skip quotacheck during the second mount
call, which allows the incorrect icount to remain.
Fix this by changing the ondisk dquot initialization function to use
ordered buffers to write out fresh dquot blocks if it detects that we're
running quotacheck. If the system goes down before quotacheck can
complete, the CHKD flags will not be set in the superblock and the next
mount will run quotacheck again, which can fix uninitialized dquot
buffers. This requires amending the defer code to maintaine ordered
buffer state across defer rolls for the sake of the dquot allocation
code.
For regular operations we preserve the current behavior since the dquot
items require properly initialized ondisk dquot records.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Given how XFS is all based around btrees it doesn't make much sense
to offer a totally generic state when we can just use the btree cursor.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All defer op instance place their own extension of the log item into
the dfp_done field. Replace that with a xfs_log_item to improve type
safety and make the code easier to follow.
Also use the opportunity to improve the ->finish_item calling conventions
to place the done log item as the higher level structure before the
list_entry used for the individual items.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Split out a helper that operates on a single xfs_defer_pending structure
to untangle the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
This avoids a per-item indirect call, and also simplifies the interface
a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
These are aways called together, and my merging them we reduce the amount
of indirect calls, improve type safety and in general clean up the code
a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Create a helper that encapsulates the whole logic to create a defer
intent. This reorders some of the work that was done, but none of
that has an affect on the operation as only fields that don't directly
interact are affected.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Since no caller is using KM_NOSLEEP and no callee branches on KM_SLEEP,
we can remove KM_NOSLEEP and replace KM_SLEEP with 0.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
There are many, many xfs header files which are included but
unneeded (or included twice) in the xfs code, so remove them.
nb: xfs_linux.h includes about 9 headers for everyone, so those
explicit includes get removed by this. I'm not sure what the
preference is, but if we wanted explicit includes everywhere,
a followup patch could remove those xfs_*.h includes from
xfs_linux.h and move them into the files that need them.
Or it could be left as-is.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
During testing of xfs/141 on a V4 filesystem, I observed some
inconsistent behavior with regards to resources that are held (i.e.
remain locked) across a defer roll. The transaction roll always gives
the defer roll function a new transaction, even if committing the old
transaction fails. However, the defer roll function only rejoins the
held resources if the transaction commit succeedied. This means that
callers of defer roll have to figure out whether the held resources are
attached to the transaction being passed back.
Worse yet, if the defer roll was part of a defer finish call, we have a
third possibility: the defer finish could pass back a dirty transaction
with dirty held resources and an error code.
The only sane way to handle all of these scenarios is to require that
the code that held the resource either cancel the transaction before
unlocking and releasing the resources, or use functions that detach
resources from a transaction properly (e.g. xfs_trans_brelse) if they
need to drop the reference before committing or cancelling the
transaction.
In order to make this so, change the defer roll code to join held
resources to the new transaction unconditionally and fix all the bhold
callers to release the held buffers correctly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
There's no need to bundle a pointer to the defer op type into the defer
op control structure. Instead, store the defer op type enum, which
enables us to shorten some of the lines.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Recently, we forgot to port a new defer op type to xfsprogs, which
caused us some userspace pain. Reorganize the way we make libxfs
clients supply defer op type information so that all type information
has to be provided at build time instead of risky runtime dynamic
configuration.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
struct xfs_defer_ops has now been reduced to a single list_head. The
external dfops mechanism is unused and thus everywhere a (permanent)
transaction is accessible the associated dfops structure is as well.
Remove the xfs_defer_ops structure and fold the list_head into the
transaction. Also remove the last remnant of external dfops in
xfs_trans_dup().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The majority of remaining references to struct xfs_defer_ops in XFS
are associated with xfs_defer_add(). At this point, there are no
more external xfs_defer_ops users left. All instances of
xfs_defer_ops are embedded in the transaction, which means we can
safely pass the transaction down to the dfops add interface.
Update xfs_defer_add() to receive the transaction as a parameter.
Various subsystems implement wrappers to allocate and construct the
context specific data structures for the associated deferred
operation type. Update these to also carry the transaction down as
needed and clean up unused dfops parameters along the way.
This removes most of the remaining references to struct
xfs_defer_ops throughout the code and facilitates removal of the
structure.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix unused variable warnings with ftrace disabled]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The xfs_defer_ops ->dop_pending list is used to track active
deferred operations once intents are logged. These items must be
aborted in the event of an error. The list is populated as intents
are logged and items are removed as they complete (or are aborted).
Now that xfs_defer_finish() cancels on error, there is no need to
ever access ->dop_pending outside of xfs_defer_finish(). The list is
only ever populated after xfs_defer_finish() begins and is either
completed or cancelled before it returns.
Remove ->dop_pending from xfs_defer_ops and replace it with a local
list in the xfs_defer_finish() path. Pass the local list to the
various helpers now that it is not accessible via dfops. Note that
we have to check for NULL in the abort case as the final tx roll
occurs outside of the scope of the new local list (once the dfops
has completed and thus drained the list).
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The current semantics of xfs_defer_finish() require the caller to
call xfs_defer_cancel() on error. This is slightly inconsistent with
transaction commit error handling where a failed commit cleans up
the transaction before returning.
More significantly, the only requirement for exposure of
->dop_pending outside of xfs_defer_finish() is so that
xfs_defer_cancel() can drain it on error. Since the only recourse of
xfs_defer_finish() errors is cancellation, mirror the transaction
logic and cancel remaining dfops before returning from
xfs_defer_finish() with an error.
Beside simplifying xfs_defer_finish() semantics, this ensures that
xfs_defer_finish() always returns with an empty ->dop_pending and
thus facilitates removal of the list from xfs_defer_ops.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dfops code still passes around the xfs_defer_ops pointer
superfluously in a few places. Clean this up wherever the
transaction will suffice.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dfops infrastructure ->finish_item() callback passes the
transaction and dfops as separate parameters. Since dfops is always
part of a transaction, the latter parameter is no longer necessary.
Remove it from the various callbacks.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Inodes that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While inodes are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for inodes with ili_lock_flags == 0.
Replace the xfs_defer_ijoin() infrastructure with such detection and
automatic relogging of held inodes. This eliminates the need for the
per-dfops inode list, replaced by an on-stack variant in
xfs_defer_trans_roll().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Buffers that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While buffers are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for held buffers.
Replace the xfs_defer_bjoin() infrastructure with such detection and
automatic relogging of held buffers. This eliminates the need for
the per-dfops buffer list, replaced by an on-stack variant in
xfs_defer_trans_roll().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dop_low field enables the low free space allocation mode when a
previous allocation has detected difficulty allocating blocks. It
has historically been part of the xfs_defer_ops structure, which
means if enabled, it remains enabled across a set of transactions
until the deferred operations have completed and the dfops is reset.
Now that the dfops is embedded in the transaction, we can save a bit
more space by using a transaction flag rather than a standalone
boolean. Drop the ->dop_low field and replace it with a transaction
flag that is set at the same points, carried across rolling
transactions and cleared on completion of deferred operations. This
essentially emulates the behavior of ->dop_low and so should not
change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
All callers pass ->t_dfops of the associated transactions. Refactor
the helpers to receive the transactions and facilitate further
cleanups between xfs_defer_ops and xfs_trans.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
With no more external dfops users, there is no need for an
xfs_defer_ops cancel wrapper.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Once xfs_defer_finish() has completed all deferred operations, it
checks the dirty state of the transaction and rolls it once more to
return a clean transaction for the caller. This primarily to cover
the case where repeated xfs_defer_finish() calls are made in a loop
and we need to make sure that the caller starts the next iteration
with a clean transaction. Otherwise we risk transaction reservation
overrun.
This final transaction roll is not required in the transaction
commit path, however, because the transaction is immediately
committed and freed after dfops completion. Refactor the final roll
into a separate helper such that we can avoid it in the transaction
commit path. Lift the dfops reset as well so dfops remains valid
until after the last call to xfs_defer_trans_roll(). The reset is
also unnecessary in the transaction commit path because the
transaction is about to complete.
This eliminates unnecessary regrants of transactions where the
associated transaction roll can be replaced by a transaction commit.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Every caller of xfs_defer_finish() now passes the transaction and
its associated ->t_dfops. The xfs_defer_ops parameter is therefore
no longer necessary and can be removed.
Since most xfs_defer_finish() callers also have to consider
xfs_defer_cancel() on error, update the latter to also receive the
transaction for consistency. The log recovery code contains an
outlier case that cancels a dfops directly without an available
transaction. Retain an internal wrapper to support this outlier case
for the time being.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The dfops structure used by multi-transaction operations is
typically stored on the stack and carried around by the associated
transaction. The lifecycle of dfops does not quite match that of the
transaction, but they are tightly related in that the former depends
on the latter.
The relationship of these objects is tight enough that we can avoid
the cumbersome boilerplate code required in most cases to manage
them separately by just embedding an xfs_defer_ops in the
transaction itself. This means that a transaction allocation returns
with an initialized dfops, a transaction commit finishes pending
deferred items before the tx commit, a transaction cancel cancels
the dfops before the transaction and a transaction dup operation
transfers the current dfops state to the new transaction.
The dup operation is slightly complicated by the fact that we can no
longer just copy a dfops pointer from the old transaction to the new
transaction. This is solved through a dfops move helper that
transfers the pending items and other dfops state across the
transactions. This also requires that transaction rolling code
always refer to the transaction for the current dfops reference.
Finally, to facilitate incremental conversion to the internal dfops
and continue to support the current external dfops mode of
operation, create the new ->t_dfops_internal field with a layer of
indirection. On allocation, ->t_dfops points to the internal dfops.
This state is overridden by callers who re-init a local dfops on the
transaction. Once ->t_dfops is overridden, the external dfops
reference is maintained as the transaction rolls.
This patch adds the fundamental ability to support an internal
dfops. All codepaths that perform deferred processing continue to
override the internal dfops until they are converted over in
subsequent patches.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_defer_init() is currently used in two particular situations. The
first and most obvious case is raw initialization of an
xfs_defer_ops struct. The other case is partial reinit of
xfs_defer_ops on reuse due to iteration.
Most instances of the first case will be replaced by a single init
of a dfops embedded in the transaction. Init calls are still
technically required for the second case because the dfops may have
low space mode enabled or have joined items that need to be reset
before the dfops should be reused.
Since the current dfops usage expects either a final transaction
commit after xfs_defer_finish() or xfs_defer_init() if dfops is to
be reused, we can shift some of the init logic into
xfs_defer_finish() such that the latter returns with a reinitialized
dfops. This eliminates the second dependency noted above such that a
dfops is immediately ready for reuse after an xfs_defer_finish()
without the need to change any calling code.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
dop_committed is set when deferred item processing rolls the
transaction at least once, but is only ever accessed in tracepoints.
The transaction roll/commit events are already available via
independent tracepoints, so remove the otherwise unused field.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
xfs_defer_finish() has a couple quirks that are not safe with
respect to the upcoming internal dfops functionality. First,
xfs_defer_finish() attaches the passed in dfops structure to
->t_dfops and caches and restores the original value. Second, it
continues to use the initial dfops reference before and after the
transaction roll.
These behaviors assume that dop is an independent memory allocation
from the transaction itself, which may not always be true once
transactions begin to use an embedded dfops structure. In the latter
model, dfops processing creates a new xfs_defer_ops structure with
each transaction and the associated state is migrated across to the
new transaction.
Fix up xfs_defer_finish() to handle the possibility of the current
dfops changing after a transaction roll. Since ->t_dfops is used
unconditionally in this path, it is no longer necessary to
attach/restore ->t_dfops and pass it explicitly down to
xfs_defer_trans_roll(). Update dop in the latter function and the
caller to ensure that it always refers to the current dfops
structure.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The following assertion was seen on generic/051:
XFS: Assertion failed: tp->t_firstblock == NULLFSBLOCK, file: fs/xfs/libxfs5
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
invalid opcode: 0000 [#1] SMP PTI
CPU: 2 PID: 20757 Comm: fsstress Not tainted 4.18.0-rc4+ #3969
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/4
RIP: 0010:assfail+0x23/0x30
Code: c3 66 0f 1f 44 00 00 48 89 f1 41 89 d0 48 c7 c6 88 e0 8c 82 48 89 fa
RSP: 0018:ffff88012dc43c08 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88012dc43ca0 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828480eb
RBP: ffff88012aa92758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: f000000000000000 R12: 0000000000000000
R13: ffff88012dc43d48 R14: ffff88013092e7e8 R15: 0000000000000014
FS: 00007f8d689b8e80(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8d689c7000 CR3: 000000012ba6a000 CR4: 00000000000006e0
Call Trace:
xfs_defer_init+0xff/0x160
xfs_reflink_remap_extent+0x31b/0xa00
xfs_reflink_remap_blocks+0xec/0x4a0
xfs_reflink_remap_range+0x3a1/0x650
xfs_file_dedupe_range+0x39/0x50
vfs_dedupe_file_range+0x218/0x260
do_vfs_ioctl+0x262/0x6a0
? __se_sys_newfstat+0x3c/0x60
ksys_ioctl+0x35/0x60
__x64_sys_ioctl+0x11/0x20
do_syscall_64+0x4b/0x190
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The root cause of the assertion failure is that xfs_defer_finish doesn't
roll the transaction after processing all the deferred items. Therefore
it returns a dirty transaction to the caller, which leaves the caller at
risk of exceeding the transaction reservation if it logs more items.
Brian Foster's patchset to move the defer_ops firstblock into the
transaction requires t_firstblock == NULLFSBLOCK upon defer_ops
initialization, which is how this was noticed at all.
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
All but one caller of xfs_defer_init() passes in the ->t_firstblock
of the associated transaction. The one outlier is
xlog_recover_process_intents(), which simply passes a dummy value
because a valid pointer is required. This firstblock variable can
simply be removed.
At this point we could remove the xfs_defer_init() firstblock
parameter and initialize ->t_firstblock directly. Even that is not
necessary, however, because ->t_firstblock is automatically
reinitialized in the new transaction on a transaction roll. Since
xfs_defer_init() should never occur more than once on a particular
transaction (since the corresponding finish will roll it), replace
the reinit from xfs_defer_init() with an assert that verifies the
transaction has a NULLFSBLOCK firstblock.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Most callers of xfs_defer_init() immediately attach the dfops
structure to a transaction. Add a transaction parameter to eliminate
much of this boilerplate code. This also helps self-document the
fact that many codepaths now expect a dfops pointer implicitly via
xfs_trans->t_dfops.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The ->t_agfl_dfops field is currently used to defer agfl block frees
from associated transaction contexts. While all known problematic
contexts have already been updated to use ->t_agfl_dfops, the
broader goal is defer agfl frees from all callers that already use a
deferred operations structure. Further, the transaction field
facilitates a good amount of code clean up where the transaction and
dfops have historically been passed down through the stack
separately.
Rename the field to something more generic to prepare to use it as
such throughout XFS. This patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Remove the verbose license text from XFS files and replace them
with SPDX tags. This does not change the license of any of the code,
merely refers to the common, up-to-date license files in LICENSES/
This change was mostly scripted. fs/xfs/Makefile and
fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
and modified by the following command:
for f in `git grep -l "GNU General" fs/xfs/` ; do
echo $f
cat $f | awk -f hdr.awk > $f.new
mv -f $f.new $f
done
And the hdr.awk script that did the modification (including
detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
is as follows:
$ cat hdr.awk
BEGIN {
hdr = 1.0
tag = "GPL-2.0"
str = ""
}
/^ \* This program is free software/ {
hdr = 2.0;
next
}
/any later version./ {
tag = "GPL-2.0+"
next
}
/^ \*\// {
if (hdr > 0.0) {
print "// SPDX-License-Identifier: " tag
print str
print $0
str=""
hdr = 0.0
next
}
print $0
next
}
/^ \* / {
if (hdr > 1.0)
next
if (hdr > 0.0) {
if (str != "")
str = str "\n"
str = str $0
next
}
print $0
next
}
/^ \*/ {
if (hdr > 0.0)
next
print $0
next
}
// {
if (hdr > 0.0) {
if (str != "")
str = str "\n"
str = str $0
next
}
print $0
}
END { }
$
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
So it's clear in the trace where they are being called from.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Now that AGFL block frees are deferred when dfops is set in the
transaction, start deferring AGFL block frees from contexts that are
known to push the limits of existing log reservations.
The first such context is deferred operation processing itself. This
primarily targets deferred extent frees (such as file extents and
inode chunks), but in doing so covers all allocation operations that
occur in deferred operation processing context.
Update xfs_defer_finish() to set and reset ->t_agfl_dfops across the
processing sequence. This means that any AGFL block frees due to
allocation events result in the addition of new EFIs to the dfops
rather than being processed immediately. xfs_defer_finish() rolls
the transaction at least once more to process the frees of the AGFL
blocks back to the allocation btrees and returns once the AGFL is
rectified.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In certain cases, defer_ops callers will lock a buffer and want to hold
the lock across transaction rolls. Similar to ijoined inodes, we want
to dirty & join the buffer with each transaction roll in defer_finish so
that afterwards the caller still owns the buffer lock and we haven't
inadvertently pinned the log.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
And instead require callers to explicitly join the inode using
xfs_defer_ijoin. Also consolidate the defer error handling in
a few places using a goto label.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Split xfs_trans_roll into a low-level helper that just rolls the
actual transaction and a new higher level xfs_trans_roll_inode
that takes care of logging and rejoining the inode. This gets
rid of the NULL inode case, and allows to simplify the special
cases in the deferred operation code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
If the deferred ops transaction roll fails, we need to abort the intent
items if we haven't already logged a done item for it, regardless of
whether or not the deferred ops has had a transaction committed. Dave
found this while running generic/388.
Move the tracepoint to make it easier to track object lifetimes.
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>