Have the CB_LAYOUTRECALL code treat NFS4_OK and NFS4ERR_DELAY returns
equivalently. Change the code to periodically resend CB_LAYOUTRECALLS
until the ls_layouts list is empty or the client returns a different
error code.
If we go for two lease periods without the list being emptied or the
client sending a hard error, then we give up and clean out the list
anyway.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If there are no nodes in extent tree, let's skip releasing step to avoid
any overhead of grabbing/releasing extent tree lock.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1. rename field in struct extent_tree from count to node_cnt for
readability.
2. alter to use atomic type for node_cnt.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to teach f2fs_fiemap to recognize encrypted data.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are duplicated code in between get_node_page and get_node_page_ra,
introduce __get_node_page to includes common parts of these two, and
export get_node_page and get_node_page_ra by reusing __get_node_page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add node id check in ra_node_page and get_node_page_ra like get_node_page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
...a more descriptive name and we can drop the double underscore prefix.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Right now, we just get WARN_ON_ONCE, which is not particularly helpful.
Have it dump some info about the locks and the inode to make it easier
to track down leaked locks in the future.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
...so we can print information about it if there are leaked locks.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Add some tracepoints around the POSIX locking code. These were useful
when tracking down problems when handling the race between setlk and
close.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
We don't clean out OFD locks on close(), so there's no need to check
for a race with them here. They'll get cleaned out at the same time
that flock locks are.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Dmitry reported that he was able to reproduce the WARN_ON_ONCE that
fires in locks_free_lock_context when the flc_posix list isn't empty.
The problem turns out to be that we're basically rebuilding the
file_lock from scratch in fcntl_setlk when we discover that the setlk
has raced with a close. If the l_whence field is SEEK_CUR or SEEK_END,
then we may end up with fl_start and fl_end values that differ from
when the lock was initially set, if the file position or length of the
file has changed in the interim.
Fix this by just reusing the same lock request structure, and simply
override fl_type value with F_UNLCK as appropriate. That ensures that
we really are unlocking the lock that was initially set.
While we're there, make sure that we do pop a WARN_ON_ONCE if the
removal ever fails. Also return -EBADF in this event, since that's
what we would have returned if the close had happened earlier.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Fixes: c293621bbf (stale POSIX lock handling)
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
For large sparse or fragmented files, checking every single entry in
the bmapbt on every operation is prohibitively expensive. Especially
as such checks rarely discover problems during normal operations on
high extent coutn files. Our regression tests don't tend to exercise
files with hundreds of thousands to millions of extents, so mostly
this isn't noticed.
However, trying to run things like xfs_mdrestore of large filesystem
dumps on a debug kernel quickly becomes impossible as the CPU is
completely burnt up repeatedly walking the sparse file bmapbt that
is generated for every allocation that is made.
Hence, if the file has more than 10,000 extents, just don't bother
with walking the tree to check it exhaustively. The btree code has
checks that ensure that the newly inserted/removed/modified record
is correctly ordered, so the entrie tree walk in thses cases has
limited additional value.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This allows us to see page cache driven readahead in action as it
passes through XFS. This helps to understand buffered read
throughput problems such as readahead IO IO sizes being too small
for the underlying device to reach max throughput.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
* bugfixes:
SUNRPC: Fixup socket wait for memory
SUNRPC: Fix a missing break in rpc_anyaddr()
pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
NFS: Fix attribute cache revalidation
NFS: Ensure we revalidate attributes before using execute_ok()
NFS: Flush reclaim writes using FLUSH_COND_STABLE
NFS: Background flush should not be low priority
NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn
NFSv4: Don't perform cached access checks before we've OPENed the file
NFS: Allow the combination pNFS and labeled NFS
NFS42: handle layoutstats stateid error
nfs: Fix race in __update_open_stateid()
nfs: fix missing assignment in nfs4_sequence_done tracepoint
The use of wait_on_atomic_t() for waiting on I/O to complete before
unlocking allows us to git rid of the NFS_IO_INPROGRESS flag, and thus the
nfs_iocounter's flags member, and finally the nfs_iocounter altogether.
The count of I/O is moved to the lock context, and the counter
increment/decrement functions become simple enough to open-code.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
[Trond: Fix up conflict with existing function nfs_wait_atomic_killable()]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The nlmsvc_binding structure is never modified, so declare it as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
make sure the isize we read doesn't change during the process.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There was a subtle bug on nat cache management which incurs wrong nid allocation
or wrong block addresses when try_to_free_nats is triggered heavily.
This patch enlarges the previous coverage of nat_tree_lock to avoid data race.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use to_delayed_work() instead of open-coding it.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
gendisk with part==0 is obviously gendisk->disk_name.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Number of fds is already known based on passed list.
No functional changes.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
XFS now uses CRC verification over a limited section of the log to
detect torn writes prior to a crash. This is difficult to test directly
due to the timing and hardware requirements to cause a short write.
Add a mechanism to inject CRC errors into log records to facilitate
testing torn write detection during log recovery. This mechanism is
dangerous and can result in filesystem corruption. Thus, it is only
available in DEBUG mode for testing/development purposes. Set a non-zero
value to the following sysfs entry to enable error injection:
/sys/fs/xfs/<dev>/log/log_badcrc_factor
Once enabled, XFS intentionally writes an invalid CRC to a log record at
some random point in the future based on the provided frequency. The
filesystem immediately shuts down once the record has been written to
the physical log to prevent metadata writeback (e.g., AIL insertion)
once the log write completes. This helps reasonably simulate a torn
write to the log as the affected record must be safe to discard. The
next mount after the intentional shutdown requires log recovery and
should detect and recover from the torn write.
Note again that this _will_ result in data loss or worse. For testing
and development purposes only!
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Certain types of storage, such as persistent memory, do not provide
sector atomicity for writes. This means that if a crash occurs while XFS
is writing log records, only part of those records might make it to the
storage. This is problematic because log recovery uses the cycle value
packed at the top of each log block to locate the head/tail of the log.
This can lead to CRC verification failures during log recovery and an
unmountable fs for a filesystem that is otherwise consistent.
Update log recovery to incorporate log record CRC verification as part
of the head/tail discovery process. Once the head is located via the
traditional algorithm, run a CRC-only pass over the records up to the
head of the log. If CRC verification fails, assume that the records are
torn as a matter of policy and trim the head block back to the start of
the first bad record.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
* pnfs_generic:
NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments
NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures
NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid()
NFSv4.1/pNFS: Fix a race in initiate_file_draining()
NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout
NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode
NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids
NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn()
NFS: Relax requirements in nfs_flush_incompatible
NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid
NFS: Allow multiple commit requests in flight per file
NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs
NFSv4: List stateid information in the callback tracepoints
NFSv4.1/pNFS: Don't return NFS4ERR_DELAY unnecessarily in CB_LAYOUTRECALL
NFSv4.1/pNFS: Ensure we enforce RFC5661 Section 12.5.5.2.1
pNFS: If we have to delay the layout callback, mark the layout for return
NFSv4.1/pNFS: Add a helper to mark the layout as returned
pNFS: Ensure nfs4_layoutget_prepare returns the correct error
Peng Tao points out that the call to pnfs_mark_matching_lsegs_return()
could race with pnfs_put_lseg(), in which case the layout segment is
cleared, but no layoutreturn will be sent.
Fix is to replace the call to pnfs_mark_matching_lsegs_invalid().
Reported-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fix a bug whereby if all the layout segments could be immediately freed,
the call to pnfs_error_mark_layout_for_return() would never result in
a layoutreturn.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If pnfs_mark_matching_lsegs_return() needs to mark a layout segment for
return, then it must also set the return iomode.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
get_zeroed_page does alloc_page and returns page_address of the result;
subsequent virt_to_page will recover the page, but since the caller
needs both page and its page_address() anyway, why bother going through
that wrapper at all?
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A _lot_ of ->write() instances were open-coding it; some are
converted to memdup_user_nul(), a lot more remain...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If a name contains at least some characters with Unicode values
exceeding single byte, the CS0 output should have 2 bytes per character.
And if other input characters have single byte Unicode values, then
the single input byte is converted to 2 output bytes, and the length
of output becomes larger than the length of input. And if the input
name is long enough, the output length may exceed the allocated buffer
length.
All this means that conversion from UTF8 or NLS to CS0 requires
checking of output length in order to stop when it exceeds the given
output buffer size.
[JK: Make code return -ENAMETOOLONG instead of silently truncating the
name]
CC: stable@vger.kernel.org
Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com>
Signed-off-by: Jan Kara <jack@suse.cz>
udf_CS0toUTF8 function stops the conversion when the output buffer
length reaches UDF_NAME_LEN-2, which is correct maximum name length,
but, when checking, it leaves the space for a single byte only,
while multi-bytes output characters can take more space, causing
buffer overflow.
Similar error exists in udf_CS0toNLS function, that restricts
the output length to UDF_NAME_LEN, while actual maximum allowed
length is UDF_NAME_LEN-2.
In these cases the output can override not only the current buffer
length field, causing corruption of the name buffer itself, but also
following allocation structures, causing kernel crash.
Adjust the output length checks in both functions to prevent buffer
overruns in case of multi-bytes UTF8 or NLS characters.
CC: stable@vger.kernel.org
Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com>
Signed-off-by: Jan Kara <jack@suse.cz>
ConfigFS lacked binary attributes up until now. This patch
introduces support for binary attributes in a somewhat similar
manner of sysfs binary attributes albeit with changes that
fit the configfs usage model.
Problems that configfs binary attributes fix are everything that
requires a binary blob as part of the configuration of a resource,
such as bitstream loading for FPGAs, DTBs for dynamically created
devices etc.
Look at Documentation/filesystems/configfs/configfs.txt for internals
and howto use them.
This patch is against linux-next as of today that contains
Christoph's configfs rework.
Signed-off-by: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
[hch: folded a fix from Geert Uytterhoeven <geert+renesas@glider.be>]
[hch: a few tiny updates based on review feedback]
Signed-off-by: Christoph Hellwig <hch@lst.de>
The qtree_fmt_operations structures are never modified, so declare them as
const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Jan Kara <jack@suse.cz>
A new warning has come up from a recent cleanup:
fs/udf/inode.c: In function 'udf_setup_indirect_aext':
fs/udf/inode.c:1927:28: warning: 'adsize' may be used uninitialized in this function [-Wmaybe-uninitialized]
If the alloc_type is neither ICBTAG_FLAG_AD_SHORT nor
ICBTAG_FLAG_AD_LONG, the value of adsize is undefined. Currently,
callers of these functions make sure alloc_type is one of the two valid
ones but for future proofing make sure we handle the case of invalid
alloc type as well. This changes the code to return -EIOin that case.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: fcea62babc ("udf: Factor out code for creating indirect extent")
Signed-off-by: Jan Kara <jack@suse.cz>
Introduce max_file_blocks in sbi to store max block index of file in f2fs,
it could be used to avoid unneeded calculation of max block index in
runtime.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix overflow of sbi->max_file_blocks]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Doing a splice read (generic/249) generates a lockdep splat because
we recursively lock the inode iolock in this path:
SyS_sendfile64
do_sendfile
do_splice_direct
splice_direct_to_actor
do_splice_to
xfs_file_splice_read <<<<<< lock here
default_file_splice_read
vfs_readv
do_readv_writev
do_iter_readv_writev
xfs_file_read_iter <<<<<< then here
The issue here is that for DAX inodes we need to avoid the page
cache path and hence simply push it into the normal read path.
Unfortunately, we can't tell down at xfs_file_read_iter() whether we
are being called from the splice path and hence we cannot avoid the
locking at this layer. Hence we simply have to drop the inode
locking at the higher splice layer for DAX.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Commit 1ca1915 ("xfs: Don't use unwritten extents for DAX") enabled
the DAX allocation call to dip into the reserve pool in case it was
converting unwritten extents rather than allocating blocks. This was
a direct copy of the unwritten extent conversion code, but had an
unintended side effect of allowing normal data block allocation to
use the reserve pool. Hence normal block allocation could deplete
the reserve pool and prevent unwritten extent conversion at ENOSPC,
hence violating fallocate guarantees on preallocated space.
Fix it by checking whether the incoming map from __xfs_get_blocks()
spans an unwritten extent and only use the reserve pool if the
allocation covers an unwritten extent.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The return type "unsigned long" was used by the suffix_kstrtoint()
function even though it will eventually return a negative error code.
Improve this implementation detail by using the type "int" instead.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Create xfs_btree_sblock_verify() to verify short-format btree blocks
(i.e. the per-AG btrees with 32-bit block pointers) instead of
open-coding them.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Because struct xfs_agfl is 36 bytes long and has a 64-bit integer
inside it, gcc will quietly round the structure size up to the nearest
64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE
macro returning incorrect results for v5 filesystems on 64-bit
machines (118 items instead of 119). As a result, a 32-bit xfs_repair
will see garbage in AGFL item 119 and complain.
Therefore, tell gcc not to pad the structure so that the AGFL size
calculation is correct.
cc: <stable@vger.kernel.org> # 3.10 - 4.4
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Use a convenience variable instead of open-coding the inode fork.
This isn't really needed for now, but will become important when we
add the copy-on-write fork later.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Update the log ticket reservation type printing code to reflect
all the types of log tickets, to avoid incorrect debug output and
avoid running off the end of the array.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Since xfs_repair wants to use xfs_alloc_fix_freelist, remove the
static designation. xfsprogs already has this; this simply brings
the kernel up to date.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
There are no callers of the xfs_buf_ioend_async() function outside
of the fs/xfs/xfs_buf.c. So, let's make it static.
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Linux's quota subsystem has an ability to handle project quota. This
commit just utilizes the ability from xfs side. dbus-monitor and
quota_nld shipped as part of quota-tools can be used for testing.
See the patch posting on the XFS list for details on testing.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In my earlier commit
c29aad4 xfs: pass mp to XFS_WANT_CORRUPTED_GOTO
I added some local mp variables with code which indicates that
mp might be NULL. Coverity doesn't like this now, because the
updated per-fs XFS_STATS macros dereference mp.
I don't think this is actually a problem; from what I can tell,
we cannot get to these functions with a null bma->tp, so my NULL
check was probably pointless. Still, it's not super obvious.
So switch this code to get mp from the inode on the xfs_bmalloca
structure, with no conditional, because the functions are already
using bmap->ip directly.
Addresses-Coverity-Id: 1339552
Addresses-Coverity-Id: 1339553
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This adds a name to each buf_ops structure, so that if
a verifier fails we can print the type of verifier that
failed it. Should be a slight debugging aid, I hope.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
If there is any non zero bit in a long bitmap, it can jump out of the
loop and finish the function as soon as possible.
Signed-off-by: Jia He <hejianet@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
As part of the head/tail discovery process, log recovery locates the
head block and then reverse seeks to find the start of the last active
record in the log. This is non-trivial as the record itself could have
wrapped around the end of the physical log. Log recovery torn write
detection potentially needs to walk further behind the last record in
the log, as multiple log I/Os can be in-flight at one time during a
crash event.
Therefore, refactor the reverse log record header search mechanism into
a new helper that supports the ability to seek past an arbitrary number
of log records (or until the tail is hit). Update the head/tail search
mechanism to call the new helper, but otherwise there is no change in
log recovery behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Log recovery torn write detection uses CRC verification over a range of
the active log to identify torn writes. Since the generic log recovery
pass code implements a superset of the functionality required for CRC
verification, it can be easily modified to support a CRC verification
only pass.
Create a new CRC pass type and update the log record processing helper
to skip everything beyond CRC verification when in this mode. This pass
will be invoked in subsequent patches to implement torn write detection.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Each log recovery pass walks from the tail block to the head block and
processes records appropriately based on the associated log pass type.
There are various failure conditions that can occur through this
sequence, such as I/O errors, CRC errors, etc. Log torn write detection
will perform CRC verification near the head of the log to detect torn
writes and trim torn records from the log appropriately.
As it is, xlog_do_recovery_pass() only returns an error code in the
event of CRC failure, which isn't enough information to trim the head of
the log. Update xlog_do_recovery_pass() to optionally return the start
block of the associated record when an error occurs. This patch contains
no functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Log record CRC verification currently occurs during active log recovery,
immediately before a log record is unpacked. Therefore, the CRC
calculation code is buried within the data unpack function. CRC
verification pass support only needs to go so far as check the CRC, but
this is not easily allowed as the code is currently organized.
Since we now have a new log record processing helper, pull the record
CRC verification code out from the unpack helper and open-code it at the
top of the new process helper. This facilitates the ability to modify
how records are processed based on the type of the current pass. This
patch contains no functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xlog_do_recovery_pass() duplicates a couple function calls related to
processing log records because the function must handle wrapping around
the end of the log if the head is behind the tail. This is implemented
as separate loops. CRC verification pass support will modify how records
are processed in both of these loops.
Rather than continue to duplicate code, factor the calls that process a
log record into a new helper and call that helper from both loops. This
patch contains no functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
XFS log records have separate fields for the record size and the iclog
size used to write the record. mkfs.xfs zeroes the log and writes an
unmount record to generate a clean log for the subsequent mount. The
userspace record logging code has a bug where the iclog size (h_size)
field of the log record is hardcoded to 32k, even if a log stripe unit
is specified. The log record length is correctly extended to the stripe
unit. Since the kernel log recovery code uses the h_size field to
determine the log buffer size, this means that the kernel can attempt to
read/process records larger than the buffer size and overrun the buffer.
This has historically not been a problem because the kernel doesn't
actually run through log recovery in the clean unmount case. Instead,
the kernel detects that a single unmount record exists between the head
and tail and pushes the tail forward such that the log is viewed as
clean (head == tail). Once CRC verification is enabled, however, all
records at the head of the log are verified for CRC errors and thus we
are susceptible to overrun problems if the iclog field is not correct.
While the core problem must be fixed in userspace, this is historical
behavior that must be detected in the kernel to avoid severe side
effects such as memory corruption and crashes. Update the log buffer
size calculation code to detect this condition, warn the user and resize
the log buffer based on the log stripe unit. Return a corruption error
in cases where this does not look like a clean filesystem (i.e., the log
record header indicates more than one operation).
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now that the VFS encapsulates the dedupe ioctl, wire up btrfs to it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
more systematic (FIDEDUPERANGE).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add missed CONFIG_F2FS_FS_XATTR for encrypted symlink inode in order
to avoid unneeded registry of ->{get,set,remove}xattr.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch removes refcount, and instead, adds zombie_list to shrink directly
without radix tree traverse.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes missing IPU condition when fdatasync is called.
With this patch, fdatasync is able to avoid additional node writes for recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If two processes share the same credentials and NFSv4 open stateid, then
allow them both to dirty the same page, even if their nfs_open_context
differs.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the layout segment is invalid, then we should not be adding more
write requests to the commit list. Instead, those writes should be
replayed after requesting a new layout.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
There is no report on this bug_on case, but if malicious attacker changed this
field intentionally, we can just reset it as a MAX value.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Allow synchronous RPC calls to wait for pending RPC calls to finish,
but also allow asynchronous ones to just fire off another commit.
With this patch, the xfstests generic/074 test completes in 226s
instead of 242s
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The flexfiles layout in particular, seems to want to poke around in the
O_DIRECT flags when retransmitting.
This patch sets up an interface to allow it to call back into O_DIRECT
to handle retransmission correctly. It also fixes a potential bug whereby
we could change the behaviour of O_DIRECT if an error is already pending.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
There are two rules to handle aborting volatile or atomic writes.
1. drop atomic writes
- we don't need to keep any stale db data.
2. write journal data
- we should keep the journal data with fsync for db recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If filesystem is readonly, leave user message info instead of recovering
inline dot inode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, we can get mismatched largest extent information.
One example is:
1. mount f2fs w/ extent_cache
2. make a small extent
3. umount
4. mount f2fs w/o extent_cache
5. update the largest extent
6. umount
7. mount f2fs w/ extent_cache
8. get the old extent made by #2
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use f2fs_sync_fs to clean up codes in f2fs_ioc_write_checkpoint.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: remove unused err variable]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds a max block check for get_data_block_bmap.
Trinity test program will send a block number as parameter into
ioctl_fibmap, which will be used in get_node_path(), when the block
number large than f2fs max blocks, it will trigger kernel bug.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
[Jaegeuk Kim: fix missing condition, pointed by Chao Yu]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
fix bugs:
1. len could be updated incorrectly when start+len is beyond isize.
2. If there is a hole consisting of more than two blocks, it could
fail to add FIEMAP_EXTENT_LAST flag for the last extent.
3. If there is an extent beyond isize, when we search extents in a range
that ends at isize, it will also return the extent beyond isize,
which is outside the range.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sometimes we keep dumb when IO error occur in lower layer device, so user
will not receive any error return value for some operation, but actually,
the operation did not succeed.
This sould be avoided, so this patch reports such kind of error to user.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
__recover_do_dentries will try to grab free space in storage, so fix to
add missing f2fs_balance_fs here.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If f2fs_write_begin is to update data, we can bypass calling f2fs_lock_op() in
order to avoid the checkpoint latency in the write syscall.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If get_node_page() gets zero nid, we can return early without getting a wrong
page. For example, get_dnode_of_data() can try to do that.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds prepare_write_begin to clean f2fs_write_begin.
The major role of this function is to convert any inline_data and allocate
or find block address.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If inline_data option is disable, when truncating an inline inode with
size which is not exceed maxinum inline size, we should not convert
inline inode to regular one to avoid the overhead of synchronizing
conversion.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
do_checkpoint and write_checkpoint can fail due to reasons like triggering
in a readonly fs or encountering IO error of storage device.
So it's better to report such error info to user, let user be aware of
failure of doing checkpoint.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If user tries to update or read data, we don't need to call f2fs_balance_fs
which triggers f2fs_gc, which increases unnecessary long latency.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Only cover sbi->cp_rwsem on one dnode page's allocation and modification
instead of multiple's in f2fs_map_blocks, it can reduce the covered region
of cp_rwsem, then we can avoid potential long time delay for concurrent
checkpointer.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces recording node block allocation in dnode_of_data.
This information helps to figure out whether any node block is allocated during
specific file operations.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs_balance_fs doesn't need to cover f2fs_new_inode or f2fs_find_entry
works.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can check inode's inline_data flag when calling to convert it.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is no candidates for shrinking slab entries, we don't need to traverse
any trees at all.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix missing initialization reported by Yunlei He]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Jeff reports seeing an Oops in ff_layout_alloc_lseg. Turns out
copy+paste has played cruel tricks on a nested loop.
Reported-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If a NFSv4 client uses the cache_consistency_bitmask in order to
request only information about the change attribute, timestamps and
size, then it has not revalidated all attributes, and hence the
attribute timeout timestamp should not be updated.
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We have found a BUG on res->migration_pending when migrating lock
resources. The situation is as follows.
dlm_mark_lockres_migration
res->migration_pending = 1;
__dlm_lockres_reserve_ast
dlm_lockres_release_ast returns with res->migration_pending remains
because other threads reserve asts
wait dlm_migration_can_proceed returns 1
>>>>>>> o2hb found that target goes down and remove target
from domain_map
dlm_migration_can_proceed returns 1
dlm_mark_lockres_migrating returns -ESHOTDOWN with
res->migration_pending still remains.
When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON
with res->migration_pending. So clear migration_pending when target is
down.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When resizing, it firstly extends the last gd. Once it should backup
super in the gd, it calculates new backup super and update the
corresponding value.
But it currently doesn't consider the situation that the backup super is
already done. And in this case, it still sets the bit in gd bitmap and
then decrease from bg_free_bits_count, which leads to a corrupted gd and
trigger the BUG in ocfs2_block_group_set_bits:
BUG_ON(le16_to_cpu(bg->bg_free_bits_count) < num_bits);
So check whether the backup super is done and then do the updates.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Donald Buczek reports that NFS clients can also report incorrect
results for access() due to lack of revalidation of attributes
before calling execute_ok().
Looking closely, it seems chdir() is afflicted with the same problem.
Fix is to ensure we call nfs_revalidate_inode_rcu() or
nfs_revalidate_inode() as appropriate before deciding to trust
execute_ok().
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Link: http://lkml.kernel.org/r/1451331530-3748-1-git-send-email-buczek@molgen.mpg.de
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* flexfiles:
pNFS/flexfiles: Ensure we record layoutstats even if RPC is terminated early
pNFS: Add flag to track if we've called nfs4_ff_layout_stat_io_start_read/write
pNFS/flexfiles: Fix a statistics gathering imbalance
pNFS/flexfiles: Don't mark the entire layout as failed, when returning it
pNFS/flexfiles: Don't prevent flexfiles client from retrying LAYOUTGET
pnfs/flexfiles: count io stat in rpc_count_stats callback
pnfs/flexfiles: do not mark delay-like status as DS failure
NFS41: map NFS4ERR_LAYOUTUNAVAILABLE to ENODATA
nfs: only remove page from mapping if launder_page fails
nfs: handle request add failure properly
nfs: centralize pgio error cleanup
nfs: clean up rest of reqs when failing to add one
NFS41: pop some layoutget errors to application
pNFS/flexfiles: Support server-supplied layoutstats sampling period
If the client is promising to return the layout ASAP, then there is no
need to return DELAY and have the server retry. Instead default to the
normal procedure described in RFC5661.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The RFC requires us to check if the server is recalling a stateid that we
haven't yet received. If so, tell it to wait.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the client needs to delay the layout callback, then speed up the recall
process by marking the remaining layout segments to be actively returned
by the client.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This ensures that we don't reuse the stateid if a layout return or
implied layout return means that we've returned all layout segments
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we're unable to perform the layoutget due to an invalid open stateid
or a bulk recall, ensure that we return the error so that the caller
can decide on an appropriate action.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently, we will only record the layoutstats correctly if the
RPC call successfully obtains a slot. If we exit before that
happens, then we may find ourselves starting the busy timer through
the call in ff_layout_(read|write)_prepare_layoutstats, but never stopping it.
The same thing happens if we're doing DA-DS.
The fix is to ensure that we catch these cases in the rpc_release()
callback.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we replay a failed read, write or commit to the dataserver, we
need to ensure that we call ff_layout_read_prepare_v3(),
ff_layout_write_prepare_v3 or ff_layout_commit_prepare_v3() so that we
reset the statistics.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In pNFS/flexfiles, we want to return the layout without necessarily marking
it as having completely failed. We therefore move the call to
pnfs_layout_io_set_failed() out of pnfs_error_mark_layout_for_return(),
and then ensura that pNFS/files layout calls it separately.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fix a bug in which flexfiles clients are falling back to I/O through the
MDS even when the FF_FLAGS_NO_IO_THRU_MDS flag is set.
The flexfiles client will always report errors through the LAYOUTRETURN
and/or LAYOUTERROR mechanisms, so it should normally be safe for it
to retry the LAYOUTGET until it fails or succeeds.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If client ever restarts IO due to some errors, we'll endup
mis-counting IO stats if we do the counting in .rpc_done
callback. Move it to .rpc_count_stats callback that is only
called when releasing RPC.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We just need to delay and retry in these cases.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of mapping it to EIO that is a fatal error and
fails application. We'll go inband after getting
NFS4ERR_LAYOUTUNAVAILABLE.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of dropping pages when write fails, only do it when
we get fatal failure in launder_page write back.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we fail to queue a read page to IO descriptor,
we need to clean it up otherwise it is hanging around
preventing nfs module from being removed.
When we fail to queue a write page to IO descriptor,
we need to clean it up and also save the failure status
to open context. Then at file close, we can try to write
pages back again and drop the page if it fails to writeback
in .launder_page, which will be done in the next patch.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In case we fail during setting things up for read/write IO, set
pg_error in IO descriptor and do the cleanup in nfs_pageio_add_request,
where we clean up all pages that are still hanging around on the IO
descriptor.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we fail to set up things before sending anything over wire,
we need to clean up the reqs that are still attached to the
IO descriptor.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
For ERESTARTSYS/EIO/EROFS/ENOSPC/E2BIG in layoutget, we
should just bail out instead of hiding the error and
retrying inband IO.
Change all the call sites to pop the error all the way up.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Some servers want to be able to control the frequency with which clients
report layoutstats, for instance, in order to monitor QoS for a particular
file or set of file. In order to support this, the flexfiles layout allows
the server to pass this info as a hint in the layout payload.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If there are already writes queued up for commit, then don't flush
just this page even if it is a reclaim issue.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Background flush is needed in order to satisfy the global page limits.
Don't subvert by reducing the priority.
This should also address a write starvation issue that was reported by
Neil Brown.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Since commit 2d8ae84fbc, nothing is bumping lo->plh_block_lgets in the
layoutreturn path, so it should not be touched in nfs4_layoutreturn_release
either.
Fixes: 2d8ae84fbc ("NFSv4.1/pnfs: Remove redundant lo->plh_block_lgets...")
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Donald Buczek reports that a nfs4 client incorrectly denies
execute access based on outdated file mode (missing 'x' bit).
After the mode on the server is 'fixed' (chmod +x) further execution
attempts continue to fail, because the nfs ACCESS call updates
the access parameter but not the mode parameter or the mode in
the inode.
The root cause is ultimately that the VFS is calling may_open()
before the NFS client has a chance to OPEN the file and hence revalidate
the access and attribute caches.
Al Viro suggests:
>>> Make nfs_permission() relax the checks when it sees MAY_OPEN, if you know
>>> that things will be caught by server anyway?
>>
>> That can work as long as we're guaranteed that everything that calls
>> inode_permission() with MAY_OPEN on a regular file will also follow up
>> with a vfs_open() or dentry_open() on success. Is this always the
>> case?
>
> 1) in do_tmpfile(), followed by do_dentry_open() (not reachable by NFS since
> it doesn't have ->tmpfile() instance anyway)
>
> 2) in atomic_open(), after the call of ->atomic_open() has succeeded.
>
> 3) in do_last(), followed on success by vfs_open()
>
> That's all. All calls of inode_permission() that get MAY_OPEN come from
> may_open(), and there's no other callers of that puppy.
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=109771
Link: http://lkml.kernel.org/r/1451046656-26319-1-git-send-email-buczek@molgen.mpg.de
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Allow LAYOUTRETURN and DELEGRETURN to use machine credentials if the
server supports it. Add request for OPEN_DOWNGRADE as the close path
also uses that.
Signed-off-by: Andrew Elble <aweits@rit.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch fixes the checkpatch.pl error to nfs4sysctl.c:
ERROR: do not initialise statics to 0
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>