When resending to MDS, we might resend multiple mirroring
requests to MDS. As a result, nfs_direct_good_bytes() ends
up counting bytes multiple times, causing application to
get wrong return results in read/write syscalls.
Fix it by tracking start of a dreq and checking the range of
pgio header.
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Use it to indicate that LD wants to retry layoutget. LD can set
it whenever it wants the common pnfs code to return and retry
pnfs path through a new layout.
The bit gets cleared when client does a new layoutget, when client
closes the file (ROC case), or when kernel needs to evict the inode
(non-ROC case).
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Otherwise we'll lose error tracking information when
encoding layoutreturn.
pnfs_put_lseg may be called from rpc callbacks. So we should not
call pnfs_send_layoutreturn directly because it can deadlock in
the rpc layer.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
When it is set, generic pnfs would try to send layoutreturn right
before last close/delegation_return regard less NFS_LAYOUT_ROC is
set or not. LD can then make sure layoutreturn is always sent
rather than being omitted.
The difference against NFS_LAYOUT_RETURN is that
NFS_LAYOUT_RETURN_BEFORE_CLOSE does not block usage of the layout so
LD can set it and expect generic layer to try pnfs path at the
same time.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
So that callers can specify which range to return.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
If current IO cannot be completed due to some transient errors,
LD may want to ask generic layer to resend the request through
pnfs again.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Let it return current nfs_pgio_mirror in use depending on pg_mirror_count.
For read, we always use pg_mirrors[0], so this effectively gives us freedom
to use pg_mirror_idx to track the actual mirror to read from through out the
IO stack.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
so that we don't reset desc->pg_mirror_idx for read unnecessarily.
Remove WARN_ON_ONCE from __nfs_pageio_add_request to allow LD to
set pg_mirror_idx for read where pg_mirror_count is always 1.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
So that we can detect the case if some layout segments are still
pinned which is surely a bug that we need to fix.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
This skips the WARN_ON_ONCE, but doesnt change behavior (the memcmp would
fail).
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
The current mirroring code only notices short writes to the first
mirror. This patch keeps per-mirror byte counts and only considers
a byte to be written once all mirrors report so.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
This patch adds mirrored write support to the pgio layer. The default
is to use one mirror, but pgio callers may define callbacks to change
this to any value up to the (arbitrarily selected) limit of 16.
The basic idea is to break out members of nfs_pageio_descriptor that cannot
be shared between mirrored DSes and put them in a new structure.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Pass ds_commit_idx through the nfs commit path. It's used to select
the commit bucket when using pnfs and is ignored when not using pnfs.
Several functions had to be changed: nfs_retry_commit,
nfs_mark_request_commit, pnfs_mark_request_commit and the pnfs layout
driver .mark_request_commit functions.
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
This is needed to support mirrored writes - the first write can't just
trash the lseg, we need to keep it around until all mirrors have
written.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Instead of calling layoutreturn directly, call pnfs_error_mark_layout_for_return
to mark layouts for return and let generic code return layout when
layout segments are freed.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Conflicts:
fs/nfs/filelayout/filelayout.c
So that pnfs path is not disabled for ever.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
If current lseg is the last lseg marked with NFS_LSEG_LAYOUTRETURN,
send layoutreturn.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
And if we are to return the same type of layouts, don't bother
sending more layoutgets.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
It marks all matching layout segments as NFS_LSEG_LAYOUTRETURN,
which is an indicator for pnfs_put_lseg() to send layoutreturn,
and also prevents pnfs_update_layout() from using the returning
segments. Once it is set, it never gets cleared.
It also sets proper io failure bit so that pnfs path can be retried
after PNFS_LAYOUTGET_RETRY_TIMEOUT second.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
It allows to specify different iomode to return.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
So that it is possible to return a specific iomode layouts.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Flexfiles layout would want to use them to report DS IO status.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Per RFC 5661 Errata 3208:
| A client MAY always forget its layout state and associated
| layout stateid at any time (See also section 12.5.5.1).
| In such case, the client MUST use a non-layout stateid for the next
| LAYOUTGET operation. This will signal the server that the client has
| no more layouts on the file and its respective layout state can be
| released before issuing a new layout in response to LAYOUTGET.
In order to make such a signal unique to server, client needs to serialize
all layoutgets using non-layout stateid. We implement this by serializing
layoutgets when client has no layout segments at hand.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Enable pNFS callbacks to allow flex files to work correctly with a
NFSv3-enabled data server.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
so that flexfile layout client can pass in DS credential instead of
using user cred, which will be done in the next patch.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
flexclient needs this as there is no nfs_server to DS connection.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
pnfs flexfile layout client may want to use NFSv3 ops rather
than the default MDS v4 ops.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
flexfile layout may need to set such when making DS connections.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
The flexfiles layout wants to create DS connection over NFSv3.
Add nfs3_set_ds_client to allow that to happen.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
They can be reused by flexfile layout as well.
Also add a code such that if read fails on one DS and
there are other DSes available to use, don't resend
through MDS but through pNFS so that client can read
from other DSes.
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
flexfile layout may use different auth flavor as specified by MDS.
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
It can be reused by flexfiles layout client.
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
It can be reused by flexfile layout.
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Also pull nfs4_pnfs_ds_addr and nfs4_pnfs_ds to generic pnfs.
They can all be reused by flexfile layout as well.
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
The flexfilelayout driver will share some common code
with the filelayout driver. This set of changes refactors
that common code out to avoid any module depenencies.
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
The unload_nls() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
For now just a few simple events to trace the layout stateid lifetime, but
these already were enough to find several bugs in the Linux client layout
stateid handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add support to issue layout recalls to clients. For now we only support
full-file recalls to get a simple and stable implementation. This allows
to embedd a nfsd4_callback structure in the layout_state and thus avoid
any memory allocations under spinlocks during a recall. For normal
use cases that do not intent to share a single file between multiple
clients this implementation is fully sufficient.
To ensure layouts are recalled on local filesystem access each layout
state registers a new FL_LAYOUT lease with the kernel file locking code,
which filesystems that support pNFS exports that require recalls need
to break on conflicting access patterns.
The XDR code is based on the old pNFS server implementation by
Andy Adamson, Benny Halevy, Boaz Harrosh, Dean Hildebrand, Fred Isaman,
Marc Eshel, Mike Sager and Ricardo Labiaga.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add support for the GETDEVICEINFO, LAYOUTGET, LAYOUTCOMMIT and
LAYOUTRETURN NFSv4.1 operations, as well as backing code to manage
outstanding layouts and devices.
Layout management is very straight forward, with a nfs4_layout_stateid
structure that extends nfs4_stid to manage layout stateids as the
top-level structure. It is linked into the nfs4_file and nfs4_client
structures like the other stateids, and contains a linked list of
layouts that hang of the stateid. The actual layout operations are
implemented in layout drivers that are not part of this commit, but
will be added later.
The worst part of this commit is the management of the pNFS device IDs,
which suffers from a specification that is not sanely implementable due
to the fact that the device-IDs are global and not bound to an export,
and have a small enough size so that we can't store the fsid portion of
a file handle, and must never be reused. As we still do need perform all
export authentication and validation checks on a device ID passed to
GETDEVICEINFO we are caught between a rock and a hard place. To work
around this issue we add a new hash that maps from a 64-bit integer to a
fsid so that we can look up the export to authenticate against it,
a 32-bit integer as a generation that we can bump when changing the device,
and a currently unused 32-bit integer that could be used in the future
to handle more than a single device per export. Entries in this hash
table are never deleted as we can't reuse the ids anyway, and would have
a severe lifetime problem anyway as Linux export structures are temporary
structures that can go away under load.
Parts of the XDR data, structures and marshaling/unmarshaling code, as
well as many concepts are derived from the old pNFS server implementation
from Andy Adamson, Benny Halevy, Dean Hildebrand, Marc Eshel, Fred Isaman,
Mike Sager, Ricardo Labiaga and many others.
Signed-off-by: Christoph Hellwig <hch@lst.de>
The pnfs code will need it too. Also remove the nfsd_ prefix to match the
other filehandle helpers in that file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
This (ab-)uses the file locking code to allow filesystems to recall
outstanding pNFS layouts on a file. This new lease type is similar but
not quite the same as FL_DELEG. A FL_LAYOUT lease can always be granted,
an a per-filesystem lock (XFS iolock for the initial implementation)
ensures not FL_LAYOUT leases granted when we would need to recall them.
Also included are changes that allow multiple outstanding read
leases of different types on the same file as long as they have a
differnt owner. This wasn't a problem until now as nfsd never set
FL_LEASE leases, and no one else used FL_DELEG leases, but given that
nfsd will also issues FL_LAYOUT leases we will have to handle it now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Just like for other lock types we should allow different owners to have
a read lease on a file. Currently this can't happen, but with the addition
of pNFS layout leases we'll need this feature.
Signed-off-by: Christoph Hellwig <hch@lst.de>
The only user outside of fs/super.c is gone now
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, the ioctl handling code for XFS_IOC_FSSETXATTR treats all
targets as regular files: it refuses to change the extent size if
extents are allocated. This is wrong for directories, as there the
extent size is only used as a default for children.
The patch fixes this issue and improves validation of flag
combinations:
- only disallow extent size changes after extents have been allocated
for regular files
- only allow XFS_XFLAG_EXTSIZE for regular files
- only allow XFS_XFLAG_EXTSZINHERIT for directories
- automatically clear the flags if the extent size is zero
Thanks to Dave Chinner for guidance on the proper fix for this issue.
[dchinner: ported changes onto cleanup series. Makes changes clear
and obvious.]
[dchinner: added comments documenting validity checking rules.]
Signed-off-by: Iustin Pop <iustin@k1024.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The project ID change checking is one of the few remaining open
coded checks in xfs_ioctl_setattr(). Factor it into a helper
function so that the setattr code mostly becomes a flow of check
and action helpers, making it easier to read and follow.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The extent size hint change checking is fairly complex, so isolate
that into it's own function. This simplifies the logic flow of the
setattr code, making it easier to read.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Currently XFS_IOCTL_SETXATTR will fail if run in a user namespace as
it it not allowed to change project IDs. The current code, however,
also prevents any other change being made as well, so things like
extent size hints cannot be set in user namespaces. This is wrong,
so only disallow access to project IDs and related flags from inside
the init namespace.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now there is only one caller to xfs_ioctl_setattr that uses all the
functionality of the function we can kill the behviour mask and
start cleaning up the code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
xfs_ioctl_setxflags doesn't need all of the functionailty in
xfs_ioctl_setattr() and now we have separate helper functions that
share the checks and modifications that xfs_ioctl_setxflags
requires. Hence disaggregate it from xfs_ioctl_setattr() to allow
further work to be done on xfs_ioctl_setattr.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The setup of the transaction is done after a random smattering of
checks and before another bunch of ioperations specific
validity checks. Pull all the preamble out into a helper function
that returns a transaction or error.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The setting of the extended flags is down through two separate
interfaces, but they are munged together into xfs_ioctl_setattr
and make that function far more complex than it needs to be.
Separate it out into a helper function along with all the other
common inode changes and transaction manipulations in
xfs_ioctl_setattr().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
It is set if the filp is set ot non-blocking, but the flag is not
used anywhere. Hence we can kill it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Back in the days when the direct I/O ->end_io callback could be called
from interrupt context for AIO we needed a structure to hand off to the
workqueue, and reused the ioend structure for this purpose. These days
->end_io is always called from user or workqueue context, which allows us
to avoid this memory allocation and simplify the code significantly.
[dchinner: removed now unused xfs_finish_ioend_sync() function after
Brian Foster did an initial review. ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Change kmem_free to use kvfree() generic function, remove the
duplicated code.
Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This logic is duplicated in xfs_file_fallocate and xfs_ioc_space, and
we'll need another copy of it for pNFS block support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
These tests are off by one because if len == sizeof(nfs_export_path)
then we have truncated the name.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Most filesystems prevent truncation of an active swapfile by way of
inode_newsize_ok, called from inode_change_ok. NFS doesn't call either
from nfs_setattr, presumably because most of these checks are expected
to be done server-side. However, the IS_SWAPFILE check can only be done
client-side, and truncating a swapfile can't possibly be good.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Bruce reported seeing this warning pop when mounting using v4.1:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1121 at kernel/sched/core.c:7300 __might_sleep+0xbd/0xd0()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810ff58f>] prepare_to_wait+0x2f/0x90
Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer ppdev joydev snd virtio_console virtio_balloon pcspkr serio_raw parport_pc parport pvpanic floppy soundcore i2c_piix4 virtio_blk virtio_net qxl drm_kms_helper ttm drm virtio_pci virtio_ring ata_generic virtio pata_acpi
CPU: 1 PID: 1121 Comm: nfsv4.1-svc Not tainted 3.19.0-rc4+ #25
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
0000000000000000 000000004e5e3f73 ffff8800b998fb48 ffffffff8186ac78
0000000000000000 ffff8800b998fba0 ffff8800b998fb88 ffffffff810ac9da
ffff8800b998fb68 ffffffff81c923e7 00000000000004d9 0000000000000000
Call Trace:
[<ffffffff8186ac78>] dump_stack+0x4c/0x65
[<ffffffff810ac9da>] warn_slowpath_common+0x8a/0xc0
[<ffffffff810aca65>] warn_slowpath_fmt+0x55/0x70
[<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90
[<ffffffff810ff58f>] ? prepare_to_wait+0x2f/0x90
[<ffffffff810dd2ad>] __might_sleep+0xbd/0xd0
[<ffffffff8124c973>] kmem_cache_alloc_trace+0x243/0x430
[<ffffffff810d941e>] ? groups_alloc+0x3e/0x130
[<ffffffff810d941e>] groups_alloc+0x3e/0x130
[<ffffffffa0301b1e>] svcauth_unix_accept+0x16e/0x290 [sunrpc]
[<ffffffffa0300571>] svc_authenticate+0xe1/0xf0 [sunrpc]
[<ffffffffa02fc564>] svc_process_common+0x244/0x6a0 [sunrpc]
[<ffffffffa02fd044>] bc_svc_process+0x1c4/0x260 [sunrpc]
[<ffffffffa03d5478>] nfs41_callback_svc+0x128/0x1f0 [nfsv4]
[<ffffffff810ff970>] ? wait_woken+0xc0/0xc0
[<ffffffffa03d5350>] ? nfs4_callback_svc+0x60/0x60 [nfsv4]
[<ffffffff810d45bf>] kthread+0x11f/0x140
[<ffffffff810ea815>] ? local_clock+0x15/0x30
[<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250
[<ffffffff81874bfc>] ret_from_fork+0x7c/0xb0
[<ffffffff810d44a0>] ? kthread_create_on_node+0x250/0x250
---[ end trace 675220a11e30f4f2 ]---
nfs41_callback_svc does most of its work while in TASK_INTERRUPTIBLE,
which is just wrong. Fix that by finishing the wait immediately if we've
found that the list has something on it.
Also, we don't expect this kthread to accept signals, so we should be
using a TASK_UNINTERRUPTIBLE sleep instead. That however, opens us up
hung task warnings from the watchdog, so have the schedule_timeout
wake up every 60s if there's no callback activity.
Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull btrfs fix from Chris Mason:
"We have one more fix for btrfs in my for-linus branch - this was a bug
in the new raid5/6 scrubbing support"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix raid56 scrub failed in xfstests btrfs/072
Pull quota and UDF fix from Jan Kara:
"A fix for UDF to properly free preallocated blocks and a fix for quota
so that Q_GETQUOTA quotactl reports correct numbers for XFS filesystem
(and similarly Q_XGETQUOTA quotactl works properly for other
filesystems)"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: Switch ->get_dqblk() and ->set_dqblk() to use bytes as space units
udf: Release preallocation on last writeable close
Currently maximum space limit quota format supports is in blocks however
since we store space limits in bytes, this is somewhat confusing. So
store the maximum limit in bytes as well. Also rename the field to match
the new unit and related inode field to match the new naming scheme.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Ocfs2 can just use the generic helpers provided by quota code for
turning quotas on and off when quota files are stored as system inodes.
The only difference is the feature test in ocfs2_quota_on() and that is
covered by dquot_quota_enable() checking whether usage tracking is
enabled (which can happen only if the filesystem has the quota feature
set).
Signed-off-by: Jan Kara <jack@suse.cz>
Ext4 can just use the generic helpers provided by quota code for turning
quotas on and off when quota files are stored as system inodes. The only
difference is the feature test in ext4_quota_on_sysfile() but the same
is achieved in dquot_quota_enable() by checking whether usage tracking
for the corresponding quota type is enabled (which can happen only if
quota feature is set).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Add functions which translate ->quota_enable / ->quota_disable calls
into appropriate changes in VFS quota. This will enable filesystems
supporting VFS quota files in system inodes to be controlled via
Q_XQUOTA[ON|OFF] quotactls for better userspace compatibility.
Also provide a vector for quotactl using these functions which can be
used by filesystems with quota files stored in hidden system files.
Signed-off-by: Jan Kara <jack@suse.cz>
Make Q_QUOTAON / Q_QUOTAOFF quotactl call ->quota_enable /
->quota_disable callback when provided. To match current behavior of
ocfs2 & ext4 we make these quotactls turn on / off quota enforcement for
appropriate quota type.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Split ->set_xstate callback into two callbacks - one for turning quotas
on (->quota_enable) and one for turning quotas off (->quota_disable). That
way we don't have to pass quotactl command into the callback which seems
cleaner.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Highlights include:
- Stable fix for a NFSv4.1 Oops on mount
- Stable fix for an O_DIRECT deadlock condition
- Fix an issue with submounted volumes and fake duplicate inode numbers
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUyqe0AAoJEGcL54qWCgDyIDcP/RZmf7SasgP+i1c4oWGMrqsn
69DaPRjakbCEJcX00DIp3x2Ulq4uUgJ5v9WDEPkQBKe2lAg7uaQBTzY6x1erDs+3
bBZFXd0SByfRD8M6UXQwxtbxksBfozItLNU7vbln/0BiIoabO8QVxNhX6fMfK7yt
tl4Ik/Dr/o3Oe3pVR0wW+0eqEYcGGXzTpPdYoN470z1/DjV9w5rkm3HsPzoq3QHI
CN3acBJXFrWdCMSJcyzjmeaWRO93FwLSwKS4KKB7hcFY1+pUgA9qhnA5W4GY/0/7
1ovakR778CPD2YAaQuY9iJG561QixiTSRQHRWoI6LgawQ3JyBMt/eqtcfFI6Olxp
Mjs+EMjf6hlrFUpBNZ/8dAROJYSU35sZ6y9C5Ch7ZGBhcgbiQCOVyKjNiBWRG8KX
+xnufzs4tnSIScMHuPyGYpkXy/xEdRhbtm+NjVbvel7Hr37q0jlV2M0/4i54HJ8U
UWMyHfGFnODKkFGByJc9w1XWSGcEZmRsCAsevGI9Yzf+FSTxrdkLWwdVFFOonHQw
OfAlYmO29v4zdTbl1dCgU1pyqdDb47Js/07TQCLb2YcRiUfAuysUTFtBSkRDbwCf
eYKnRvXahPqTB84TDM62HUCJ7y4D945wt+/6FvuYnDYgOeTARXGzAZ+jhv1i8UbJ
8xX2xcdW6fe2DIfZM/9h
=eGYe
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.19-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- Stable fix for a NFSv4.1 Oops on mount
- Stable fix for an O_DIRECT deadlock condition
- Fix an issue with submounted volumes and fake duplicate inode
numbers"
* tag 'nfs-for-3.19-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix use of nfs_attr_use_mounted_on_fileid()
NFSv4.1: Fix an Oops in nfs41_walk_client_list
nfs: fix dio deadlock when O_DIRECT flag is flipped
since that's a more logical and accurate place - Leif Lindholm
* Update efibootmgr URL in Kconfig help - Peter Jones
* Improve accuracy of EFI guid function names - Borislav Petkov
* Expose firmware platform size in sysfs for the benefit of EFI boot
loader installers and other utilities - Steve McIntyre
* Cleanup __init annotations for arm64/efi code - Ard Biesheuvel
* Mark the UIE as unsupported for rtc-efi - Ard Biesheuvel
* Fix memory leak in error code path of runtime map code - Dan Carpenter
* Improve robustness of get_memory_map() by removing assumptions on the
size of efi_memory_desc_t (which could change in future spec
versions) and querying the firmware instead of guessing about the
memmap size - Ard Biesheuvel
* Remove superfluous guid unparse calls - Ivan Khoronzhuk
* Delete unnecessary chosen@0 DT node FDT code since was duplicated
from code in drivers/of and is entirely unnecessary - Leif Lindholm
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUv69oAAoJEC84WcCNIz1VEYgP/1b27WRfCXs4q/8FP+UheSDS
nAFbGe9PjVPnxo5pA9VwPP6eNQ2zYiyNGEK1BlbQlFPZdSD1updIraA78CiF5iys
iSYyG9xVIcTB23RZI8aJLnBXbosIUKPJZ3FORv1LPhI6Mz1rCpraEaaUlv67rUKr
FLBG9cR7t9f/f+fJw6LOAAISGIG/4s0wQdA5/noaYkj5R5bICl2UTGtbwa0oNstb
NUO93aKDgaG/VljpIEeG6XV96Ioz7cHjQsEaX8sTrvT0n7nPNIqSDjFJOqWKJOXl
RsFrzyl8fFIbMuQatYv1f3efPvyH+iKOfHnHrvcjUNje0xhm7F0Bd86BkOw1a3JQ
pNb0YUWecI0Z/8GSzN8X0JQ7cowa3wI15Z/Hfs03odTXiM6VqwFAhuz/s5DEUdKS
U+rOPjU0ezt3G4oBB/VGgF9w5JWKfsMcsHgmLX9P+JYzKFrxggo1SXAtXUeRAqQp
agKmUB+k6Y1baQO8efkoM7rKL2F0q1SR9QiK+16BHCCkevD23v7IFGrHm2r1xKil
kvWlY4MkRVa4KGPxEFEDVty0HjXxImwYsxTaYVHTS7SMeoP41f6koHKB19NaB3No
5fqn/rT1KcJuhQj/I+vAixIX4WMJkX/MQVbtKfqSaKlAiRg3eRY6ONYr0jOglfF6
gaMuvmDd0HlV6UJvH/9L
=iPpM
-----END PGP SIGNATURE-----
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi
Pull EFI updates from Matt Fleming:
" - Move efivarfs from the misc filesystem section to pseudo filesystem,
since that's a more logical and accurate place - Leif Lindholm
- Update efibootmgr URL in Kconfig help - Peter Jones
- Improve accuracy of EFI guid function names - Borislav Petkov
- Expose firmware platform size in sysfs for the benefit of EFI boot
loader installers and other utilities - Steve McIntyre
- Cleanup __init annotations for arm64/efi code - Ard Biesheuvel
- Mark the UIE as unsupported for rtc-efi - Ard Biesheuvel
- Fix memory leak in error code path of runtime map code - Dan Carpenter
- Improve robustness of get_memory_map() by removing assumptions on the
size of efi_memory_desc_t (which could change in future spec
versions) and querying the firmware instead of guessing about the
memmap size - Ard Biesheuvel
- Remove superfluous guid unparse calls - Ivan Khoronzhuk
- Delete unnecessary chosen@0 DT node FDT code since was duplicated
from code in drivers/of and is entirely unnecessary - Leif Lindholm
There's nothing super scary, mainly cleanups, and a merge from Ricardo who
kindly picked up some patches from the linux-efi mailing list while I
was out on annual leave in December.
Perhaps the biggest risk is the get_memory_map() change from Ard, which
changes the way that both the arm64 and x86 EFI boot stub build the
early memory map. It would be good to have it bake in linux-next for a
while.
"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Simple helpers that pass an arbitrary iov_iter to filesystems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... to catch possible memory corruptions.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
This patch adds ubifs_err() output to some error paths to tell the user
what's going on.
Artem: improve the messages, rename too long variable
Signed-off-by: Subodh Nijsure <snijsure@grid-net.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Ben Shelton <ben.shelton@ni.com>
Acked-by: Brad Mouring <brad.mouring@ni.com>
Acked-by: Terry Wilcox <terry.wilcox@ni.com>
Acked-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Artem: rename static functions so that they do not use the "ubifs_" prefix - we
only use this prefix for non-static functions.
Artem: remove few junk white-space changes in file.c
Signed-off-by: Subodh Nijsure <snijsure@grid-net.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Ben Shelton <ben.shelton@ni.com>
Acked-by: Brad Mouring <brad.mouring@ni.com>
Acked-by: Terry Wilcox <terry.wilcox@ni.com>
Acked-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Currently ->get_dqblk() and ->set_dqblk() use struct fs_disk_quota which
tracks space limits and usage in 512-byte blocks. However VFS quotas
track usage in bytes (as some filesystems require that) and we need to
somehow pass this information. Upto now it wasn't a problem because we
didn't do any unit conversion (thus VFS quota routines happily stuck
number of bytes into d_bcount field of struct fd_disk_quota). Only if
you tried to use Q_XGETQUOTA or Q_XSETQLIM for VFS quotas (or Q_GETQUOTA
/ Q_SETQUOTA for XFS quotas), you got bogus results. Hardly anyone
tried this but reportedly some Samba users hit the problem in practice.
So when we want interfaces compatible we need to fix this.
We bite the bullet and define another quota structure used for passing
information from/to ->get_dqblk()/->set_dqblk. It's somewhat sad we have
to have more conversion routines in fs/quota/quota.c and another copying
of quota structure slows down getting of quota information by about 2%
but it seems cleaner than overloading e.g. units of d_bcount to bytes.
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Commit 6fb1ca92a6 "udf: Fix race between write(2) and close(2)"
changed the condition when preallocation is released. The idea was that
we don't want to release the preallocation for an inode on close when
there are other writeable file descriptors for the inode. However the
condition was written in the opposite way so we released preallocation
only if there were other writeable file descriptors. Fix the problem by
changing the condition properly.
CC: stable@vger.kernel.org
Fixes: 6fb1ca92a6
Reported-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
Conflicts:
arch/arm/boot/dts/imx6sx-sdb.dts
net/sched/cls_bpf.c
Two simple sets of overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
The xfstests btrfs/072 reports uncorrectable read errors in dmesg,
because scrub forgets to use commit_root for parity scrub routine
and scrub attempts to scrub those extents items whose contents are
not fully on disk.
To fix it, we just add the @search_commit_root flag back.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
If CONFIG_CIFS_WEAK_PW_HASH is not set, CIFSSEC_MUST_LANMAN
and CIFSSEC_MUST_PLNTXT is defined as 0.
When setting new SecurityFlags without any MUST flags,
your flags would be overwritten with CIFSSEC_MUST_LANMAN (0).
Signed-off-by: Niklas Cassel <niklass@axis.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
This patch just removes a goto that did nothing.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We can be more aggressive about this, if we are clever and careful. This is subtle.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are only 3 callers and quite a bit of that thing is executed
exactly in one of those. Just lift it there...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
What we want is to have non-counting references to children in
pagecache of parent directory, and avoid picking them after a child
has been freed. Fine, so let's just have ->d_prune() clear
parent's inode "has directory contents in page cache" flag.
That way we don't need ->d_fsdata for storing offsets, so we can
use it as a quick and dirty "is it referenced from page cache"
flag.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs fixes from Al Viro:
"A couple of fixes - deadlock in CIFS and build breakage in cris serial
driver (resurfaced f_dentry in there)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Convert file->f_dentry->d_inode to file_inode()
fix deadlock in cifs_ioctl_clone()
Ensure that we deal correctly with the case where the server sends us a
newer instance of the same delegation. If the stateids match, but the
sequence numbers differ, then treat the new delegation as if it were
an atomic upgrade.
Signed-off-by: Trond Myklebust <Trond.Myklebust@primarydata.com>
Replace the current code with something that is a little closer to what
net/sunrpc/auth_unix.c uses.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Optimise the layout return on close code by ensuring that
1) Add a check for whether we hold a layout before taking any spinlocks
2) Only take the spin lock once
3) Use nfs_state->state to speed up open file checks
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Note, however, that we still serialise on the open stateid if the lock
stateid is unconfirmed. Hopefully that will not prove too much of a
burden for first time locks; it should leave the ability to parallelise
OPENs unchanged, since they no longer call the serialisation primitives.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Ensure that we test the lock stateid remained unchanged while we were
updating the VFS tracking of the byte range lock. Have the process
replay the lock to the server if we detect that was not the case.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch ensures that the server cannot reorder our LOCK/LOCKU
requests if they are sent in parallel on the wire.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The original text in RFC3530 was terribly confusing since it conflated
lockowners and lock stateids. RFC3530bis clarifies that you must use
open_to_lock_owner when there is no lock state for that file+lockowner
combination.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we update the lock stateid, we really do need to ensure that this is
done under the state->state_lock, and that we are indeed only updating
confirmed locks with a newer version of the same stateid.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Remove the serialisation of OPEN/OPEN_DOWNGRADE and CLOSE calls for the
case of NFSv4.1 and newer.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we relax the sequencing on the NFSv4.1 OPEN/CLOSE code, we will want
to use the value NULL to indicate that no sequencing is needed.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If an OPEN RPC call races with a CLOSE or OPEN_DOWNGRADE so that it
updates the nfs_state structure before the CLOSE/OPEN_DOWNGRADE has
a chance to do so, then we know that the state->flags need to be
recalculated from scratch.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Pull btrfs fixes from Chris Mason:
"We have a few fixes in my for-linus branch.
Qu Wenruo's batch fix a regression between some our merge window pull
and the inode_cache feature. The rest are smaller bugs"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
btrfs: Fix the bug that fs_info->pending_changes is never cleared.
btrfs: fix state->private cast on 32 bit machines
Btrfs: fix race deleting block group from space_info->ro_bgs list
Btrfs: fix incorrect freeing in scrub_stripe
btrfs: sync ioctl, handle errors after transaction start
If we are to remove the serialisation of OPEN/CLOSE, then we need to
ensure that the stateid sent as part of a CLOSE operation does not
change after we test the state in nfs4_close_prepare.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The BKL is completely out of the picture in the lockd and sunrpc code
these days. Update the antiquated comments that refer to it.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Someone with a weird time_t happened to notice this, it shouldn't really
manifest till 2038. It may not be our ownly year-2038 problem.
Reported-by: Aaron Pace <Aaron.Pace@alcatel-lucent.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In order to ensure that filenames are not released before the audit
subsystem is done with the strings there are a number of hacks built
into the fs and audit subsystems around getname() and putname(). To
say these hacks are "ugly" would be kind.
This patch removes the filename hackery in favor of a more
conventional reference count based approach. The diffstat below tells
most of the story; lots of audit/fs specific code is replaced with a
traditional reference count based approach that is easily understood,
even by those not familiar with the audit and/or fs subsystems.
CC: viro@zeniv.linux.org.uk
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Enable recording of filenames in getname_kernel() and remove the
kludgy workaround in __audit_inode() now that we have proper filename
logging for kernel users.
CC: viro@zeniv.linux.org.uk
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a) make it accept ERR_PTR() as filename (and return its PTR_ERR() in that case)
b) make it putname() the sucker in the end otherwise
simplifies life for callers...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There are several areas in the kernel that create temporary filename
objects using the following pattern:
int func(const char *name)
{
struct filename *file = { .name = name };
...
return 0;
}
... which for the most part works okay, but it causes havoc within the
audit subsystem as the filename object does not persist beyond the
lifetime of the function. This patch converts all of these temporary
filename objects into proper filename objects using getname_kernel()
and putname() which ensure that the filename object persists until the
audit subsystem is finished with it.
Also, a special thanks to Al Viro, Guenter Roeck, and Sabrina Dubroca
for helping resolve a difficult kernel panic on boot related to a
use-after-free problem in kern_path_create(); the thread can be seen
at the link below:
* https://lkml.org/lkml/2015/1/20/710
This patch includes code that was either based on, or directly written
by Al in the above thread.
CC: viro@zeniv.linux.org.uk
CC: linux@roeck-us.net
CC: sd@queasysnail.net
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In preparation for expanded use in the kernel, make getname_kernel()
more useful by allowing it to handle any legal filename length.
Thanks to Guenter Roeck for his suggestion to substitute memcpy() for
strlcpy().
CC: linux@roeck-us.net
CC: viro@zeniv.linux.org.uk
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
xfs_compat_attrmulti_by_handle() calls memdup_user() which returns a
negative error code. The error code is negated by the caller and thus
incorrectly converted to a positive error code.
Remove the error negation such that the negative error is passed
correctly back up to userspace.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When the superblock is modified in a transaction, the commonly
modified fields are not actually copied to the superblock buffer to
avoid the buffer lock becoming a serialisation point. However, there
are some other operations that modify the superblock fields within
the transaction that don't directly log to the superblock but rely
on the changes to be applied during the transaction commit (to
minimise the buffer lock hold time).
When we do this, we fail to mark the buffer log item as being a
superblock buffer and that can lead to the buffer not being marked
with the corect type in the log and hence causing recovery issues.
Fix it by setting the type correctly, similar to xfs_mod_sb()...
cc: <stable@vger.kernel.org> # 3.10 to current
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Conversion from local to extent format does not set the buffer type
correctly on the new extent buffer when a symlink data is moved out
of line.
Fix the symlink code and leave a comment in the generic bmap code
reminding us that the format-specific data copy needs to set the
destination buffer type appropriately.
cc: <stable@vger.kernel.org> # 3.10 to current
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This leads to log recovery throwing errors like:
XFS (md0): Mounting V5 Filesystem
XFS (md0): Starting recovery (logdev: internal)
XFS (md0): Unknown buffer type 0!
XFS (md0): _xfs_buf_ioapply: no ops on block 0xaea8802/0x1
ffff8800ffc53800: 58 41 47 49 .....
Which is the AGI buffer magic number.
Ensure that we set the type appropriately in both unlink list
addition and removal.
cc: <stable@vger.kernel.org> # 3.10 to current
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Jan Kara reported that log recovery was finding buffers with invalid
types in them. This should not happen, and indicates a bug in the
logging of buffers. To catch this, add asserts to the buffer
formatting code to ensure that the buffer type is in range when the
transaction is committed.
We don't set a type on buffers being marked stale - they are not
going to get replayed, the format item exists only for recovery to
be able to prevent replay of the buffer, so the type does not
matter. Hence that needs special casing here.
cc: <stable@vger.kernel.org> # 3.10 to current
Reported-by: Jan Kara <jack@suse.cz>
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This function call was being optimized out during nfs_fhget(), leading
to situations where we have a valid fileid but still want to use the
mounted_on_fileid. For example, imagine we have our server configured
like this:
server % df
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 9.1G 6.5G 1.9G 78% /
/dev/vdb1 487M 2.3M 456M 1% /exports
/dev/vdc1 487M 2.3M 456M 1% /exports/vol1
/dev/vdd1 487M 2.3M 456M 1% /exports/vol2
If our client mounts /exports and tries to do a "chown -R" across the
entire mountpoint, we will get a nasty message warning us about a circular
directory structure. Running chown with strace tells me that each directory
has the same device and inode number:
newfstatat(AT_FDCWD, "/nfs/", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
newfstatat(4, "vol1", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
newfstatat(4, "vol2", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
With this patch the mounted_on_fileid values are used for st_ino, so the
directory loop warning isn't reported.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We currently have to ensure that every time we update sb_features2
that we update sb_bad_features2. Now that we log and format the
superblock in it's entirety we actually don't have to care because
we can simply update the sb_bad_features2 when we format it into the
buffer. This removes the need for anything but the mount and
superblock formatting code to care about sb_bad_features2, and
hence removes the possibility that we forget to update bad_features2
when necessary in the future.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We now have several superblock loggin functions that are identical
except for the transaction reservation and whether it shoul dbe a
synchronous transaction or not. Consolidate these all into a single
function, a single reserveration and a sync flag and call it
xfs_sync_sb().
Also, xfs_mod_sb() is not really a modification function - it's the
operation of logging the superblock buffer. hence change the name of
it to reflect this.
Note that we have to change the mp->m_update_flags that are passed
around at mount time to a boolean simply to indicate a superblock
update is needed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
When we log changes to the superblock, we first have to write them
to the on-disk buffer, and then log that. Right now we have a
complex bitfield based arrangement to only write the modified field
to the buffer before we log it.
This used to be necessary as a performance optimisation because we
logged the superblock buffer in every extent or inode allocation or
freeing, and so performance was extremely important. We haven't done
this for years, however, ever since the lazy superblock counters
pulled the superblock logging out of the transaction commit
fast path.
Hence we have a bunch of complexity that is not necessary that makes
writing the in-core superblock to disk much more complex than it
needs to be. We only need to log the superblock now during
management operations (e.g. during mount, unmount or quota control
operations) so it is not a performance critical path anymore.
As such, remove the complex field based logging mechanism and
replace it with a simple conversion function similar to what we use
for all other on-disk structures.
This means we always log the entirity of the superblock, but again
because we rarely modify the superblock this is not an issue for log
bandwidth or CPU time. Indeed, if we do log the superblock
frequently, delayed logging will minimise the impact of this
overhead.
[Fixed gquota/pquota inode sharing regression noticed by bfoster.]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
We only support swap file calling nfs_direct_IO. However, application
might be able to get to nfs_direct_IO if it toggles O_DIRECT flag
during IO and it can deadlock because we grab inode->i_mutex in
nfs_file_direct_write(). So return 0 for such case. Then the generic
layer will fall back to buffer IO.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
xfs_fs_get_xstate() and xfs_fs_get_xstatev() check whether there's quota
running before calling xfs_qm_scall_getqstat() or
xfs_qm_scall_getqstatv(). Thus we are certain that superblock supports
quota and xfs_sb_version_hasquota() check is pointless. Similarly we
know that when quota is running, mp->m_quotainfo will be allocated.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
'flags' have XFS_ALL_QUOTA_ACCT cleared immediately on function entry.
There's no point in checking these bits later in the function. Also
because we check something is going to change, we know some enforcement
bits are being added and thus there's no point in testing that later.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Q_XQUOTARM is never passed to xfs_fs_set_xstate() so remove the test.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently flags passed via Q_SETINFO were just stored. This makes it
hard to add new flags since in theory userspace could be just setting /
clearing random flags. Since currently there is only one userspace
settable flag and that is somewhat obscure flags only for ancient v1
quota format, I'm reasonably sure noone operates these flags and
hopefully we are fine just adding the check that passed flags are sane.
If we indeed find some userspace program that gets broken by the strict
check, we can always remove it again.
Signed-off-by: Jan Kara <jack@suse.cz>
Currently all quota flags were defined just in kernel-private headers.
Export flags readable / writeable from userspace to userspace via
include/uapi/linux/quota.h.
Signed-off-by: Jan Kara <jack@suse.cz>
OLQF_CLEAN flag is used by OCFS2 on disk to recognize whether quota
recovery is needed or not. We also somewhat abuse mem_dqinfo->dqi_flags
to pass this flag around. Use private flags for this to avoid clashes
with other quota flags / not pollute generic quota flag namespace.
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, v2 quota format blindly stored flags from in-memory dqinfo on
disk, although there are no flags supported. Since it is stupid to store
flags which have no effect, just store 0 unconditionally and don't
bother loading it from disk.
Note that userspace could have stored some flags there via Q_SETINFO
quotactl and then later read them (although flags have no effect) but
I'm pretty sure noone does that (most definitely quota-tools don't and
quota interface doesn't have too much other users).
Signed-off-by: Jan Kara <jack@suse.cz>
Pull RCU updates from Paul E. McKenney:
- Documentation updates.
- Miscellaneous fixes.
- Preemptible-RCU fixes, including fixing an old bug in the
interaction of RCU priority boosting and CPU hotplug.
- SRCU updates.
- RCU CPU stall-warning updates.
- RCU torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 6b5fe46dfa (btrfs: do commit in sync_fs if there are pending
changes) will call btrfs_start_transaction() in sync_fs(), to handle
some operations needed to be done in next transaction.
However this can cause deadlock if the filesystem is frozen, with the
following sys_r+w output:
[ 143.255932] Call Trace:
[ 143.255936] [<ffffffff816c0e09>] schedule+0x29/0x70
[ 143.255939] [<ffffffff811cb7f3>] __sb_start_write+0xb3/0x100
[ 143.255971] [<ffffffffa040ec06>] start_transaction+0x2e6/0x5a0
[btrfs]
[ 143.255992] [<ffffffffa040f1eb>] btrfs_start_transaction+0x1b/0x20
[btrfs]
[ 143.256003] [<ffffffffa03dc0ba>] btrfs_sync_fs+0xca/0xd0 [btrfs]
[ 143.256007] [<ffffffff811f7be0>] sync_fs_one_sb+0x20/0x30
[ 143.256011] [<ffffffff811cbd01>] iterate_supers+0xe1/0xf0
[ 143.256014] [<ffffffff811f7d75>] sys_sync+0x55/0x90
[ 143.256017] [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17
[ 143.256111] Call Trace:
[ 143.256114] [<ffffffff816c0e09>] schedule+0x29/0x70
[ 143.256119] [<ffffffff816c3405>] rwsem_down_write_failed+0x1c5/0x2d0
[ 143.256123] [<ffffffff8133f013>] call_rwsem_down_write_failed+0x13/0x20
[ 143.256131] [<ffffffff811caae8>] thaw_super+0x28/0xc0
[ 143.256135] [<ffffffff811db3e5>] do_vfs_ioctl+0x3f5/0x540
[ 143.256187] [<ffffffff811db5c1>] SyS_ioctl+0x91/0xb0
[ 143.256213] [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17
The reason is like the following:
(Holding s_umount)
VFS sync_fs staff:
|- btrfs_sync_fs()
|- btrfs_start_transaction()
|- sb_start_intwrite()
(Waiting thaw_fs to unfreeze)
VFS thaw_fs staff:
thaw_fs()
(Waiting sync_fs to release
s_umount)
So deadlock happens.
This can be easily triggered by fstest/generic/068 with inode_cache
mount option.
The fix is to check if the fs is frozen, if the fs is frozen, just
return and waiting for the next transaction.
Cc: David Sterba <dsterba@suse.cz>
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[enhanced comment, changed to SB_FREEZE_WRITE]
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Fs_info->pending_changes is never cleared since the original code uses
cmpxchg(&fs_info->pending_changes, 0, 0), which will only clear it if
pending_changes is already 0.
This will cause a lot of problem when mount it with inode_cache mount
option.
If the btrfs is mounted as inode_cache, pending_changes will always be
1, even when the fs is frozen.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Now that default_backing_dev_info is not used for writeback purposes we can
git rid of it easily:
- instead of using it's name for tracing unregistered bdi we just use
"unknown"
- btrfs and ceph can just assign the default read ahead window themselves
like several other filesystems already do.
- we can assign noop_backing_dev_info as the default one in alloc_super.
All filesystems already either assigned their own or
noop_backing_dev_info.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdi_destroy already does all the work, and if we delay freeing the
anon bdev we can get away with just that single call.
Addintionally remove the call during mount failure, as
deactivate_super_locked will already call ->kill_sb and clean up
the bdi for us.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdi_destroy already does all the work, and if we delay freeing the
anon bdev we can get away with just that single call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we got rid of the bdi abuse on character devices we can always use
sb->s_bdi to get at the backing_dev_info for a file, except for the block
device special case. Export inode_to_bdi and replace uses of
mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
mapping->backing_dev_info will go away, so don't rely on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Directly grab the backing_dev_info from the request_queue instead of
detouring through the address_space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since 018a17bdc8 ("bdi: reimplement bdev_inode_switch_bdi()") the
block device code writes out all dirty data whenever switching the
backing_dev_info for a block device inode. But a block device inode can
only be dirtied when it is in use, which means we only have to write it
out on the final blkdev_put, but not when doing a blkdev_get.
Factoring out the write out from the bdi list switch prepares from
removing the list switch later in the series.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since "BDI: Provide backing device capability information [try #3]" the
backing_dev_info structure also provides flags for the kind of mmap
operation available in a nommu environment, which is entirely unrelated
to it's original purpose.
Introduce a new nommu-only file operation to provide this information to
the nommu mmap code instead. Splitting this from the backing_dev_info
structure allows to remove lots of backing_dev_info instance that aren't
otherwise needed, and entirely gets rid of the concept of providing a
backing_dev_info for a character device. It also removes the need for
the mtd_inodefs filesystem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
hugetlbfs, kernfs and dlmfs can simply use noop_backing_dev_info instead
of creating a local duplicate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit
c11f1df500
requires writers to wait for any pending oplock break handler to
complete before proceeding to write. This is done by waiting on bit
CIFS_INODE_PENDING_OPLOCK_BREAK in cifsFileInfo->flags. This bit is
cleared by the oplock break handler job queued on the workqueue once it
has completed handling the oplock break allowing writers to proceed with
writing to the file.
While testing, it was noticed that the filehandle could be closed while
there is a pending oplock break which results in the oplock break
handler on the cifsiod workqueue being cancelled before it has had a
chance to execute and clear the CIFS_INODE_PENDING_OPLOCK_BREAK bit.
Any subsequent attempt to write to this file hangs waiting for the
CIFS_INODE_PENDING_OPLOCK_BREAK bit to be cleared.
We fix this by ensuring that we also clear the bit
CIFS_INODE_PENDING_OPLOCK_BREAK when we remove the oplock break handler
from the workqueue.
The bug was found by Red Hat QA while testing using ltp's fsstress
command.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <steve.french@primarydata.com>
When leaving a function use memzero_explicit instead of memset(0) to
clear stack allocated buffers. memset(0) may be optimized away.
This particular buffer is highly likely to contain sensitive data which
we shouldn't leak (it's named 'passwd' after all).
Signed-off-by: Giel van Schijndel <me@mortis.eu>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-at: http://www.viva64.com/en/b/0299/
Reported-by: Andrey Karpov
Reported-by: Svyatoslav Razmyslov
Signed-off-by: Steve French <steve.french@primarydata.com>
Suppress the following warning displayed on building 32bit (i686) kernel.
===============================================================================
...
CC [M] fs/btrfs/extent_io.o
fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’:
fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of
different size [-Wint-to-pointer-cast]
failrec = (struct io_failure_record *)state->private;
...
===============================================================================
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Reported-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Chris Mason <clm@fb.com>
When removing a block group we were deleting it from its space_info's
ro_bgs list without the correct protection - the space info's spinlock.
Fix this by doing the list delete while holding the spinlock of the
corresponding space info, which is the correct lock for any operation
on that list.
This issue was introduced in the 3.19 kernel by the following change:
Btrfs: move read only block groups onto their own list V2
commit 633c0aad4c
I ran into a kernel crash while a task was running statfs, which iterates
the space_info->ro_bgs list while holding the space info's spinlock,
and another task was deleting it from the same list, without holding that
spinlock, as part of the block group remove operation (while running the
function btrfs_remove_block_group). This happened often when running the
stress test xfstests/generic/038 I recently made.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The address that should be freed is not 'ppath' but 'path'.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
The version merged to 3.19 did not handle errors from start_trancaction
and could pass an invalid pointer to commit_transaction.
Fixes: 6b5fe46dfa ("btrfs: do commit in sync_fs if there are pending changes")
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
It really needs to check that src is non-directory *and* use
{un,}lock_two_nodirectories(). As it is, it's trivial to cause
double-lock (ioctl(fd, CIFS_IOC_COPYCHUNK_FILE, fd)) and if the
last argument is an fd of directory, we are asking for trouble
by violating the locking order - all directories go before all
non-directories. If the last argument is an fd of parent
directory, it has 50% odds of locking child before parent,
which will cause AB-BA deadlock if we race with unlink().
Cc: stable@vger.kernel.org @ 3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.
This makes the very common pattern of
if (genlmsg_end(...) < 0) { ... }
be a whole bunch of dead code. Many places also simply do
return nlmsg_end(...);
and the caller is expected to deal with it.
This also commonly (at least for me) causes errors, because it is very
common to write
if (my_function(...))
/* error condition */
and if my_function() does "return nlmsg_end()" this is of course wrong.
Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.
Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did
- return nlmsg_end(...);
+ nlmsg_end(...);
+ return 0;
I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.
One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should use sprintf format specifier "%u" instead of "%d" for
argument of type 'unsigned int' in pstore_dump().
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
A secured user-space accessible pstore object. Writes
to /dev/pmsg0 are appended to the buffer, on reboot
the persistent contents are available in
/sys/fs/pstore/pmsg-ramoops-[ID].
One possible use is syslogd, or other daemon, can
write messages, then on reboot provides a means to
triage user-space activities leading up to a panic
as a companion to the pstore dmesg or console logs.
Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
ramoops_pstore_read fails to return the next in a prz
series after first zero-sized entry, not venturing to
the next non-zero entry.
Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
All previous checks will fail with error if memory size
is not sufficient to register a zone, so this legacy
check has become redundant.
Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
No guarantees that the names will not exceed the
name buffer with future adjustments.
Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
We have each of the locks_remove_* variants doing this individually.
Have the caller do it instead, and have locks_remove_flock and
locks_remove_lease just assume that it's a valid pointer.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
This makes things a bit more efficient in the cifs and ceph lock
pushing code.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Now that we use standard list_heads for tracking leases, we can have
lm_change take a pointer to the lease to be modified instead of a
double pointer.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
We can now add a dedicated spinlock without expanding struct inode.
Change to using that to protect the various i_flctx lists.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
There is only a single call site for each of these functions, and the
caller takes the i_lock prior to calling them and drops it just
afterward. Move the spinlocking into the functions instead.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
The current scheme of using the i_flock list is really difficult to
manage. There is also a legitimate desire for a per-inode spinlock to
manage these lists that isn't the i_lock.
Start conversion to a new scheme to eventually replace the old i_flock
list with a new "file_lock_context" object.
We start by adding a new i_flctx to struct inode. For now, it lives in
parallel with i_flock list, but will eventually replace it. The idea is
to allocate a structure to sit in that pointer and act as a locus for
all things file locking.
We allocate a file_lock_context for an inode when the first lock is
added to it, and it's only freed when the inode is freed. We use the
i_lock to protect the assignment, but afterward it should mostly be
accessed locklessly.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
...instead of open-coding it and removing flock locks directly. This
helps consolidate the flock lock removal logic into a single spot.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
...that we can use to queue file_locks to per-ctx list_heads. Go ahead
and convert locks_delete_lock and locks_dispose_list to use it instead
of the fl_block list.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Here is one kernfs fix for a reported issue for 3.19-rc5.
It has been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlS5V8QACgkQMUfUDdst+ynQTgCdEOUn6oftKCkErl4WWX9q0+ZT
4CIAoLuGH9Gdn5tIVlqJ1tVmESnsgn0T
=P84S
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fix from Greg KH:
"Here is one kernfs fix for a reported issue for 3.19-rc5.
It has been in linux-next for a while"
* tag 'driver-core-3.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
kernfs: Fix kernfs_name_compare
Highlights include:
- Stable fix for a NFSv3/lockd race
- Fixes for several NFSv4.1 client id trunking bugs
- Remove an incorrect test when checking for delegated opens
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUuSAcAAoJEGcL54qWCgDyVGkQAJTTzJ8TcuDaXPz7a+8vYR/F
0Z0iJ7Q2VzfVwLKReQdSMUhX8E6OsVe/HlmFT1paNdIOnF7gQvTclP7/GZ/EaDtA
csBKkEcxoCRbF6OhCdeSMJC+iXs0cugCgYrkusPSGJj8dG+qa6Hix8rDfZKbM4yp
sQphMozeI/FLHlckBTsViBCECcyR1FxjOrpXc1RPTz1u2dw9dDgZABdX00ubve2Y
5dGIGyriFUiZzqFp+7GLXa2GITuLnda+IPPywyzF2MQy4XYWH63bXoWbyXL3kzQ7
P7hLsGMOfl+pFOBnwYyXhhfqdsS/lOt8V+18Rrr1NhFx7VtQUae/VDPwLxnnoq0P
ZtYiuJmIihzHfr4N5V2QR9TcuNO7fg6/1hxHZkOYxaYkH2IezFRWbtj9J/UJS+8q
XDQzEvXMxtXCHKDQvmW653oQIdCFRR9YvgsN/YaxIB3vWPKL2tupeQWe/ZEdVvX3
WemIxyc6NBS86otauVXOIKZFgiJ4mynpgH22xb43CJ33b049vJVu/NOQWHgrhHqQ
7vpNDvpjyErjtoMVdN/x2L21uVTucZDXF59hfUD/Q9kE5qXGDRlqiFsDbkVvd+0C
SXKBngtslBvgU3dYblAuJ9uMkTYkSE/vNkzVibPXJr4/cysVyMeeRDysVPHRZpw2
8jH+yyHBxvK8XSDTxIkh
=hz4r
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.19-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- Stable fix for a NFSv3/lockd race
- Fixes for several NFSv4.1 client id trunking bugs
- Remove an incorrect test when checking for delegated opens"
* tag 'nfs-for-3.19-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Remove incorrect check in can_open_delegated()
NFS: Ignore transport protocol when detecting server trunking
NFSv4/v4.1: Verify the client owner id during trunking detection
NFSv4: Cache the NFSv4/v4.1 client owner_id in the struct nfs_client
NFSv4.1: Fix client id trunking on Linux
LOCKD: Fix a race when initialising nlmsvc_timeout
Pull fuse fixes from Miklos Szeredi:
"This fixes a regression in the latest fuse update plus a fix for a
rather theoretical memory ordering issue"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: add memory barrier to INIT
fuse: fix LOOKUP vs INIT compat handling
Remove the function renew_client() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Remove the function nlm_encode_fh() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.
Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
commit 0efaa7e82f
locks: generic_delete_lease doesn't need a file_lock at all
moves the call to fl->fl_lmops->lm_change() to a place in the
code where fl might be a non-lease lock.
When that happens, fl_lmops is NULL and an Oops ensures.
So add an extra test to restore correct functioning.
Reported-by: Linda Walsh <suse@tlinx.org>
Link: https://bugzilla.suse.com/show_bug.cgi?id=912569
Cc: stable@vger.kernel.org (v3.18)
Fixes: 0efaa7e82f
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Sprintf format specifier "%d" and "%u" are mixed up in
gfs2_recovery_done() and freeze_show(). So correct them.
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Remove the function pulledbits() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Call mutex_destroy() on superblock mutex in udf_put_super()
otherwise mutex debugging code isn't able to detect that
mutex is used after being freed.
(thanks to Jan Kara for complete definition).
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
*Rename the function efi_guid_unparse to reflect better its behavior. - Borislav Petkov
*Update the EFI firmware Kconfig help with the URL of efibootmgr. - Peter Jones
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUrfd/AAoJECYUxVC6df4Sd7kP/RP4uzJQnQTteg2ys2Zk2sEe
rWdjxtcQurnP5P07SqNV/dZIWQx6hG/l61herFzryq9JKNye70fCd8bGl4sOPiHW
i64FJtmCF+3vgLhJK2s6Ysub/xPdC0zkRXIIHvulX1nwZRWXiPzkJPzFlkCbXaW1
O3ykSBIT4EwXAe84dN1Oio476Zyf/rS81rKakYXKwYNrfSKTmyH2FUlZODaSL+0a
oGpae/2kps6rqdlOq19J5VKcYE3CTzy+Pv3QhqVlgywWF0AIUOHrcWpsWTv4IhKr
v8blQcy7QZ/QWOEg7Nnus48/Mh0sNdAy497ZV4/tgmBfkGUcHD5qX5/Fy6Yp+/0R
YJfRv5fTDyxYp6krdEeUnOphaXZT2jkG8uRdLJxNBONWBkwf/i9QXcmwuJc0+sgF
7NKP8/kbqc3A6blQ/RNQRdeLW0hAAXfrVHj/zW8Fbi0PJTn+ALTTyvjDhMwYOr46
IVbm1CzGQLXGNDFWvxRlMbzQa5PktiiqOxymTNJ6HZwP8TnS5IE5f/iEVpj5Yd+z
zbYQcpBiN8VnQsKBmq04865/OrNgQRcUC+OU6Mz1iIIEIF8NLBDoRfE/oX0I+dEC
jc0wTszmr0uXAu6dP4tgQEaQ0oNdOE806+BZ++3bpaZ3GCScjYtDOnSdsFB5ByYp
dsAqvfPW8uY+I27SXMat
=aLDw
-----END PGP SIGNATURE-----
Merge tag 'rneri-efi-next' of https://github.com/ricardon/efi into next
Pull EFI changes for v3.20 from Ricardo Neri, who kindly picked them up
while I was out on annual leave in December.
Updates included:
*Rename the function efi_guid_unparse to reflect better its behavior. - Borislav Petkov
*Update the EFI firmware Kconfig help with the URL of efibootmgr. - Peter Jones
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Pull scheduler fixes from Ingo Molnar:
"Misc fixes: group scheduling corner case fix, two deadline scheduler
fixes, effective_load() overflow fix, nested sleep fix, 6144 CPUs
system fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()
sched/deadline: Avoid double-accounting in case of missed deadlines
sched/deadline: Fix migration of SCHED_DEADLINE tasks
sched: Fix odd values in effective_load() calculations
sched, fanotify: Deal with nested sleeps
sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation
Pull two nfsd bugfixes from Bruce Fields.
* 'for-3.19' of git://linux-nfs.org/~bfields/linux:
rpc: fix xdr_truncate_encode to handle buffer ending on page boundary
nfsd: fix fi_delegees leak when fi_had_conflict returns true
Pull two Ceph fixes from Sage Weil:
"These are both pretty trivial: a sparse warning fix and size_t printk
thing"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: fix sparse endianness warnings
ceph: use %zu for len in ceph_fill_inline_data()
Pull btrfs fixes from Chris Mason:
"None of these are huge, but my commit does fix a regression from 3.18
that could cause lost files during log replay.
This also adds Dave Sterba to the list of Btrfs maintainers. It
doesn't mean we're doing things differently, but Dave has really been
helping with the maintainer workload for years"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: don't delay inode ref updates during log replay
Btrfs: correctly get tree level in tree_backref_for_extent
Btrfs: call inode_dec_link_count() on mkdir error path
Btrfs: abort transaction if we don't find the block group
Btrfs, scrub: uninitialized variable in scrub_extent_for_parity()
Btrfs: add more maintainers
fs/f2fs/trace.c:19:12: sparse: symbol 'pids_lock' was not declared. Should it be static?
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the normal case, the radix_tree_nodes are freed successfully.
But, when cp_error was detected, we should destroy them forcefully.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We use kzalloc to allocate memory in __recover_inline_status, and use this
all-zero memory to check the inline date content of inode page by comparing
them. This is low effective and not needed, let's check inline date content
directly.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: make the code more neat]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch aligns the start block address of a file for direct io to the f2fs's
section size.
Some flash devices manage an over 4KB-sized page as a write unit, and if the
direct_io'ed data are written but not aligned to that unit, the performance can
be degraded due to the partial page copies.
Thus, since f2fs has a section that is well aligned to FTL units, we can align
the block address to the section size so that f2fs avoids this misalignment.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are two slab cache inode_entry_slab and winode_slab using the same
structure as below:
struct dir_inode_entry {
struct list_head list; /* list head */
struct inode *inode; /* vfs inode pointer */
};
struct inode_entry {
struct list_head list;
struct inode *inode;
};
It's a little waste that the two cache can not share their memory space for each
other.
So in this patch we remove one redundant winode_slab slab cache, then use more
universal name struct inode_entry as remaining data structure name of slab,
finally we reuse the inode_entry_slab to store dirty dir item and gc item for
more effective.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cleanup parameters for trace_f2fs_submit_{read_,write_,page_,page_m}bio with fio
as one parameter.
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds missing parameter _type_ for trace_f2fs_submit_page_bio, then
use DECLARE_EVENT_CLASS/DEFINE_EVENT_CONDITION pair to cleanup some trace event
code related to f2fs_submit_page_{m,}bio.
Additionally, after we remove redundant code, size of code can be reduced:
text data bss dec hex filename
176787 8712 56 185555 2d4d3 f2fs.ko.org
174408 8648 56 183112 2cb48 f2fs.ko
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In do_recover_data, we find and update previous node pages after updating
its new block addresses.
After then, we call fill_node_footer without reset field, we erase its
cold bit so that this new cold node block is written to wrong log area.
This patch fixes not to miss its old flag.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds block count by in-place-update in stat.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The __f2fs_add_link is covered by cp_rwsem all the time.
This calls init_inode_metadata, which conducts some acl operations including
memory allocation with GFP_KERNEL previously.
But, under memory pressure, f2fs_write_data_page can be called, which also
grabs cp_rwsem too.
In this case, this incurs a deadlock pointed by Chao.
Thread #1 Thread #2
down_read
down_write
down_read
-> here down_read should wait forever.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two key functions to trace process ids and IOs.
The basic idea is to
1. remain process ids, pids, in page->private.
2. show pids in IO traces.
So, later we can retrieve process information according to IO traces.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds:
o initial trace.c and trace.h with skeleton functions
o Kconfig and Makefile to activate this feature
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch cleans up parameters on IO paths.
The key idea is to use f2fs_io_info adding a parameter, block address, and then
use this structure as parameters.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use more common function ra_meta_pages() with META_POR to readahead node blocks
in restore_node_summary() instead of ra_sum_pages(), hence we can simplify the
readahead code there, and also we can remove unused function ra_sum_pages().
changes from v2:
o use invalidate_mapping_pages as before suggested by Changman Lee.
changes from v1:
o fix one bug when using truncate_inode_pages_range which is pointed out by
Jaegeuk Kim.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch moves one member of struct nat_entry: _flag_ to struct node_info,
so _version_ in struct node_info and _flag_ which are unsigned char type will
merge to one 32-bit space in register/memory. So the size of nat_entry will be
reduced from 28 bytes to 24 bytes (for 64-bit machine, reduce its size from 40
bytes to 32 bytes) and then slab memory using by f2fs will be reduced.
changes from v2:
o update description of memory usage gain for 64-bit machine suggested by
Changman Lee.
changes from v1:
o introduce inline copy_node_info() to copy valid data from node info suggested
by Jaegeuk Kim, it can avoid bug.
Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Let's add readahead code for reading contiguous compact/normal summary blocks
in checkpoint, then we will gain better performance in mount procedure.
Changes from v1
o remove inappropriate 'unlikely' in npages_for_summary_flush.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now we use inmemory pages for atomic write only and provide abort procedure,
we don't need to truncate them explicitly.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The ckpt_valid_map and cur_valid_map are synced by seg_info_to_raw_sit.
In the case of small discards, the candidates are selected before sync,
while fitrim selects candidates after sync.
So, for small discards, we need to add candidates only just being obsoleted.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds two new ioctls to release inmemory pages grabbed by atomic
writes.
o f2fs_ioc_abort_volatile_write
- If transaction was failed, all the grabbed pages and data should be written.
o f2fs_ioc_release_volatile_write
- This is to enhance the performance of PERSIST mode in sqlite.
In order to avoid huge memory consumption which causes OOM, this patch changes
volatile writes to use normal dirty pages, instead blocked flushing to the disk
as long as system does not suffer from memory pressure.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>