Pull vfs updates from Al Viro:
"This is the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.
In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
Support direct requests that span multiple pnfs data servers by
comparing nfs_pgio_header->verf to a cached verf in pnfs_commit_bucket.
Continue to use dreq->verf if the MDS is used / non-pNFS.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Add "page groups" - a circular list of nfs requests (struct nfs_page)
that all reference the same page. This gives nfs read and write paths
the ability to account for sub-page regions independently. This
somewhat follows the design of struct buffer_head's sub-page
accounting.
Only "head" requests are ever added/removed from the inode list in
the buffered write path. "head" and "sub" requests are treated the
same through the read path and the rest of the write/commit path.
Requests are given an extra reference across the life of the list.
Page groups are never rejoined after being split. If the read/write
request fails and the client falls back to another path (ie revert
to MDS in PNFS case), the already split requests are pushed through
the recoalescing code again, which may split them further and then
coalesce them into properly sized requests on the wire. Fragmentation
shouldn't be a problem with the current design, because we flush all
requests in page group when a non-contiguous request is added, so
the only time resplitting should occur is on a resend of a read or
write.
This patch lays the groundwork for sub-page splitting, but does not
actually do any splitting. For now all page groups have one request
as pg_test functions don't yet split pages. There are several related
patches that are needed to support multiple requests per page group.
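The circular page-group list described above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct req` and its fields, and the helpers `group_init()`, `group_add_after()` and `group_bytes()`, are hypothetical names and not the definitions added by this patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct nfs_page; field names are illustrative. */
struct req {
    struct req *head;       /* head request of the page group */
    struct req *next;       /* next request in the circular group list */
    unsigned int offset;    /* byte offset of this region within the page */
    unsigned int bytes;     /* length of this region */
};

/* Start a new group holding only req: the list points back at itself. */
static void group_init(struct req *req)
{
    req->head = req;
    req->next = req;
}

/* Link sub into the group right after prev, keeping the list circular. */
static void group_add_after(struct req *prev, struct req *sub)
{
    sub->head = prev->head;
    sub->next = prev->next;
    prev->next = sub;
}

/* Sum the bytes covered by the whole group, starting from any member. */
static unsigned int group_bytes(const struct req *any)
{
    const struct req *r = any;
    unsigned int total = 0;

    do {
        total += r->bytes;
        r = r->next;
    } while (r != any);
    return total;
}
```

A group with a single request is just a list of length one, which matches the "for now all page groups have one request" state described above.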
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
@inode is passed but not used.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The header had a pointer to the verifier that was set from the old write
data struct. We don't need to keep the pointer around now that we have
shared structures.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read_pageio_init method is just a very convoluted way to grab the
right nfs_pageio_ops vector. The vector to choose is not a choice of
protocol version, but just a pNFS vs MDS I/O choice that can simply be
done inside nfs_pageio_init_read based on the presence of a layout
driver, and a new force_mds flag for the special case of falling back
to MDS I/O on a pNFS-capable volume.
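The selection described above might look roughly like the following userspace sketch. All type and variable names here (`struct pageio_ops`, `struct layout_driver`, `struct server`, `select_read_ops()`) are assumptions made for illustration; they are not the kernel's structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops vector; the real one carries function pointers. */
struct pageio_ops { const char *name; };

static const struct pageio_ops mds_read_ops = { "mds" };
static const struct pageio_ops pnfs_read_ops = { "pnfs" };

/* Hypothetical layout driver that supplies its own read ops. */
struct layout_driver { const struct pageio_ops *pg_read_ops; };

/* Hypothetical server state: ld is NULL when no layout driver is set. */
struct server { const struct layout_driver *ld; };

/* Use the layout driver's ops unless there is none or MDS I/O is forced. */
static const struct pageio_ops *select_read_ops(const struct server *srv,
                                                bool force_mds)
{
    if (srv->ld != NULL && !force_mds)
        return srv->ld->pg_read_ops;
    return &mds_read_ops;
}
```

The point of the design is that the choice depends on two facts only (layout driver present, fallback requested), so a per-protocol-version method is not needed.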
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The write_pageio_init method is just a very convoluted way to grab the
right nfs_pageio_ops vector. The vector to choose is not a choice of
protocol version, but just a pNFS vs MDS I/O choice that can simply be
done inside nfs_pageio_init_write based on the presence of a layout
driver, and a new force_mds flag for the special case of falling back
to MDS I/O on a pNFS-capable volume.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Same as iov_iter_get_pages(), except that the pages array is allocated
(kmalloc if possible, vmalloc if that fails) and left for the caller to
free. Lustre and NFS ->direct_IO() are switched to it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure to properly invalidate the pagecache before performing direct I/O,
so that no stale pages are left around. This matches what the generic
direct I/O code does. Also take the i_mutex over the direct write submission
to avoid the livelock vs truncate waiting for i_dio_count to decrease, and
to avoid having the pagecache easily repopulated while direct I/O is in
progress. This again matches the generic direct I/O code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We'll need the i_mutex to prevent i_dio_count from incrementing while
truncate is waiting for it to reach zero, and to protect against having
the pagecache repopulated after we flushed it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Simple code cleanup to prepare for later fixes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Simple code cleanup to prepare for later fixes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
i_dio_count is used to protect dio access against truncate. We want
to make sure there are no dio reads pending either when doing a
truncate. I suspect on plain NFS things might work even without
this, but once we use a pnfs layout driver that accesses backing devices
directly things will go bad without the proper synchronization.
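The counting scheme described above can be sketched in userspace C with C11 atomics. This is a simplified model, not the kernel code: `struct inode_sketch`, `dio_begin()`, `dio_end()` and `truncate_may_proceed()` are hypothetical names, and the real implementation sleeps on a wait queue rather than polling.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical inode with only the in-flight direct I/O counter. */
struct inode_sketch { atomic_int dio_count; };

/* Called before a direct read or write is submitted. */
static void dio_begin(struct inode_sketch *inode)
{
    atomic_fetch_add(&inode->dio_count, 1);
}

/*
 * Called once the I/O has fully finished. Returns true when this was the
 * last request in flight, i.e. when a waiting truncate could be woken.
 */
static bool dio_end(struct inode_sketch *inode)
{
    return atomic_fetch_sub(&inode->dio_count, 1) == 1;
}

/* Truncate may only go ahead when no direct I/O is pending. */
static bool truncate_may_proceed(struct inode_sketch *inode)
{
    return atomic_load(&inode->dio_count) == 0;
}
```

Counting reads as well as writes is what keeps truncate from racing with a pending direct read in this model.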
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We need to have the I/O fully finished before telling the truncate code
that we are done.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
nfs_file_direct_write only updates the inode size if it succeeded and
returned the number of bytes written. But in the AIO case nfs_direct_wait
turns the return value into -EIOCBQUEUED and we skip the size update.
Instead the aio completion path should update it, which this patch
does. The implementation is a little hacky because there is no obvious
way to find out we are called for a write in nfs_direct_complete.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A fileid in NFS is a uint64. There are some occurrences where dprintk()
outputs a signed fileid. This leads to confusion and makes debugging
output harder to read (negative fileids matching positive inode numbers).
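The confusion is easy to reproduce outside the kernel. The sketch below uses plain `snprintf()` rather than dprintk(), and the helper names are made up for illustration. The signed result assumes the usual two's-complement conversion, which is implementation-defined in C.

```c
#include <stdint.h>
#include <stdio.h>

/* Formats the fileid the wrong way: as a signed 64-bit value. */
static int format_fileid_signed(char *buf, size_t len, uint64_t fileid)
{
    return snprintf(buf, len, "%lld", (long long)fileid);
}

/* Formats the fileid as what it is: an unsigned 64-bit value. */
static int format_fileid_unsigned(char *buf, size_t len, uint64_t fileid)
{
    return snprintf(buf, len, "%llu", (unsigned long long)fileid);
}
```

For a fileid with the top bit set, such as 18446744073709551600, the signed form prints -16 on common platforms, which no longer matches the inode number reported by other tools.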
Signed-off-by: Niels de Vos <ndevos@redhat.com>
CC: Santosh Pradhan <spradhan@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.
This change should be safe because aio_read/write are also used for
synchronous IO, and called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
The NFS DIO code will dirty pages that catch read responses in order to
handle the case where someone is doing DIO reads into an mmapped buffer.
The existing code doesn't really do the right thing though since it
doesn't take into account the case where we might be attempting to read
past the EOF.
Fix the logic in that code to only dirty pages that ended up receiving
data from the read. Note too that it really doesn't matter if
NFS_IOHDR_ERROR is set or not. All that matters is if the page was
altered by the read.
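As a rough userspace model of that rule, the sketch below walks the requests of one read in order and marks a page dirty only if at least one received byte landed in it. `struct read_req`, `dirty_filled_pages()` and the `good_bytes` parameter are hypothetical names chosen for the example, not the code in this patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page read request. */
struct read_req {
    size_t bytes;   /* bytes requested for this page */
    bool dirtied;   /* set if the page received data */
};

/*
 * good_bytes is the amount of data the read actually returned. Pages that
 * lie entirely past that point were not altered and stay clean. Returns the
 * number of pages marked dirty.
 */
static size_t dirty_filled_pages(struct read_req *reqs, size_t nreqs,
                                 size_t good_bytes)
{
    size_t dirtied = 0;

    for (size_t i = 0; i < nreqs; i++) {
        reqs[i].dirtied = good_bytes > 0;
        if (reqs[i].dirtied)
            dirtied++;
        /* Consume this page's share of the received data. */
        good_bytes -= good_bytes < reqs[i].bytes ? good_bytes : reqs[i].bytes;
    }
    return dirtied;
}
```

Note that the decision uses only the byte count, not an error flag, in line with the remark above that only alteration of the page matters.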
Cc: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eryu provided a test program that would segfault when attempting to read
past the EOF on a file that was opened O_DIRECT. The buffer given to the
read() call was on the stack, and when he attempted to read past the EOF it
would scribble over the rest of the stack page.
If we hit the end of the file on a DIO READ request, then we don't want
to zero out the rest of the buffer. These aren't pagecache pages after
all, and there's no guarantee that the buffers that were passed in
represent entire pages.
Cc: <stable@vger.kernel.org> # v3.5+
Cc: Fred Isaman <iisaman@netapp.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For buffered writes, the block layout client scans the inode mapping to
find the next hole and uses offset-to-hole as the layoutget length. The
object layout client uses offset-to-isize as the layoutget length.
For direct write, both block layout and object layout use dreq->bytes_left.
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
pnfs_within_mdsthreshold() is called inside pg_init. We need to set
read_io/write_io before that. Otherwise we fail pnfs_within_mdsthreshold()
and IO goes to MDS.
A simple test case:
dd if=foo of=/mnt/pnfs/bar bs=10M count=1 oflag=direct
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to be able to distinguish between allocation failures, and
the case where the lock context is not needed (because there are no
locks).
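One common way to make that distinction is to return NULL for "not needed" and an encoded error pointer for failure, in the style of the kernel's ERR_PTR()/IS_ERR(). The userspace sketch below reimplements that encoding under made-up `sketch_` names; `get_lock_ctx()` and its `simulate_oom` parameter are hypothetical and exist only for the example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (userspace imitation). */
static inline void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* True if p falls in the top SKETCH_MAX_ERRNO addresses, i.e. is an error. */
static inline bool sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)(-SKETCH_MAX_ERRNO);
}

/* Recover the errno value from an error pointer. */
static inline long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

struct lock_ctx { int owner; };

/*
 * NULL means "no locks, no context needed"; an error pointer means the
 * allocation failed; anything else is a valid context the caller frees.
 */
static struct lock_ctx *get_lock_ctx(bool has_locks, bool simulate_oom)
{
    struct lock_ctx *ctx;

    if (!has_locks)
        return NULL;
    ctx = simulate_oom ? NULL : malloc(sizeof(*ctx));
    if (ctx == NULL)
        return sketch_err_ptr(-ENOMEM);
    ctx->owner = 1;
    return ctx;
}
```

A caller then checks for the error case first and treats NULL as a normal, lock-free path.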
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Merge Andrew's second set of patches:
- MM
- a few random fixes
- a couple of RTC leftovers
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
rtc/rtc-88pm80x: remove unneed devm_kfree
rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
tmpfs: distribute interleave better across nodes
mm: remove redundant initialization
mm: warn if pg_data_t isn't initialized with zero
mips: zero out pg_data_t when it's allocated
memcg: gix memory accounting scalability in shrink_page_list
mm/sparse: remove index_init_lock
mm/sparse: more checks on mem_section number
mm/sparse: optimize sparse_index_alloc
memcg: add mem_cgroup_from_css() helper
memcg: further prevent OOM with too many dirty pages
memcg: prevent OOM with too many dirty pages
mm: mmu_notifier: fix freed page still mapped in secondary MMU
mm: memcg: only check anon swapin page charges for swap cache
mm: memcg: only check swap cache pages for repeated charging
mm: memcg: split swapin charge function into private and public part
mm: memcg: remove needless !mm fixup to init_mm when charging
mm: memcg: remove unneeded shmem charge type
...
Implement the new swapfile a_ops for NFS and hook up ->direct_IO. This
will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under
PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol
->connect() method.
PF_MEMALLOC should allow the allocation of struct socket and related
objects and the early (re)setting of SOCK_MEMALLOC should allow us to
receive the packets required for the TCP connection buildup.
[jlayton@redhat.com: Restore PF_MEMALLOC task flags in all cases]
[dfeng@redhat.com: Fix handling of multiple swap files]
[a.p.zijlstra@chello.nl: Original patch]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch exports symbols needed by the v4 module. In addition, I also
switch over to using IS_ENABLED() to check if CONFIG_NFS_V4 or
CONFIG_NFS_V4_MODULE are set.
The module (nfs4.ko) will be created in the same directory as nfs.ko and
will be automatically loaded the first time you try to mount over NFS v4.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch exports symbols and moves over the final structures needed by
the v3 module. In addition, I also switch over to using IS_ENABLED() to
check if CONFIG_NFS_V3 or CONFIG_NFS_V3_MODULE are set.
The module (nfs3.ko) will be created in the same directory as nfs.ko and
will be automatically loaded the first time you try to mount over NFS v3.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 57208fa7e5 "NFS: Create an write_pageio_init() function"
did not modify the calls in direct.c, preventing direct io from
using pnfs. This reintroduces that capability.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 1abb50886a "NFS: Create an read_pageio_init() function"
did not modify the call in direct.c, preventing direct io from
using pnfs. This reintroduces that capability.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix 2 bugs in nfs_direct_write_reschedule:
- The request needs to be removed from the 'reqs' list before it can
be added to 'failed'.
- Fix an infinite loop if the 'failed' list is non-empty.
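Both points can be illustrated with a small doubly linked list in userspace C. This is a sketch of the general pattern, not the patch itself: the list helpers and `move_all_to_failed()` are hypothetical, and the loop condition shown is only one way to guarantee termination.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *h)
{
    h->prev = h->next = h;
}

static bool list_is_empty(const struct list_node *h)
{
    return h->next == h;
}

/* Detach n from whatever list it is on and leave it self-linked. */
static void list_unlink(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

/* Add n at the tail of the list headed by h; n must not be on a list. */
static void list_append(struct list_node *h, struct list_node *n)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/*
 * Each entry is unlinked from reqs before it is appended to failed, so no
 * node is ever on two lists. The loop tests reqs, which shrinks on every
 * pass, so it terminates regardless of what failed already contains.
 */
static unsigned int move_all_to_failed(struct list_node *reqs,
                                       struct list_node *failed)
{
    unsigned int moved = 0;

    while (!list_is_empty(reqs)) {
        struct list_node *n = reqs->next;

        list_unlink(n);
        list_append(failed, n);
        moved++;
    }
    return moved;
}
```

Appending a node that is still linked elsewhere would corrupt both lists, which is why the unlink has to come first.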
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In nfs_direct_write_reschedule(), the requests from nfs_scan_commit_list
have a refcount of 2, whereas the operations in
nfs_direct_write_completion_ops expect them to have a refcount of 1.
This patch adds a call to release the extra references.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
Merge tag 'nfs-for-3.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
- Fix a couple of mount regressions due to the recent cleanups.
- Fix an Oops in the open recovery code
- Fix an rpc_pipefs upcall hang that results from some of the net
namespace work from 3.4.x (stable kernel candidate).
- Fix a couple of write and O_DIRECT regressions that were found at
last week's Bakeathon testing event in Ann Arbor."
* tag 'nfs-for-3.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: add an endian notation for sparse
NFSv4.1: integer overflow in decode_cb_sequence_args()
rpc_pipefs: allow rpc_purge_list to take a NULL waitq pointer
NFSv4 do not send an empty SETATTR compound
NFSv2: EOF incorrectly set on short read
NFS: Use the NFS_DEFAULT_VERSION for v2 and v3 mounts
NFS: fix directio refcount bug on commit
NFSv4: Fix unnecessary delegation returns in nfs4_do_open
NFSv4.1: Convert another trivial printk into a dprintk
NFS4: Fix open bug when pnfs module blacklisted
NFS: Remove incorrect BUG_ON in nfs_found_client
NFS: Map minor mismatch error to protocol not support error.
NFS: Fix a commit bug
NFS4: Set parsed mount data version to 4
NFSv4.1: Ensure we clear session state flags after a session creation
NFSv4.1: Convert a trivial printk into a dprintk
NFSv4: Fix up decode_attr_mdsthreshold
NFSv4: Fix an Oops in the open recovery code
NFSv4.1: Fix a request leak on the back channel
This reverts a hunk from commit 0427708657
"NFS: Clean up - Simplify reference counting in fs/nfs/direct.c"
The cleanups in that patch affect the write path, but by the time
processing hits commit the removed reference has been added back by
nfs_scan_commit_list(). Without this reversion, any page that is
sent to commit holds on to an unbalanced reference that is never
freed. The immediate effect is an imbalance over the wire between
OPENs and CLOSEs.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The new commit code fails to copy the verifier into the wb_verf field
of _all_ the nfs_page structures; it only copies it into the first entry.
The consequence is that most requests end up failing to match in
nfs_commit_release.
Fix is to copy the verifier into the req->wb_verf field in
nfs_write_completion.
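The effect of the fix can be modelled in a few lines of userspace C. The types and helpers below (`struct verifier`, `struct page_req`, `record_verifier()`, `count_matching()`) are illustrative assumptions and do not mirror the kernel structures.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 8-byte write verifier. */
struct verifier { unsigned char data[8]; };

/* Hypothetical per-page request carrying its own copy of the verifier. */
struct page_req { struct verifier wb_verf; };

/* Copy the verifier into every request, not only the first one. */
static void record_verifier(struct page_req *reqs, size_t nreqs,
                            const struct verifier *verf)
{
    for (size_t i = 0; i < nreqs; i++)
        memcpy(&reqs[i].wb_verf, verf, sizeof(*verf));
}

/* Count the requests whose stored verifier matches the commit verifier. */
static size_t count_matching(const struct page_req *reqs, size_t nreqs,
                             const struct verifier *commit_verf)
{
    size_t matches = 0;

    for (size_t i = 0; i < nreqs; i++)
        if (memcmp(&reqs[i].wb_verf, commit_verf, sizeof(*commit_verf)) == 0)
            matches++;
    return matches;
}
```

If only the first request were updated, every other request would fail the comparison at commit time, which is the symptom described above.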
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
Use the same mechanism as the block devices are using, but move the
helper functions from fs/direct-io.c into fs/inode.c to remove the
dependency on CONFIG_BLOCK.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Keep track of the number of bytes read or written via buffered, direct, and
mem-mapped i/o for use by mdsthreshold size_io hints.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Function rename to ensure that the functionality of nfs_unlock_request()
mirrors that of nfs_lock_request(). Then let nfs_unlock_and_release_request()
do the work of what used to be called nfs_unlock_request()...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
We only have two places where we need to grab a reference when trying
to lock the nfs_page. We're better off making that explicit.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
Fix the following sparse warnings:
fs/nfs/direct.c:221:6: warning: symbol 'nfs_direct_readpage_release' was
not declared. Should it be static?
fs/nfs/read.c:38:43: warning: non-ANSI function declaration of function
'nfs_readhdr_alloc'
fs/nfs/objlayout/objio_osd.c:214:5: warning: symbol '__alloc_objio_seg'
was not declared. Should it be static?
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Fix the following compile warnings:
fs/nfs/direct.c: In function 'nfs_direct_read_schedule_segment':
fs/nfs/direct.c:325:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:325:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:325:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:352:27: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c: In function 'nfs_direct_write_schedule_segment':
fs/nfs/direct.c:622:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:622:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:622:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:650:27: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
The O_DIRECT code shouldn't need to hold 2 references to each page. The
reference held by the struct nfs_page should suffice.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>