Re-arrange the logic in the NFS direct I/O path so that nfs_read/write_data
structs are allocated just before they are scheduled, rather than
allocating them all at once before we start scheduling requests.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Neil Brown observed that the kmalloc() in nfs_get_user_pages() is more
likely to fail if the I/O is large enough to require the allocation of more
than a single page to keep track of all the pinned pages in the user's
buffer.
Instead of tracking one large page array per dreq/iocb, track pages per
nfs_read/write_data, just like the cached I/O path does. An array for
pages is already allocated for us by nfs_readdata_alloc() (and the write
and commit equivalents).
This is also required for adding support for vectored I/O to the NFS direct
I/O path.
The original reason to pin the user buffer and allocate all the NFS data
structures before trying to schedule I/O was to ensure all needed resources
are allocated on the client before starting to send requests. This reduces
the chance that resource exhaustion on the client will cause a short read
or write.
On the other hand, for an application making very large application I/O
requests, this means that it will be nearly impossible for the application
to make forward progress on a resource-limited client.
Thus, moving the buffer pinning functionality into the I/O scheduling
loops should be good for scalability. The next patch will do the same for
NFS data structure allocation.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean-up and fix a minor bug: the logic was dirtying page cache pages on
both read and write operations.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Make the user_addr, user_count, and pos parameters explicit to the
scheduler routines, and remove the fields from nfs_direct_req. The
iovec API will be passing in a series of these, not just one set.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An NFSv3/v4 client must reschedule on-the-wire writes if the writes are
UNSTABLE, and the server reboots before the client can complete a
subsequent COMMIT request.
To support direct asynchronous scatter-gather writes, the write
rescheduler in fs/nfs/direct.c must not depend on the I/O parameters
in the controlling nfs_direct_req structure. iovecs can be somewhat
arbitrarily complex, so there could be an unbounded amount of information
to save for a rarely encountered requirement.
Refactor the direct write rescheduler so it uses information from each
nfs_write_data structure to reschedule writes, instead of caching that
information in the controlling nfs_direct_req structure.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factor out the logic that increments and decrements the outstanding I/O
count. This will be a commonly used bit of code in upcoming patches.
Also make this an atomic_t again, since it will be very often manipulated
outside dreq->spin lock.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
As fs/nfs/inode.c is rather large, heterogenous and unwieldy, the attached
patch splits it up into a number of files:
(*) fs/nfs/inode.c
Strictly inode specific functions.
(*) fs/nfs/super.c
Superblock management functions for NFS and NFS4, normal access, clones
and referrals. The NFS4 superblock functions _could_ move out into a
separate conditionally compiled file, but it's probably not worth it as
there're so many common bits.
(*) fs/nfs/namespace.c
Some namespace-specific functions have been moved here.
(*) fs/nfs/nfs4namespace.c
NFS4-specific namespace functions (this could be merged into the previous
file). This file is conditionally compiled.
(*) fs/nfs/internal.h
Inter-file declarations, plus a few simple utility functions moved from
fs/nfs/inode.c.
Additionally, all the in-.c-file externs have been moved here, and those
files they were moved from now includes this file.
For the most part, the functions have not been changed, only some multiplexor
functions have changed significantly.
I've also:
(*) Added some extra banner comments above some functions.
(*) Rearranged the function order within the files to be more logical and
better grouped (IMO), though someone may prefer a different order.
(*) Reduced the number of #ifdefs in .c files.
(*) Added missing __init and __exit directives.
Signed-Off-By: David Howells <dhowells@redhat.com>
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (103 commits)
SUNRPC,RPCSEC_GSS: spkm3--fix config dependencies
SUNRPC,RPCSEC_GSS: spkm3: import contexts using NID_cast5_cbc
LOCKD: Make nlmsvc_traverse_shares return void
LOCKD: nlmsvc_traverse_blocks return is unused
SUNRPC,RPCSEC_GSS: fix krb5 sequence numbers.
NFSv4: Dont list system.nfs4_acl for filesystems that don't support it.
SUNRPC,RPCSEC_GSS: remove unnecessary kmalloc of a checksum
SUNRPC: Ensure rpc_call_async() always calls tk_ops->rpc_release()
SUNRPC: Fix memory barriers for req->rq_received
NFS: Fix a race in nfs_sync_inode()
NFS: Clean up nfs_flush_list()
NFS: Fix a race with PG_private and nfs_release_page()
NFSv4: Ensure the callback daemon flushes signals
SUNRPC: Fix a 'Busy inodes' error in rpc_pipefs
NFS, NLM: Allow blocking locks to respect signals
NFS: Make nfs_fhget() return appropriate error values
NFSv4: Fix an oops in nfs4_fill_super
lockd: blocks should hold a reference to the nlm_file
NFSv4: SETCLIENTID_CONFIRM should handle NFS4ERR_DELAY/NFS4ERR_RESOURCE
NFSv4: Send the delegation stateid for SETATTR calls
...
Rewrap the overly long source code lines resulting from the previous
patch's addition of the slab cache flag SLAB_MEM_SPREAD. This patch
contains only formatting changes, and no function change.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark file system inode and similar slab caches subject to SLAB_MEM_SPREAD
memory spreading.
If a slab cache is marked SLAB_MEM_SPREAD, then anytime that a task that's
in a cpuset with the 'memory_spread_slab' option enabled goes to allocate
from such a slab cache, the allocations are spread evenly over all the
memory nodes (task->mems_allowed) allowed to that task, instead of favoring
allocation on the node local to the current cpu.
The following inode and similar caches are marked SLAB_MEM_SPREAD:
file cache
==== =====
fs/adfs/super.c adfs_inode_cache
fs/affs/super.c affs_inode_cache
fs/befs/linuxvfs.c befs_inode_cache
fs/bfs/inode.c bfs_inode_cache
fs/block_dev.c bdev_cache
fs/cifs/cifsfs.c cifs_inode_cache
fs/coda/inode.c coda_inode_cache
fs/dquot.c dquot
fs/efs/super.c efs_inode_cache
fs/ext2/super.c ext2_inode_cache
fs/ext2/xattr.c (fs/mbcache.c) ext2_xattr
fs/ext3/super.c ext3_inode_cache
fs/ext3/xattr.c (fs/mbcache.c) ext3_xattr
fs/fat/cache.c fat_cache
fs/fat/inode.c fat_inode_cache
fs/freevxfs/vxfs_super.c vxfs_inode
fs/hpfs/super.c hpfs_inode_cache
fs/isofs/inode.c isofs_inode_cache
fs/jffs/inode-v23.c jffs_fm
fs/jffs2/super.c jffs2_i
fs/jfs/super.c jfs_ip
fs/minix/inode.c minix_inode_cache
fs/ncpfs/inode.c ncp_inode_cache
fs/nfs/direct.c nfs_direct_cache
fs/nfs/inode.c nfs_inode_cache
fs/ntfs/super.c ntfs_big_inode_cache_name
fs/ntfs/super.c ntfs_inode_cache
fs/ocfs2/dlm/dlmfs.c dlmfs_inode_cache
fs/ocfs2/super.c ocfs2_inode_cache
fs/proc/inode.c proc_inode_cache
fs/qnx4/inode.c qnx4_inode_cache
fs/reiserfs/super.c reiser_inode_cache
fs/romfs/inode.c romfs_inode_cache
fs/smbfs/inode.c smb_inode_cache
fs/sysv/inode.c sysv_inode_cache
fs/udf/super.c udf_inode_cache
fs/ufs/super.c ufs_inode_cache
net/socket.c sock_inode_cache
net/sunrpc/rpc_pipe.c rpc_inode_cache
The choice of which slab caches to so mark was quite simple. I marked
those already marked SLAB_RECLAIM_ACCOUNT, except for fs/xfs, dentry_cache,
inode_cache, and buffer_head, which were marked in a previous patch. Even
though SLAB_RECLAIM_ACCOUNT is for a different purpose, it marks the same
potentially large file system i/o related slab caches as we need for memory
spreading.
Given that the rule now becomes "wherever you would have used a
SLAB_RECLAIM_ACCOUNT slab cache flag before (usually the inode cache), use
the SLAB_MEM_SPREAD flag too", this should be easy enough to maintain.
Future file system writers will just copy one of the existing file system
slab cache setups and tend to get it right without thinking.
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that we have aio writes, it is possible for dreq->outstanding to be
zero, but for the I/O not to have completed. Convert struct nfs_direct_req
to use a completion to signal when the I/O is done.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduced by NFS aio+dio patches.
Test plan:
Compile kernel with CONFIG_NFS enabled on 64-bit hardware.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The struct nfs_direct_req currently keeps a pointer to the file descriptor
without referencing it. This may cause problems if the parent process is
killed.
The nfs_open_context should normally have all the information that we're
currently using the filp for, and unlike fput(), is safe to release from
an rpciod process context.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently NFS O_DIRECT writes use FILE_SYNC so that a COMMIT is not
necessary. This simplifies the internal logic, but this could be a
difficult workload for some servers.
Instead, let's send UNSTABLE writes, and after they all complete, send a
COMMIT for the dirty range. After the COMMIT returns successfully, then do
the wake_up or fire off aio_complete().
Test plan:
Async direct I/O tests against Solaris (or any server that requires
committed unstable writes). Reboot server during test.
Based on an earlier patch by Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
^C against "iozone -I" is hitting the assertion in nfs_clear_inode().
Test plan:
"iozone -i0 -I -a -c" against a slow server, then control C. This should
not cause an oops.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Three atomic_t variables cause a lot of bus locking. Because they are all
used in the same places in the code, just use a single spin lock.
Now that the atomic_t variables are gone, we can remove the request size
limitation since the code no longer depends on the limited width of atomic_t
on some platforms.
Test plan:
Compile with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled. Millions of fsx
operations, iozone, OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up tab damage and comments. Replace "file_offset" with more commonly
used "pos".
Test plan:
Compile with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For async iocb's, the NFS direct write path now returns EIOCBQUEUED,
and calls aio_complete when all the requested writes are finished. The
synchronous part of the NFS direct write path behaves exactly as it
was before.
Shared mapped NFS files will have some coherency difficulties when
accessed concurrently with aio+dio. Will need to explore how this
is handled in the local file system case.
Test plan:
aio-stress with "-O". OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pass the iocb argument all the way down to the direct write request
scheduler, and make it available in nfs_direct_write_result.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eliminate the persistent use of automatic storage in all parts of the
NFS client's direct write path to pave the way for introducing support
for aio against files opened with the O_DIRECT flag.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Duplicate infrastructure from direct read path that will allow write
path to generate multiple write requests concurrently. This will
enable us to add support for aio in this path.
Temporarily we will lose the ability to do UNSTABLE writes followed by
a COMMIT in the direct write path. However, all applications I am
aware of that use NFS O_DIRECT currently write in relatively small
chunks, so this should not be inconvenient in any way.
Test plan:
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factor out the common piece of completing an NFS direct I/O request.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factor out a small common piece of the path that allocate nfs_direct_req
structures.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We're about to add asynchrony to the NFS direct write path. Begin by
abstracting out the common pieces in the read path.
The first piece is nfs_direct_read_wait, which works the same whether the
process is waiting for a read or a write.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
For async iocb's, the NFS direct read path should return EIOCBQUEUED and
call aio_complete when all the requested reads are finished. The
synchronous part of the NFS direct read path behaves exactly as it was
before.
Test plan:
aio-stress with "-O". OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pass the iocb argument all the way down to the direct read request
scheduler, and make it available in nfs_direct_read_result.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Eliminate the persistent use of automatic storage in all parts of the NFS
client's direct read path to pave the way for introducing support for aio
against files opened with the O_DIRECT flag.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled.
Millions of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
size_t is used for holding byte counts, so use it for variables storing rsize.
Note that the write path will be updated as we add support for async
O_DIRECT writes.
Test plan:
Need to verify that existing comparisons against new size_t variables behave
correctly.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Update to latest coding style standards. Remove block comments on
statically defined functions, and place function definitions all on
one line.
Test plan:
Compile kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFS client's a_ops->direct_IO method, nfs_direct_IO, is required to
be present to allow NFS files to be opened with O_DIRECT, but is never
called because the NFS client shunts reads and writes to files opened
with O_DIRECT directly to its own routines.
Gut the nfs_direct_IO function. This eliminates the only part of the
NFS client's direct I/O path that requires support for multi-segment
iovs, allowing further simplification in subsequent patches.
Test plan:
Compile the kernel with CONFIG_NFS and CONFIG_NFS_DIRECTIO enabled. Millions
of fsx-odirect ops. OraSim.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Same callback hierarchy inversion as for the NFS write calls. This patch is
not strictly speaking needed by the O_DIRECT code, but avoids confusing
differences between the asynchronous read and write code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Invoke the byte and event counter macros where we want to count bytes and
events.
Clean-up: fix a possible NULL dereference in nfs_lock, and simplify
nfs_file_open.
Test-plan:
fsx and iozone on UP and SMP systems, with and without pre-emption. Watch
for memory overwrite bugs, and performance loss (significantly more CPU
required per op).
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Based on an original patch by Mike O'Connor and Greg Banks of SGI.
Mike states:
A normal user can panic an NFS client and cause a local DoS with
'judicious'(?) use of O_DIRECT. Any O_DIRECT write to an NFS file where the
user buffer starts with a valid mapped page and contains an unmapped page,
will crash in this way. I haven't followed the code, but O_DIRECT reads with
similar user buffers will probably also crash albeit in different ways.
Details: when nfs_get_user_pages() calls get_user_pages(), it detects and
correctly handles get_user_pages() returning an error, which happens if the
first page covered by the user buffer's address range is unmapped. However,
if the first page is mapped but some subsequent page isn't, get_user_pages()
will return a positive number which is less than the number of pages requested
(this behaviour is sort of analagous to a short write() call and appears to be
intentional). nfs_get_user_pages() doesn't detect this and hands off the
array of pages (whose last few elements are random rubbish from the newly
allocated array memory) to it's caller, whence they go to
nfs_direct_write_seg(), which then totally ignores the nr_pages it's given,
and calculates its own idea of how many pages are in the array from the user
buffer length. Needless to say, when it comes to transmit those uninitialised
page* pointers, we see a crash in the network stack.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Most NFS server implementations allow up to 64KB reads and writes on the
wire. The Solaris NFS server allows up to a megabyte, for instance.
Now the Linux NFS client supports transfer sizes up to 1MB, too. This will
help reduce protocol and context switch overhead on read/write intensive NFS
workloads, and support larger atomic read and write operations on servers
that support them.
Test-plan:
Connectathon and iozone on mount point with wsize=rsize>32768 over TCP.
Tests with NFS over UDP to verify the maximum RPC payload size cap.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Replace ad hoc write parameter sanity checking in nfs_file_direct_write()
with a call to generic_write_checks(). This should make the proper checks
modulo the O_LARGEFILE flag, and should catch NFSv2-specific limitations by
virtue of i_sb->s_maxbytes.
Test plan:
Posix compliance testing with both NFSv2 and NFSv3.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4 model requires us to complete all RPC calls that might
establish state on the server whether or not the user wants to
interrupt it. We may also need to schedule new work (including
new RPC calls) in order to cancel the new state.
The asynchronous RPC model will allow us to ensure that RPC calls
always complete, but in order to allow for "synchronous" RPC, we
want to add the ability to wait for completion.
The waits are, of course, interruptible.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Shrink the RPC task structure. Instead of storing separate pointers
for task->tk_exit and task->tk_release, put them in a structure.
Also pass the user data pointer as a parameter instead of passing it via
task->tk_calldata. This enables us to nest callbacks.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure we call unmap_mapping_range() and sync dirty pages to disk before
doing an NFS direct write.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix some dprintk's so that NLM, NFS client, and RPC client compile
cleanly if CONFIG_SYSCTL is disabled.
Test plan:
Compile kernel with CONFIG_NFS enabled and CONFIG_SYSCTL disabled.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The following patch removes the f_error field and all checks of f_error.
Trond said:
f_error was introduced for NFS, and made sense when we were guaranteed
always to have a file pointer around when write errors occurred. Since
then, we have (for various reasons) had to introduce the nfs_open_context in
order to track the file read/write state, and it made sense to move our
f_error tracking there too.
Signed-off-by: Christoph Lameter <christoph@lameter.com>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of looking at whether or not the file is open for writes before
we accept to update the length using the server value, we should rather
be looking at whether or not we are currently caching any writes.
Failure to do so means in particular that we're not updating the file
length correctly after obtaining a POSIX or BSD lock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!