Running "./nfstest_delegation --runtest recall26" uncovers that
client doesn't recover the lock when we have an appending open,
where the initial open got a write delegation.
Instead of checking for the passed in open context against
the file lock's open context. Check that the state is the same.
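Roughly, the reclaim loop then becomes (illustrative fragment only;
the surrounding code in fs/nfs/nfs4state.c is assumed):

    /* Skip a lock only when its open context's state differs from
     * the state being recovered; matching on the context pointer
     * itself would miss locks taken through a second (appending)
     * open of the same state. */
    list_for_each_entry(fl, &flctx->flc_posix, fl_list) {
        if (nfs_file_open_context(fl->fl_file)->state != state)
            continue;
        status = ops->recover_lock(state, fl);
        if (status < 0)
            break;
    }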
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The newly introduced gss_seq_send64_fetch_and_inc() fails to build on
32-bit architectures:
arch/x86/include/asm/cmpxchg.h:128:3: error: call to '__cmpxchg_wrong_size' declared with attribute error: Bad argument size for cmpxchg
   __cmpxchg_wrong_size(); \
net/sunrpc/auth_gss/gss_krb5_seal.c:144:14: note: in expansion of macro 'cmpxchg'
  seq_send = cmpxchg(&ctx->seq_send64, old, old + 1);
             ^~~~~~~
As the message tells us, cmpxchg() cannot be used on 64-bit
arguments; that is what cmpxchg64() is for.
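The fixed helper then looks roughly like this (sketch; the krb5_ctx
definition from the original patch is assumed):

    static u64 gss_seq_send64_fetch_and_inc(struct krb5_ctx *ctx)
    {
        u64 old, seq_send = READ_ONCE(ctx->seq_send64);

        /* cmpxchg64() performs the 64-bit compare-and-swap on both
         * 32-bit and 64-bit architectures, unlike plain cmpxchg(). */
        do {
            old = seq_send;
            seq_send = cmpxchg64(&ctx->seq_send64, old, old + 1);
        } while (old != seq_send);
        return old;
    }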
Fixes: 571ed1fd23 ("SUNRPC: Replace krb5_seq_lock with a lockless scheme")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we're revalidating an existing dentry in order to open a file, we
need to check that the directory has not changed before we optimise
away the lookup.
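As a sketch, the open fast path gains a check along these lines (the
'full_reval' label is illustrative; nfs_check_verifier() is the
existing helper that compares the directory's change attribute
against the verifier cached in the dentry):

    /* Directory changed since the dentry was cached: don't optimise
     * away the lookup, fall back to a full revalidation. */
    if (!nfs_check_verifier(dir, dentry, flags & LOOKUP_RCU))
        goto full_reval;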
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Refactor the code in nfs_lookup_revalidate() as a stepping stone towards
optimising and fixing nfs4_lookup_revalidate().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We need to ensure that inode and dentry revalidation occurs correctly
on reopen of a file that is already open. Currently, in the case of
NFSv4.0, we can end up revalidating neither, due to the 'cached open'
path.
Let's fix that by ensuring that we only do cached open for the special
cases of open recovery and delegation return.
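Schematically (the helper name is made up for illustration; the claim
constants are the real NFSv4 open claim types):

    /* Allow the cached-open fast path only when recovering an open
     * or returning a delegation; an ordinary reopen must go to the
     * server so inode and dentry revalidation actually happens. */
    static bool nfs4_cached_open_allowed(enum open_claim_type4 claim)
    {
        switch (claim) {
        case NFS4_OPEN_CLAIM_PREVIOUS:          /* open recovery */
        case NFS4_OPEN_CLAIM_DELEGATE_CUR:      /* delegation return */
        case NFS4_OPEN_CLAIM_DELEG_CUR_FH:
            return true;
        default:
            return false;
        }
    }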
Reported-by: Stan Hu <stanhu@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Module removal is RCU safe by design, so we really have no need to
lock the auth_flavors[] array. Substitute a lockless scheme for
adding and removing entries in the array, and use RCU for the lookups.
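The registration side of the lockless scheme looks roughly like this
(sketch based on net/sunrpc/auth.c; readers dereference the same
array with rcu_dereference() under rcu_read_lock()):

    static const struct rpc_authops __rcu *auth_flavors[RPC_AUTH_MAXFLAVOR];

    int rpcauth_register(const struct rpc_authops *ops)
    {
        const struct rpc_authops *old;
        rpc_authflavor_t flavor = ops->au_flavor;

        if (flavor >= RPC_AUTH_MAXFLAVOR)
            return -EINVAL;
        /* cmpxchg() atomically publishes the entry; no lock needed */
        old = cmpxchg((const struct rpc_authops **)&auth_flavors[flavor],
                      NULL, ops);
        if (old == NULL || old == ops)
            return 0;
        return -EPERM;
    }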
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Now that each struct nfs_pgio_header corresponds to one RPC call, we
only have one writer to the struct nfs_pgio_header.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Save a few bytes by allowing the read/write specific fields of the
structures to share storage.
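The shape of the change is an anonymous union (field names below are
made up; the real read- and write-specific members differ):

    struct nfs_pgio_header {
        /* ... fields common to both reads and writes ... */
        union {
            struct {            /* read-specific (illustrative) */
                int read_eof;
            };
            struct {            /* write-specific (illustrative) */
                int write_stable;
            };
        };
    };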
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When the server fails to return post-op attributes, the client's
attempt to place read data directly in the page cache fails, and
so we have to do an extra copy in order to realign the data with
page borders.
This patch attempts to detect servers that don't return post-op
attributes on read (e.g. for pNFS) and adjusts the placement
calculation accordingly.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The convention in the rest of the code is to have a separate function
for anything that might be ifdef-ed out.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This is to match kernel coding style for switch statements.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Most places in the kernel tend to line up cases with the switch to
reduce indentation, so move this over to match that style.
Additionally, I handle the (status >= 0) case in the switch so that we
only "goto restart" from a single place after error handling.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Moving all of this into a new function removes the need for cramped
indentation, making the code easier to read. I also take this
chance to switch copy recovery over to using
nfs4_stateid_match_other().
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
For tightly coupled DSes, the client must ignore the provided
synthetic uid and gid, as stated in
draft-ietf-nfsv4-flex-files-19#section-5.1.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The intention of nfs4_session_set_rwsize() was to cap the r/wsize to the
buffer sizes negotiated by the CREATE_SESSION. The initial code had a
bug whereby we would not check the values negotiated by nfs_probe_fsinfo()
(the assumption being that CREATE_SESSION will always negotiate buffer values
that are sane w.r.t. the server's preferred r/wsizes) but would only check
values set by the user in the 'mount' command.
The code was changed in 4.11 to _always_ set the r/wsize, meaning that we
now never use the server preferred r/wsizes. This is the regression that
this patch fixes.
Also rename the function to nfs4_session_limit_rwsize() in order to avoid
future confusion.
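After the change, the helper only clamps values that exceed what the
session negotiated (sketch, assuming the CONFIG_NFS_V4_1 context):

    static void nfs4_session_limit_rwsize(struct nfs_server *server)
    {
    #ifdef CONFIG_NFS_V4_1
        struct nfs4_session *sess;
        u32 server_resp_sz;
        u32 server_rqst_sz;

        if (!nfs4_has_session(server->nfs_client))
            return;
        sess = server->nfs_client->cl_session;
        server_resp_sz = sess->fc_attrs.max_resp_sz - nfs41_maxread_overhead;
        server_rqst_sz = sess->fc_attrs.max_rqst_sz - nfs41_maxwrite_overhead;

        /* Clamp only when the existing values exceed what the
         * CREATE_SESSION reply permits, instead of overwriting the
         * server's preferred r/wsizes unconditionally. */
        if (server->rsize > server_resp_sz)
            server->rsize = server_resp_sz;
        if (server->wsize > server_rqst_sz)
            server->wsize = server_rqst_sz;
    #endif
    }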
Fixes: 033853325f ("NFSv4.1 respect server's max size in CREATE_SESSION")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reduce contention on the inode->i_lock by ensuring that we use RCU
when looking up the NFS open context.
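The lookup then takes the following shape (sketch; the function name
is illustrative, and get_nfs_open_context() is assumed here to take a
reference only while the count is non-zero, returning NULL otherwise
so that RCU readers never resurrect a freed context):

    static struct nfs_open_context *
    nfs_find_open_context_rcu(struct inode *inode, fmode_t mode)
    {
        struct nfs_inode *nfsi = NFS_I(inode);
        struct nfs_open_context *pos, *ctx = NULL;

        /* Walk the open context list under RCU instead of i_lock;
         * entries are assumed to be added with list_add_tail_rcu()
         * and freed with kfree_rcu(). */
        rcu_read_lock();
        list_for_each_entry_rcu(pos, &nfsi->open_files, list) {
            if ((pos->mode & (FMODE_READ | FMODE_WRITE)) != mode)
                continue;
            ctx = get_nfs_open_context(pos);
            if (ctx != NULL)
                break;
        }
        rcu_read_unlock();
        return ctx;
    }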
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Speed up lookups of an existing lock context by avoiding the inode->i_lock,
and using RCU instead.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
For the 'files' and 'flexfiles' layout types, we do not expect the reply
to be any larger than 4k. The block and scsi layout types are a little more
greedy, so we keep allocating the maximum response size for now.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add a bvec array to struct xdr_buf, and have the client allocate it
when we need to receive data into pages.
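The struct then looks roughly as follows (see
include/linux/sunrpc/xdr.h; the bvec array mirrors .pages so the
socket layer can receive directly into those pages):

    struct xdr_buf {
        struct kvec     head[1],    /* RPC header + non-page data */
                        tail[1];    /* appended after page data */
        struct bio_vec  *bvec;      /* new: maps .pages for receives */
        struct page     **pages;    /* array of contiguous pages */
        unsigned int    page_base,  /* start of page data */
                        page_len,   /* length of page data */
                        flags;
        unsigned int    buflen,     /* total length of storage buffer */
                        len;        /* length of XDR-encoded message */
    };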
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the RPC call relies on the receive call allocating pages as buffers,
then let's label it so that we
a) don't leak memory by allocating pages for requests that do not
expect this behaviour, and
b) can optimise for the common case where calls do not require allocation.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We no longer need priority semantics on the xprt->sending queue, because
the order in which tasks are sent is now dictated by their position in
the send queue.
Note that the backlog queue remains a priority queue, meaning that
slot resources are still managed in order of task priority.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fix up the priority queue to not batch by owner, but by queue, so that
we allow '1 << priority' elements to be dequeued before switching to
the next priority queue.
The owner field is still used to wake up requests in round robin order
by owner to avoid single processes hogging the RPC layer by loading the
queues.
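In schematic form (the names and helper below are hypothetical, not
the actual sched.c code):

    static struct rpc_task *rpc_queue_pick_next(struct rpc_wait_queue *queue)
    {
        struct list_head *q = &queue->tasks[queue->priority];

        /* Stay on the current priority level while its batch budget
         * (queue->nr) lasts, regardless of which owner queued the
         * task. */
        if (!list_empty(q) && queue->nr > 0) {
            queue->nr--;
            return list_first_entry(q, struct rpc_task, u.tk_wait.list);
        }
        /* Budget spent or level empty: move to the next level and
         * grant it a fresh budget of 1 << priority dequeues. */
        rpc_queue_next_level(queue);        /* hypothetical helper */
        queue->nr = 1 << queue->priority;
        q = &queue->tasks[queue->priority];
        return list_empty(q) ? NULL :
               list_first_entry(q, struct rpc_task, u.tk_wait.list);
    }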
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the server is slow, we can find ourselves with quite a lot of entries
on the receive queue. Converting the search from O(n) to O(log(n))
can make a significant difference, particularly since we have to hold
a number of locks while searching.
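A sketch of the rbtree lookup that replaces the linear walk (assuming
an rb_root receive queue in struct rpc_xprt and an rb_node embedded
in struct rpc_rqst; the field names here are illustrative):

    static struct rpc_rqst *
    xprt_request_rb_find(struct rpc_xprt *xprt, __be32 xid)
    {
        struct rb_node *n = xprt->recv_queue.rb_node;
        struct rpc_rqst *req;
        u32 key = be32_to_cpu(xid);

        /* O(log(n)) search keyed on the host-order XID */
        while (n != NULL) {
            req = rb_entry(n, struct rpc_rqst, rq_recv);
            if (key < be32_to_cpu(req->rq_xid))
                n = n->rb_left;
            else if (key > be32_to_cpu(req->rq_xid))
                n = n->rb_right;
            else
                return req;
        }
        return NULL;
    }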
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Treat socket write space handling in the same way we now treat transport
congestion: by denying the XPRT_LOCK until the transport signals that it
has free buffer space.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The theory was that we would need to grab the socket lock anyway, so we
might as well use it to gate the allocation of RPC slots for a TCP
socket.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Rather than forcing each and every RPC task to grab the socket write
lock in order to send itself, we allow whichever task is holding the
write lock to attempt to drain the entire transmit queue.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Avoid memory starvation by giving RPCs that are tagged with the
RPC_TASK_SWAPPER flag the highest priority.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Both RDMA and UDP transports require the request to get a "congestion control"
credit before they can be transmitted. Right now, this is done when
the request locks the socket. We'd like it to happen when a request attempts
to be transmitted for the first time.
In order to support retransmission of requests that already hold such
credits, we also want to ensure that they get queued first, so that we
don't deadlock with requests that have yet to obtain a credit.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
One of the intentions with the priority queues was to ensure that no
single process can hog the transport. The field task->tk_owner therefore
identifies the RPC call's origin, and is intended to allow the RPC layer
to organise queues for fairness.
This commit therefore modifies the transmit queue to group requests
by task->tk_owner, and ensures that we round robin among those groups.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Remove the checks for whether or not we need to transmit, and whether
or not a reply has been received. Those are already handled in
call_transmit() itself.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we shift to using the transmit queue, then the task that holds the
write lock will not necessarily be the same as the one being transmitted.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fix up the back channel code to recognise that the request has
already been transmitted, so that the transmit path does not need to
be called again.
Also ensure that we set req->rq_task.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>