Receive buffers are always the same size, but each Send WR has a
variable number of SGEs, based on the contents of the xdr_buf being
sent.
While assembling a Send WR, keep track of the number of SGEs so that
we don't exceed the device's maximum, or walk off the end of the
Send SGE array.
For now the Send path just fails if it exceeds the maximum.
The current logic in svc_rdma_accept bases the maximum number of
Send SGEs on the largest NFS request that can be sent or received.
In the transport layer, the limit is actually based on the
capabilities of the underlying device, not on properties of the
Upper Layer Protocol.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
free list. This eliminates the overhead of calling kmalloc / kfree,
both of which grab a globally shared lock that disables interrupts.
Introduce a replacement to svc_rdma_op_ctxt's that is built
especially for the svcrdma Send path.
Subsequent patches will take advantage of this new structure by
allocating real resources which are then cached in these objects.
The allocations are freed when the transport is torn down.
I've renamed the structure so that static type checking can be used
to ensure that uses of op_ctxt and send_ctxt are not confused. As an
additional clean up, structure fields are renamed to conform with
kernel coding conventions.
Additional clean ups:
- Handle svc_rdma_send_ctxt_get allocation failure at each call
site, rather than pre-allocating and hoping we guessed correctly
- All send_ctxt_put call-sites request page freeing, so remove
the @free_pages argument
- All send_ctxt_put call-sites unmap SGEs, so fold that into
svc_rdma_send_ctxt_put
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Since there's already a svc_rdma_op_ctxt being passed
around with the running count of mapped SGEs, drop unneeded
parameters to svc_rdma_post_send_wr().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: svc_rdma_dma_map_buf does mostly the same thing as
svc_rdma_dma_map_page, so let's fold these together.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There is a significant latency penalty when processing an ingress
Receive if the Receive buffer resides in memory that is not on the
same NUMA node as the the CPU handling completions for a CQ.
The system administrator and the device driver determine which CPU
handles completions. This CPU does not change during life of the CQ.
Further the Upper Layer does not have any visibility of which CPU it
is.
Allocating Receive buffers in the Receive completion handler
guarantees that Receive buffers are allocated on the preferred NUMA
node for that CQ.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The current Receive path uses an array of pages which are allocated
and DMA mapped when each Receive WR is posted, and then handed off
to the upper layer in rqstp::rq_arg. The page flip releases unused
pages in the rq_pages pagelist. This mechanism introduces a
significant amount of overhead.
So instead, kmalloc the Receive buffer, and leave it DMA-mapped
while the transport remains connected. This confers a number of
benefits:
* Each Receive WR requires only one receive SGE, no matter how large
the inline threshold is. This helps the server-side NFS/RDMA
transport operate on less capable RDMA devices.
* The Receive buffer is left allocated and mapped all the time. This
relieves svc_rdma_post_recv from the overhead of allocating and
DMA-mapping a fresh buffer.
* svc_rdma_wc_receive no longer has to DMA unmap the Receive buffer.
It has to DMA sync only the number of bytes that were received.
* svc_rdma_build_arg_xdr no longer has to free a page in rq_pages
for each page in the Receive buffer, making it a constant-time
function.
* The Receive buffer is now plugged directly into the rq_arg's
head[0].iov_vec, and can be larger than a page without spilling
over into rq_arg's page list. This enables simplification of
the RDMA Read path in subsequent patches.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Rather than releasing the incoming svc_rdma_recv_ctxt at the end of
svc_rdma_recvfrom, hold onto it until svc_rdma_sendto.
This permits the contents of the Receive buffer to be preserved
through svc_process and then referenced directly in sendto as it
constructs Write and Reply chunks to return to the client.
The real changes will come in subsequent patches.
Note: I cannot use ->xpo_release_rqst for this purpose because that
is called _before_ ->xpo_sendto. svc_rdma_sendto uses information in
the received Call transport header to construct the Reply transport
header, which is preserved in the RPC's Receive buffer.
The historical comment in svc_send() isn't helpful: it is already
obvious that ->xpo_release_rqst is being called before ->xpo_sendto,
but there is no explanation for this ordering going back to the
beginning of the git era.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently svc_rdma_recv_ctxt_put's callers have to know whether they
want to free the ctxt's pages or not. This means the human
developers have to know when and why to set that free_pages
argument.
Instead, the ctxt should carry that information with it so that
svc_rdma_recv_ctxt_put does the right thing no matter who is
calling.
We want to keep track of the number of pages in the Receive buffer
separately from the number of pages pulled over by RDMA Read. This
is so that the correct number of pages can be freed properly and
that number is well-documented.
So now, rc_hdr_count is the number of pages consumed by head[0]
(ie., the page index where the Read chunk should start); and
rc_page_count is always the number of pages that need to be released
when the ctxt is put.
The @free_pages argument is no longer needed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: No need to retain rq_depth in struct svcrdma_xprt, it is
used only in svc_rdma_accept().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
free list. This eliminates the overhead of calling kmalloc / kfree,
both of which grab a globally shared lock that disables interrupts.
To reduce contention further, separate the use of these objects in
the Receive and Send paths in svcrdma.
Subsequent patches will take advantage of this separation by
allocating real resources which are then cached in these objects.
The allocations are freed when the transport is torn down.
I've renamed the structure so that static type checking can be used
to ensure that uses of op_ctxt and recv_ctxt are not confused. As an
additional clean up, structure fields are renamed to conform with
kernel coding conventions.
As a final clean up, helpers related to recv_ctxt are moved closer
to the functions that use them.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This includes:
* Posting on the Send and Receive queues
* Send, Receive, Read, and Write completion
* Connect upcalls
* QP errors
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This includes:
* Transport accept and tear-down
* Decisions about using Write and Reply chunks
* Each RDMA segment that is handled
* Whenever an RDMA_ERR is sent
As a clean-up, I've standardized the order of the includes, and
removed some now redundant dprintk call sites.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Move #include <trace/events/rpcrdma.h> into source files,
similar to how it is done with trace/events/sunrpc.h.
Server-side trace points will be part of the rpcrdma subsystem,
just like the client-side trace points.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Ensure each RDMA listener and its children transports are created in
the same net namespace as the user that started the NFS service.
This is similar to how listener sockets are created in
svc_create_socket, required for enabling support for containers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
INFINIBAND_ADDR_TRANS depends on INFINIBAND. So there's no need for
options which depend INFINIBAND_ADDR_TRANS to also depend on INFINIBAND.
Remove the unnecessary INFINIBAND depends.
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Clean up: The only call site is in the same file as the function's
definition.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: There is only one remaining call site for this helper.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up. There is only one call-site for this helper, and it can be
simplified by using list_first_entry_or_null().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: These functions are no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Receive completion and Reply handling are done by a BOUND
workqueue, meaning they run on only one CPU.
Posting receives is currently done in the send_request path, which
on large systems is typically done on a different CPU than the one
handling Receive completions. This results in movement of
Receive-related cachelines between the sending and receiving CPUs.
More importantly, it means that currently Receives are posted while
the transport's write lock is held, which is unnecessary and costly.
Finally, allocation of Receive buffers is performed on-demand in
the Receive completion handler. This helps guarantee that they are
allocated on the same NUMA node as the CPU that handles Receive
completions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For clarity, report the posting and completion of Receive CQEs.
Also, the wc->byte_len field contains garbage if wc->status is
non-zero, and the vendor error field contains garbage if wc->status
is zero. For readability, don't save those fields in those cases.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This simplifies allocation of the generic RPC slot and xprtrdma
specific per-RPC resources.
It also makes xprtrdma more like the socket-based transports:
->buf_alloc and ->buf_free are now responsible only for send and
receive buffers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcrdma_buffer_get acquires an rpcrdma_req and rep for each RPC.
Currently this is done in the call_allocate action, and sometimes it
can fail if there are many outstanding RPCs.
When call_allocate fails, the RPC task is put on the delayq. It is
awoken a few milliseconds later, but there's no guarantee it will
get a buffer at that time. The RPC task can be repeatedly put back
to sleep or even starved.
The call_allocate action should rarely fail. The delayq mechanism is
not meant to deal with transport congestion.
In the current sunrpc stack, there is a friendlier way to deal with
this situation. These objects are actually tantamount to an RPC
slot (rpc_rqst) and there is a separate FSM action, distinct from
call_allocate, for allocating slot resources. This is the
call_reserve action.
When allocation fails during this action, the RPC is placed on the
transport's backlog queue. The backlog mechanism provides a stronger
guarantee that when the RPC is awoken, a buffer will be available
for it; and backlogged RPCs are awoken one-at-a-time.
To make slot resource allocation occur in the call_reserve action,
create special ->alloc_slot and ->free_slot call-outs for xprtrdma.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Refactor: xprtrdma needs to have better control over when RPCs are
awoken from the backlog queue, so replace xprt_free_slot with a
transport op callout.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
alloc_slot is a transport-specific op, but initializing an rpc_rqst
is common to all transports. In addition, the only part of initial-
izing an rpc_rqst that needs serialization is getting a fresh XID.
Move rpc_rqst initialization to common code in preparation for
adding a transport-specific alloc_slot to xprtrdma.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For FRWR, the computation of max_send_wr is split between
frwr_op_open and rpcrdma_ep_create, which makes it difficult to tell
that the max_send_wr result is currently incorrect if frwr_op_open
has to reduce the credit limit to accommodate a small max_qp_wr.
This is a problem now that extra WRs are needed for backchannel
operations and a drain CQE.
So, refactor the computation so that it is all done in ->ro_open,
and fix the FRWR version of this computation so that it
accommodates HCAs with small max_qp_wr correctly.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Set up RPC/RDMA transport in mount.nfs's network namespace. This
passes the correct namespace information to the RDMA core, similar
to how RPC sockets are created (see xs_create_sock).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rdma_resolve_addr(3) says:
> This call is used to map a given destination IP address to a
> usable RDMA address. The IP to RDMA address mapping is done
> using the local routing tables, or via ARP.
If this can't be done, there's no local device that can be used
to establish an RDMA-capable network path to the remote. In this
case, the RDMA CM very quickly posts an RDMA_CM_EVENT_ADDR_ERROR
upcall.
Currently rpcrdma_conn_upcall() converts RDMA_CM_EVENT_ADDR_ERROR
to EHOSTUNREACH. mount.nfs seems to want to retry EHOSTUNREACH
forever, thinking that this is a temporary situation. This makes
mount.nfs appear to hang if I try to mount with proto=rdma through,
say, a conventional Ethernet device.
If the admin has specified proto=rdma along with a server IP address
that requires a network path that does not support RDMA, instead
let's fail with a permanent error. -EPROTONOSUPPORT is returned when
NFSv4 or one of its minor versions is not supported.
-EPROTO is not (currently) retried by mount.nfs.
There are potentially other similar cases where -EPROTO is an
appropriate return code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The ro_release_mr methods check whether mr->mr_list is empty.
Therefore, be sure to always use list_del_init when removing an MR
linked into a list using that field. Otherwise, when recovering from
transport failures or device removal, list corruption can result, or
MRs can get mapped or unmapped an odd number of times, resulting in
IOMMU-related failures.
In general this fix is appropriate back to v4.8. However, code
changes since then make it impossible to apply this patch directly
to stable kernels. The fix would have to be applied by hand or
reworked for kernels earlier than v4.16.
Backport guidance -- there are several cases:
- When creating an MR, initialize mr_list so that using list_empty
on an as-yet-unused MR is safe.
- When an MR is being handled by the remote invalidation path,
ensure that mr_list is reinitialized when it is removed from
rl_registered.
- When an MR is being handled by rpcrdma_destroy_mrs, it is removed
from mr_all, but it may still be on an rl_registered list. In
that case, the MR needs to be removed from that list before being
released.
- Other cases are covered by using list_del_init in rpcrdma_mr_pop.
Fixes: 9d6b040978 ('xprtrdma: Place registered MWs on a ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
if we ever hit rpc_gssd_dummy_depopulate() dentry passed to
it has refcount equal to 1. __rpc_rmpipe() drops it and
dput() done after that hits an already freed dentry.
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Stable bugfixes:
- xprtrdma: Fix corner cases when handling device removal # v4.12+
- xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+
Features:
- New sunrpc tracepoint for RPC pings
- Finer grained NFSv4 attribute checking
- Don't unnecessarily return NFS v4 delegations
Other bugfixes and cleanups:
- Several other small NFSoRDMA cleanups
- Improvements to the sunrpc RTT measurements
- A few sunrpc tracepoint cleanups
- Various fixes for NFS v4 lock notifications
- Various sunrpc and NFS v4 XDR encoding cleanups
- Switch to the ida_simple API
- Fix NFSv4.1 exclusive create
- Forget acl cache after setattr operation
- Don't advance the nfs_entry readdir cookie if xdr decoding fails
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlrNG1IACgkQ18tUv7Cl
QOvotw//fQoUgQ/AOJGlZo/4ws2mGJN3dfwwKM8xYOnHaxppOYubZRHwvswK8d22
+XR/Q6IVbUxI3mJluv1L0d9CJT06s3c9CO90McIJbk4CWihGP19bNIY4JiPlzrbv
4FDiyOvMBej2UXbHX5EzKj0srxyBoEVf3iUAIa6DaHi3c6EIUo6fP3d2eRNJStqd
WMyZs+nqr2W9biyClxntT7l/Sk+o+4I7M3Oo9pjjS+PiePYdaMrL5T1kPeHaJshF
GMGXkbvVdqpDRiXX84R9+2/nuSiA15eEnaR94UNvs84oLR3qob3ZhxhudqFdSPrX
RS6E7m34gY/EaQm/wbB26PZm+3jHd4Pqm5SKLbyFfoCmG6oMwBvXNRJZas1DFaHM
CMOECvfAr6kixVLkAN0MNQ2Ku/FuJ52OLP1dRLmxsblocnhEPujc6RSz6Ju/v3a0
adbpmJMA2IoSGgXMu3g1VGnjHfMj7ZmjtpigXVvlcUqQGCL7t4ngh23cpeTQeJ76
bMwSHUQu18NbmtJjBTE+PIm7mdCrpQD7ZuOPWpK62zxLYUnnv7nm75m84DrDru7d
XAmrCmdUJNrVWQs6BAtCXgO4PZ6xNGLosb0xTQXTAQYftc+DRJ9SW/VGc0Mp1L9m
0G0iz++b8cy4Pih5UCDJcCkpjCIvHLcn72zn1kbufWqG3xr2koc=
=IlWo
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Stable bugfixes:
- xprtrdma: Fix corner cases when handling device removal # v4.12+
- xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+
Features:
- New sunrpc tracepoint for RPC pings
- Finer grained NFSv4 attribute checking
- Don't unnecessarily return NFS v4 delegations
Other bugfixes and cleanups:
- Several other small NFSoRDMA cleanups
- Improvements to the sunrpc RTT measurements
- A few sunrpc tracepoint cleanups
- Various fixes for NFS v4 lock notifications
- Various sunrpc and NFS v4 XDR encoding cleanups
- Switch to the ida_simple API
- Fix NFSv4.1 exclusive create
- Forget acl cache after setattr operation
- Don't advance the nfs_entry readdir cookie if xdr decoding fails"
* tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (47 commits)
NFS: advance nfs_entry cookie only after decoding completes successfully
NFSv3/acl: forget acl cache after setattr
NFSv4.1: Fix exclusive create
NFSv4: Declare the size up to date after it was set.
nfs: Use ida_simple API
NFSv4: Fix the nfs_inode_set_delegation() arguments
NFSv4: Clean up CB_GETATTR encoding
NFSv4: Don't ask for attributes when ACCESS is protected by a delegation
NFSv4: Add a helper to encode/decode struct timespec
NFSv4: Clean up encode_attrs
NFSv4; Clean up XDR encoding of type bitmap4
NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group
SUNRPC: Add a helper for encoding opaque data inline
SUNRPC: Add helpers for decoding opaque and string types
NFSv4: Ignore change attribute invalidations if we hold a delegation
NFS: More fine grained attribute tracking
NFS: Don't force unnecessary cache invalidation in nfs_update_inode()
NFS: Don't redirty the attribute cache in nfs_wcc_update_inode()
NFS: Don't force a revalidation of all attributes if change is missing
NFS: Convert NFS_INO_INVALID flags to unsigned long
...
Michal Kalderon has found some corner cases around device unload
with active NFS mounts that I didn't have the imagination to test
when xprtrdma device removal was added last year.
- The ULP device removal handler is responsible for deallocating
the PD. That wasn't clear to me initially, and my own testing
suggested it was not necessary, but that is incorrect.
- The transport destruction path can no longer assume that there
is a valid ID.
- When destroying a transport, ensure that ib_free_cq() is not
invoked on a CQ that was already released.
Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: bebd031866 ("xprtrdma: Support unplugging an HCA from ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This information can help track down local misconfiguration issues
as well as network partitions and unresponsive servers.
There are several ways to send a ping, and with transport multi-
plexing, the exact rpc_xprt that is used is sometimes not known by
the upper layer. The rpc_xprt pointer passed to the trace point
call also has to be RCU-safe.
I found a spot inside the client FSM where an rpc_xprt pointer is
always available and safe to use.
Suggested-by: Bill Baker <Bill.Baker@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Introduce a low-overhead mechanism to report information about
latencies of individual RPCs. The goal is to enable user space to
filter the trace record for latency outliers, or build histograms,
etc.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: struct rpc_task carries a pointer to a struct rpc_clnt,
and in fact task->tk_client is always what is passed into trace
points that are already passing @task.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If recording xprt->stat.max_slots is moved into xprt_alloc_slot,
then xprt->num_reqs is never manipulated outside
xprt->reserve_lock. There's no longer a need for xprt->num_reqs to
be atomic.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Some RPC transports have more overhead in their send_request
callouts than others. For example, for RPC-over-RDMA:
- Marshaling an RPC often has to DMA map the RPC arguments
- Registration methods perform memory registration as part of
marshaling
To capture just server and network latencies more precisely: when
sending a Call, capture the rq_xtime timestamp _after_ the transport
header has been marshaled.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Some RPC transports have more overhead in their reply handlers
than others. For example, for RPC-over-RDMA:
- RPC completion has to wait for memory invalidation, which is
not a part of the server/network round trip
- Recently a context switch was introduced into the reply handler,
which further artificially inflates the measure of RPC RTT
To capture just server and network latencies more precisely: when
receiving a reply, compute the RTT as soon as the XID is recognized
rather than at RPC completion time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Since commit 33849792cb ("xprtrdma: Detect unreachable NFS/RDMA
servers more reliably"), the xprtrdma transport now has a ->timer
callout. But xprtrdma does not need to compute RTT data, only UDP
needs that. Move the xprt_update_rtt call into the UDP transport
implementation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Refactor: Both rpcrdma_create_req call sites have to allocate the
buffer where the transport header is built, so just move that
allocation into rpcrdma_create_req.
This buffer is a fixed size. There's no needed information available
in call_allocate that is not also available when the transport is
created.
The original purpose for allocating these buffers on demand was to
reduce the possibility that an allocation failure during transport
creation will hork the mount operation during low memory scenarios.
Some relief for this rare possibility is coming up in the next few
patches.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
With FRWR, the client transport can perform memory registration and
post a Send with just a single ib_post_send.
This reduces contention between the send_request path and the Send
Completion handlers, and reduces the overhead of registering a chunk
that has multiple segments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
RPC-over-RDMA version 1 credit accounting relies on there being a
response message for every RPC Call. This means that RPC procedures
that have no reply will disrupt credit accounting, just in the same
way as a retransmit would (since it is sent because no reply has
arrived). Deal with the "no reply" case the same way.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Create fewer MRs on average. Many workloads don't need as many as
32 MRs, and the transport can now quickly restock the MR free list.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently, when the MR free list is exhausted during marshaling, the
RPC/RDMA transport places the RPC task on the delayq, which forces a
wait for HZ >> 2 before the marshal and send is retried.
With this change, the transport now places such an RPC task on the
pending queue, and wakes it just as soon as more MRs have been
created. Creating more MRs typically takes less than a millisecond,
and this waking mechanism is less deadlock-prone.
Moreover, the waiting RPC task is holding the transport's write
lock, which blocks the transport from sending RPCs. Therefore faster
recovery from MR exhaustion is desirable.
This is the same mechanism that the TCP transport utilizes when
handling write buffer space exhaustion.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: The generic rq_connect_cookie is sufficient to detect RPC
Call retransmission.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: We need to check only that the value does not exceed the
range of the u8 field it's going into.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
With v4.15, on one of my NFS/RDMA clients I measured a nearly
doubling in the latency of small read and write system calls. There
was no change in server round trip time. The extra latency appears
in the whole RPC execution path.
"git bisect" settled on commit ccede75985 ("xprtrdma: Spread reply
processing over more CPUs") .
After some experimentation, I found that leaving the WQ bound and
allowing the scheduler to pick the dispatch CPU seems to eliminate
the long latencies, and it does not introduce any new regressions.
The fix is implemented by reverting only the part of
commit ccede75985 ("xprtrdma: Spread reply processing over more
CPUs") that dispatches RPC replies specifically on the CPU where the
matching RPC call was made.
Interestingly, saving the CPU number and later queuing reply
processing there was effective _only_ for a NFS READ and WRITE
request. On my NUMA client, in-kernel RPC reply processing for
asynchronous RPCs was dispatched on the same CPU where the RPC call
was made, as expected. However synchronous RPCs seem to get their
reply dispatched on some other CPU than where the call was placed,
every time.
Fixes: ccede75985 ("xprtrdma: Spread reply processing over ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
server xdr decoding (with an eye towards eliminating a data copy in the
RDMA case).
I did some refactoring of the delegation code in preparation for
eliminating some delegation self-conflicts and implementing write
delegations.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaxi5LAAoJECebzXlCjuG+deAQAL9NHsv6bIydkE6wX305c/bR
gm73yryF1kfOuHmiLq15mrljiKyCEbRSPqzzM8k2TkywHhyvOLEjxHbhZnwDyQec
DaZUzWLKNkK64UFXEvTKyNwfGObwsGQ+QLkV7N9mF3Ps9M9/u2vMHKQypvA9hJ7z
DGN7MO7Ud7N0Viu03vp4m+p7gypoWGFj6Sh1QAkR/7TE/supcS+qqOWU4vLpYFhu
/l2gJym59FWqHajwqs0Qu9LpHfsEx5HySZbj7GczbGRMka3y/AnjgnngcriP+63B
ZcPpqSdD4Yeq1OJklU5Wicy+u54rFkA9VE1EArrC9RAEwav6iMhhhaESpRH2JFJE
SO7cgUSCb2+65XgeSBfDygn+09PcN+eRF3sxXkQpHKsozQOH+qdyr7F4/ePunwi8
Ah7pIkczRUrj7gMmlNOg97wpHbffO4YnpRESA934qf7MMHRQwDsEkl512kFAyadZ
g1DI3iByfUpBQvRWJSLasyjyWUqRZDMmyO3yi3i/08sMI3XE1IOWpNkJAooNYC1X
1FCDn1VXlTdcmC8yw6Da1L05PCVf25tSjpYQZ6r25KtrVi9iBOGV741vmyVMvDEw
OwUuVatRd+AY+YJ6iUraNH4SlnHWZ/qKDFJECWbrG/4uUHyo8FIXGAgL8RgtbS8+
fQeRjaDyhHD0XH7TJ10x
=fJI1
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.17' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Chuck Lever did a bunch of work on nfsd tracepoints, on RDMA, and on
server xdr decoding (with an eye towards eliminating a data copy in
the RDMA case).
I did some refactoring of the delegation code in preparation for
eliminating some delegation self-conflicts and implementing write
delegations"
* tag 'nfsd-4.17' of git://linux-nfs.org/~bfields/linux: (40 commits)
nfsd: fix incorrect umasks
sunrpc: remove incorrect HMAC request initialization
NFSD: Clean up legacy NFS SYMLINK argument XDR decoders
NFSD: Clean up legacy NFS WRITE argument XDR decoders
nfsd: Trace NFSv4 COMPOUND execution
nfsd: Add I/O trace points in the NFSv4 read proc
nfsd: Add I/O trace points in the NFSv4 write path
nfsd: Add "nfsd_" to trace point names
nfsd: Record request byte count, not count of vectors
nfsd: Fix NFSD trace points
svc: Report xprt dequeue latency
sunrpc: Report per-RPC execution stats
sunrpc: Re-purpose trace_svc_process
sunrpc: Save remote presentation address in svc_xprt for trace events
sunrpc: Simplify trace_svc_recv
sunrpc: Simplify do_enqueue tracing
sunrpc: Move trace_svc_xprt_dequeue()
sunrpc: Update show_svc_xprt_flags() to include recently added flags
svc: Simplify ->xpo_secure_port
sunrpc: Remove unneeded pointer dereference
...
make_checksum_hmac_md5() is allocating an HMAC transform and doing
crypto API calls in the following order:
crypto_ahash_init()
crypto_ahash_setkey()
crypto_ahash_digest()
This is wrong because it makes no sense to init() the request before a
key has been set, given that the initial state depends on the key. And
digest() is short for init() + update() + final(), so in this case
there's no need to explicitly call init() at all.
Before commit 9fa68f6200 ("crypto: hash - prevent using keyed hashes
without setting key") the extra init() had no real effect, at least for
the software HMAC implementation. (There are also hardware drivers that
implement HMAC-MD5, and it's not immediately obvious how gracefully they
handle init() before setkey().) But now the crypto API detects this
incorrect initialization and returns -ENOKEY. This is breaking NFS
mounts in some cases.
Fix it by removing the incorrect call to crypto_ahash_init().
Reported-by: Michael Young <m.a.young@durham.ac.uk>
Fixes: 9fa68f6200 ("crypto: hash - prevent using keyed hashes without setting key")
Fixes: fffdaef2eb ("gss_krb5: Add support for rc4-hmac encryption")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Move common code in NFSD's legacy SYMLINK decoders into a helper.
The immediate benefits include:
- one fewer data copies on transports that support DDP
- consistent error checking across all versions
- reduction of code duplication
- support for both legal forms of SYMLINK requests on RDMA
transports for all versions of NFS (in particular, NFSv2, for
completeness)
In the long term, this helper is an appropriate spot to perform a
per-transport call-out to fill the pathname argument using, say,
RDMA Reads.
Filling the pathname in the proc function also means that eventually
the incoming filehandle can be interpreted so that filesystem-
specific memory can be allocated as a sink for the pathname
argument, rather than using anonymous pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Move common code in NFSD's legacy NFS WRITE decoders into a helper.
The immediate benefit is reduction of code duplication and some nice
micro-optimizations (see below).
In the long term, this helper can perform a per-transport call-out
to fill the rq_vec (say, using RDMA Reads).
The legacy WRITE decoders and procs are changed to work like NFSv4,
which constructs the rq_vec just before it is about to call
vfs_writev.
Why? Calling a transport call-out from the proc instead of the XDR
decoder means that the incoming FH can be resolved to a particular
filesystem and file. This would allow pages from the backing file to
be presented to the transport to be filled, rather than presenting
anonymous pages and copying or flipping them into the file's page
cache later.
I also prefer using the pages in rq_arg.pages, instead of pulling
the data pages directly out of the rqstp::rq_pages array. This is
currently the way the NFSv3 write decoder works, but the other two
do not seem to take this approach. Fixing this removes the only
reference to rq_pages found in NFSD, eliminating an NFSD assumption
about how transports use the pages in rq_pages.
Lastly, avoid setting up the first element of rq_vec as a zero-
length buffer. This happens with an RDMA transport when a normal
Read chunk is present because the data payload is in rq_arg's
page list (none of it is in the head buffer).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Record the time between when a rqstp is enqueued on a transport
and when it is dequeued. This includes how long the rqstp waits on
the queue and how long it takes the kernel scheduler to wake a
nfsd thread to service it.
The svc_xprt_dequeue trace point is altered to include the number
of microseconds between xprt_enqueue and xprt_dequeue.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Introduce a mechanism to report the server-side execution latency of
each RPC. The goal is to enable user space to filter the trace
record for latency outliers, build histograms, etc.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently, trace_svc_process has two call sites:
1. Just after a call to svc_send. svc_send already invokes
trace_svc_send with the same arguments just before returning
2. Just before a call to svc_drop. svc_drop already invokes
trace_svc_drop with the same arguments just after it is called
Therefore trace_svc_process does not provide any additional
information not already provided by these other trace points.
However, it would be useful to record the incoming RPC procedure.
So reuse trace_svc_process for this purpose.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
TP_printk defines a format string that is passed to user space for
converting raw trace event records to something human-readable.
My user space's printf (Oracle Linux 7), however, does not have a
%pI format specifier. The result is that what is supposed to be an
IP address in the output of "trace-cmd report" is just a string that
says the field couldn't be displayed.
To fix this, adopt the same approach as the client: maintain a pre-
formated presentation address for occasions when %pI is not
available.
The location of the trace_svc_send trace point is adjusted so that
rqst->rq_xprt is not NULL when the trace event is recorded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There doesn't seem to be a lot of value in calling trace_svc_recv
in the failing case.
1. There are two very common cases: one is the transport is not
ready, and the other is shutdown. Neither is terribly interesting.
2. The trace record for the failing case contains nothing but
the status code.
Therefore the trace point call site in the error exit is removed.
Since the trace point is now recording a length instead of a
status, rename the status field and remove the case that records a
zero XID.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
There are three cases where svc_xprt_do_enqueue() returns without
waking an nfsd thread:
1. There is no work to do
2. The transport is already busy
3. There are no available nfsd threads
Only 3. is truly interesting. Move the trace point so it records
that there was work to do and either an nfsd thread was awoken, or
a free one could not found.
As an additional clean up, remove a redundant comment and a couple
of dprintk call sites.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Reduce the amount of noise generated by trace_svc_xprt_dequeue by
moving it to the end of svc_get_next_xprt. This generates exactly
one trace event when a ready xprt is found, rather than spurious
events when there is no work to do. The empty events contain no
information that can't be obtained simply by tracing function calls
to svc_xprt_dequeue.
A small additional benefit is simplification of the svc_xprt_event
trace class, which no longer has to handle the case when the @xprt
parameter is NULL.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Instead of returning a value that is used to set or clear
a bit, just make ->xpo_secure_port mangle that bit, and return void.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Noticed during code inspection that there is already a
local automatic variable "xprt" so dereferencing rqst->rq_xprt
again is unnecessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another caches. So, they also
can be async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations initialize and destroy sunrpc_net_id
refered per-net items. Only used global list is cache_list,
and accesses already serialized.
sunrpc_destroy_cache_detail() check for list_empty() without
cache_list_lock, but when it's called from unregister_pernet_subsys(),
there can't be callers in parallel, so we won't miss list_empty()
in this case.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prefer the direct use of octal for permissions.
Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.
Miscellanea:
o Whitespace neatening around these conversions.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up: The value of the byte_count parameter is already passed
to rdma_build_arg_xdr as part of the svc_rdma_op_ctxt structure.
Further, without the parameter called "byte_count" there is no need
to have the abbreviated "bc" automatic variable. "bc" can now be
called something more intuitive.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The target needs to return the lesser of the client's Inbound RDMA
Read Queue Depth (IRD), provided in the connection parameters, and
the local device's Outbound RDMA Read Queue Depth (ORD). The latter
limit is max_qp_init_rd_atom, not max_qp_rd_atom.
The svcrdma_ord value caps the ORD value for iWARP transports, which
do not exchange ORD/IRD values at connection time. Since no other
Linux kernel RDMA-enabled storage target sees fit to provide this
cap, I'm removing it here too.
initiator_depth is a u8, so ensure the computed ORD value does not
overflow that field.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Other completion handlers use pr_err, not pr_warn.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The interface for flushing the sunrpc auth cache was poorly
designed and has caused problems a number of times.
The design is that you write a timestamp, and all entries
created before that time are discarded.
The most obvious problem is that this is not what people
actually want. They want to just flush the whole cache.
The 1-second granularity can be a problem, as can the use
of wall-clock time.
A current problem is that code will write the current time to
this file - expecting it to clear everything - and if the
seconds number ticks over before this timestamp is checked,
the test "then >= now" fails, and a full flush isn't forced.
So lets just drop the subtleties and always flush the whole
cache. The worst this could do is impose an extra cost
refilling it, but that would require someone to be using
non-standard tools.
We still report an error if the string written is not a number,
but we cause any valid number to flush the whole cache.
Reported-by: "Wang, Alan 1. (NSB - CN/Hangzhou)" <alan.1.wang@nokia-sbell.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Fix unaligned access in gss_{get,verify}_mic_v2() on sparc64
Signed-off-by: James Ettle <james@ettle.org.uk>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Changes since v1:
Added changes in these files:
drivers/infiniband/hw/usnic/usnic_transport.c
drivers/staging/lustre/lnet/lnet/lib-socket.c
drivers/target/iscsi/iscsi_target_login.c
drivers/vhost/net.c
fs/dlm/lowcomms.c
fs/ocfs2/cluster/tcp.c
security/tomoyo/network.c
Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.
"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.
None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.
This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.
Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.
rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.
Userspace API is not changed.
text data bss dec hex filename
30108430 2633624 873672 33615726 200ef6e vmlinux.before.o
30108109 2633612 873672 33615393 200ee21 vmlinux.o
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hightlights include:
Stable fixes:
- Fix an incorrect calculation of the RDMA send scatter gather element limit
- Fix an Oops when attempting to free resources after RDMA device removal
Bugfixes:
- SUNRPC: Ensure we always release the TCP socket in a timely fashion when
the connection is shut down.
- SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible context
Latency/Performance:
- SUNRPC: Queue latency sensitive socket tasks to the less contended
xprtiod queue
- SUNRPC: Make the xprtiod workqueue unbounded.
- SUNRPC: Make the rpciod workqueue unbounded
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJafdI4AAoJEGcL54qWCgDyMQ0P/3sHdDzjQ+WKIWLhJX6Ol9kx
yc74z92Shh56wNvSqssaBQDYSN/2mjtgvmzEDs5KFqsBSRu5aF2JRUAT+QmQ03MT
tD+Q7FzDJUh/nvSjwGFCI2ZIJ4Vk0cPQJLc47+pT7QQN7eAEm+6zldWZ4cEUDLUQ
+9LWpefWlCFVv2WivTg/9KNl7HR5zHnZQIIDqAiJ33zYnT72J1lMgp50GEwaYNd6
7IpPFsGq+sa7nVH3R32SbLYBzdHZxtMStJJE7egN+Evyr1k6tLSnJq3ke8GkQAHI
WyV9lh8kTGEpea/QL0v0WkWeE3sHFg2RgEdxKrnhmUMOGlaOL1QGhLs9ELT1ny3w
/H6E18WXMMmIat/XoYW0FNn0neSI2ilk4XC88l8+uMz/x6ZLsOG7Xpm2plFL6Tu7
UNb0436dL8ZCOd9ipjs5otVMRVGbf3AA8P8PqwJLXGEErzPGnscw/sVNiM5n7EEs
YyQIlCshaQZv1kXkQrgsawa8cf9gDG0xa1JkZ3m6vYBEtKLPxqLlYBCK2Az6P8BB
rTtswPfDDF1uQPRLz7Bs8AHr2zUiyg0LbYldL2YvpE023Cdk5Y2NMs1d6FzZeVjS
lu1p2oIbDgcWV9V4UV3Ml7NYZcns3oHy2sYNdgz1t5/ZSmp6agEvKoa7rWOhR4Lp
fGt595Jh2sCEUELPE+gK
=KST7
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull more NFS client updates from Trond Myklebust:
"A few bugfixes and some small sunrpc latency/performance improvements
before the merge window closes:
Stable fixes:
- fix an incorrect calculation of the RDMA send scatter gather
element limit
- fix an Oops when attempting to free resources after RDMA device
removal
Bugfixes:
- SUNRPC: Ensure we always release the TCP socket in a timely fashion
when the connection is shut down.
- SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible context
Latency/Performance:
- SUNRPC: Queue latency sensitive socket tasks to the less contended
xprtiod queue
- SUNRPC: Make the xprtiod workqueue unbounded.
- SUNRPC: Make the rpciod workqueue unbounded"
* tag 'nfs-for-4.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible context
fix parallelism for rpc tasks
Make the xprtiod workqueue unbounded.
SUNRPC: Queue latency-sensitive socket tasks to xprtiod
SUNRPC: Ensure we always close the socket after a connection shuts down
xprtrdma: Fix BUG after a device removal
xprtrdma: Fix calculation of ri_max_send_sges
Calling __UDPX_INC_STATS() from a preemptible context leads to a
warning of the form:
BUG: using __this_cpu_add() in preemptible [00000000] code: kworker/u5:0/31
caller is xs_udp_data_receive_workfn+0x194/0x270
CPU: 1 PID: 31 Comm: kworker/u5:0 Not tainted 4.15.0-rc8-00076-g90ea9f1 #2
Workqueue: xprtiod xs_udp_data_receive_workfn
Call Trace:
dump_stack+0x85/0xc1
check_preemption_disabled+0xce/0xe0
xs_udp_data_receive_workfn+0x194/0x270
process_one_work+0x318/0x620
worker_thread+0x20a/0x390
? process_one_work+0x620/0x620
kthread+0x120/0x130
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x24/0x30
Since we're taking a spinlock in those functions anyway, let's fix the
issue by moving the call so that it occurs under the spinlock.
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
But it's also a fairly small update this time around. Some cleanup,
RDMA fixes, overlayfs fixes, and a fix for an NFSv4 state bug.
The bigger deal for nfsd this time around is Jeff Layton's
already-merged i_version patches. This series has a minor conflict with
that one, and the resolution should be obvious. (Stephen Rothwell has
been carrying it in linux-next for what it's worth.)
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJafNVvAAoJECebzXlCjuG+yZUP/2SctFtkW638z9frLcIVt5M6
x5hluw5jtFrVqq/KoMwi7rVaMzhdvcgwwfaLciqrPCOmcMKlOqiWslyCV0wZVCZS
jabkOeinKVAyPTlESesNyArWKBWaB8QaYDwbkQ5Y76U9Ma5gwSghS1wc8vrNduZY
2StieESOiOs9LljXf5SqCC5nN9s7gs4qtCK7aZ3JIt4661Lh39LqyO5zxLnc78eL
USnJKHjTSreY2Vd1/TdNWyZhiim43wdrB+jpy6IoocTqyhYalkCz1iYdJn1arqtP
iIddPpczKxkHekFVj7/Kfa+ATFtdXIpivOBhhOT0oY8HukTd58bh/oUMrFt4BSuP
MQst0R9h1sanBE18XBPlXuIK51sm3AjjOGaQycl/Mzes+dMRgIP/KspAcnwwXHqG
gyZsF3VzliFTc9s0SyiAz2AxNTUnjd+LV3E0DUeivURa6V3pc+sFlQzi8PRxRaep
0gmhYcZsfwdDKZ/kbQyQdSWN48NxOLFke4fYjmoUtoyILa0NAHEqafeJkR5EiRTm
tZsL9H/3THEGWygYlXGGBo/J4w5jE3uL/8KkfeuZefzSo0Ujqu0pBALMTnGFLKRx
Mpw7JEqfUwqIVZ0Qh6q9yIcjr89qWv96UpBqRRIkFX5zOPN7B1BH8C89g8qy3Hyt
gm/5BTw4FPE0uAM9Nhsd
=icEX
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd update from Bruce Fields:
"A fairly small update this time around. Some cleanup, RDMA fixes,
overlayfs fixes, and a fix for an NFSv4 state bug.
The bigger deal for nfsd this time around was Jeff Layton's
already-merged i_version patches"
* tag 'nfsd-4.16' of git://linux-nfs.org/~bfields/linux:
svcrdma: Fix Read chunk round-up
NFSD: hide unused svcxdr_dupstr()
nfsd: store stat times in fill_pre_wcc() instead of inode times
nfsd: encode stat->mtime for getattr instead of inode->i_mtime
nfsd: return RESOURCE not GARBAGE_ARGS on too many ops
nfsd4: don't set lock stateid's sc_type to CLOSED
nfsd: Detect unhashed stids in nfsd4_verify_open_stid()
sunrpc: remove dead code in svc_sock_setbufsize
svcrdma: Post Receives in the Receive completion handler
nfsd4: permit layoutget of executable-only files
lockd: convert nlm_rqst.a_count from atomic_t to refcount_t
lockd: convert nlm_lockowner.count from atomic_t to refcount_t
lockd: convert nsm_handle.sm_count from atomic_t to refcount_t
Hi folks,
On a multi-core machine, is it expected that we can have parallel RPCs
handled by each of the per-core workqueue?
In testing a read workload, observing via "top" command that a single
"kworker" thread is running servicing the requests (no parallelism).
It's more prominent while doing these operations over krb5p mount.
What has been suggested by Bruce is to try this and in my testing I
see then the read workload spread among all the kworker threads.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A single NFSv4 WRITE compound can often have three operations:
PUTFH, WRITE, then GETATTR.
When the WRITE payload is sent in a Read chunk, the client places
the GETATTR in the inline part of the RPC/RDMA message, just after
the WRITE operation (sans payload). The position value in the Read
chunk enables the receiver to insert the Read chunk at the correct
place in the received XDR stream; that is between the WRITE and
GETATTR.
According to RFC 8166, an NFS/RDMA client does not have to add XDR
round-up to the Read chunk that carries the WRITE payload. The
receiver adds XDR round-up padding if it is absent and the
receiver's XDR decoder requires it to be present.
Commit 193bcb7b37 ("svcrdma: Populate tail iovec when receiving")
attempted to add support for receiving such a compound so that just
the WRITE payload appears in rq_arg's page list, and the trailing
GETATTR is placed in rq_arg's tail iovec. (TCP just strings the
whole compound into the head iovec and page list, without regard
to the alignment of the WRITE payload).
The server transport logic also had to accommodate the optional XDR
round-up of the Read chunk, which it did simply by lengthening the
tail iovec when round-up was needed. This approach is adequate for
the NFSv2 and NFSv3 WRITE decoders.
Unfortunately it is not sufficient for nfsd4_decode_write. When the
Read chunk length is a couple of bytes less than PAGE_SIZE, the
computation at the end of nfsd4_decode_write allows argp->pagelen to
go negative, which breaks the logic in read_buf that looks for the
tail iovec.
The result is that a WRITE operation whose payload length is just
less than a multiple of a page succeeds, but the subsequent GETATTR
in the same compound fails with NFS4ERR_OP_ILLEGAL because the XDR
decoder can't find it. Clients ignore the error, but they must
update their attribute cache via a separate round trip.
As nfsd4_decode_write appears to expect the payload itself to always
have appropriate XDR round-up, have svc_rdma_build_normal_read_chunk
add the Read chunk XDR round-up to the page_len rather than
lengthening the tail iovec.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 193bcb7b37 ("svcrdma: Populate tail iovec when receiving")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The response to a write_space notification is very latency sensitive,
so we should queue it to the lower latency xprtiod_workqueue. This
is something we already do for the other cases where an rpc task
holds the transport XPRT_LOCKED bitlock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Stable fixes:
- Fix calculating ri_max_send_sges, which can oops if max_sge is too small
- Fix a BUG after device removal if freed resources haven't been allocated yet
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlp6D5MACgkQ18tUv7Cl
QOt2IhAA3lY89PxSWhQRpQbxCAvnCATL2IjjT8z3v8eLJrNrkJ0X7mOUIP47dQHb
cvF/XNfemQssAqd9lWuKCRzx9vFbBrUmavzGyYXQBBULGZr6d9ARapIJU0pUJpv3
Mf/tGonj8FJSLT84dbbYpuU5xRUCS/WM53N0mYPUGOfWoOgL2vQQOHqgUDPLRjyx
T+2IHzCuyU0fp/QpSRCyg9OvScY9BOp0P0MER1UTH5+PGI5ApzBSd19Go3/gupj2
H8kuHsI800EWJbUynDFwDwsh+YDwDIs+WVkfEkD27LHhqT7L0W7RusV9xSzguoaJ
CK3g8L4GsqIKASQQCgqRJ1hcUbaHTHgqOvejIH8CBlH4YWMXanzT2aRh1AdDwERY
07n1zsJSyFZ4YMDRg786snWK/b3udeGv/eKS5/jEMxhNNNaTjPyceQHt3ZTNw/KO
/TYSvxYJWvaw/e4UoqcL7wb2rrr78uGTdLRDUcbtNllYk6hla4bUQkDp4D/rdoWk
vQoNpEvMty7r2+GosKgfZru/bOfZpAi8AXqyGPZPjz4LC/woYaro3HhvA4BLRRfO
NDsCSdejLO5iIssOvKfz99nxqD9VGkYF3rjxUlZ4ksQw+kdJ3IU+As5NLCdWd2xW
jCELPCXE2zlps6RjowJSykKqmtD0oGzdP3vVCiU4IbmVYF4LLpw=
=eXee
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-4.16-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
NFS-over-RDMA client fixes for Linux 4.16 #2
Stable fixes:
- Fix calculating ri_max_send_sges, which can oops if max_sge is too small
- Fix a BUG after device removal if freed resources haven't been allocated yet
Ensure that we release the TCP socket once it is in the TCP_CLOSE or
TCP_TIME_WAIT state (and only then) so that we don't confuse rkhunter
and its ilk.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Setting values in struct sock directly is the usual method. Remove
the long dead code using set_fs() and the related comment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Michal Kalderon reports a BUG that occurs just after device removal:
[ 169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049
[ 169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma]
The RPC/RDMA client transport attempts to allocate some resources
on demand. Registered buffers are one such resource. These are
allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call
and Reply messages. A hardware resource is associated with each of
these buffers, as they can be used for a Send or Receive Work
Request.
If a device is removed from under an NFS/RDMA mount, the transport
layer is responsible for releasing all hardware resources before
the device can be finally unplugged. A BUG results when the NFS
mount hasn't yet seen much activity: the transport tries to release
resources that haven't yet been allocated.
rpcrdma_free_regbuf() already checks for this case, so just move
that check to cover the DEVICE_REMOVAL case as well.
Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: bebd031866 ("xprtrdma: Support unplugging an HCA ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 16f906d66c ("xprtrdma: Reduce required number of send
SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes
a problem where xprtrdma would not work if the device's max_sge
capability was small (low single digits).
At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
each RPC. ri_max_send_sges is set to this value:
ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;
Then when marshaling each RPC, rpcrdma_args_inline uses that value
to determine whether the device has enough Send SGEs to convey an
NFS WRITE payload inline, or whether instead a Read chunk is
required.
More recently, commit ae72950abf ("xprtrdma: Add data structure to
manage RDMA Send arguments") used the ri_max_send_sges value to
calculate the size of an array, but that commit erroneously assumed
ri_max_send_sges contains a value similar to the device's max_sge,
and not one that was reduced by the minimum SGE count.
This assumption results in the calculated size of the sendctx's
Send SGE array to be too small. When the array is used to marshal
an RPC, the code can write Send SGEs into the following sendctx
element in that array, corrupting it. When the device's max_sge is
large, this issue is entirely harmless; but it results in an oops
in the provider's post_send method, if dev.attrs.max_sge is small.
So let's straighten this out: ri_max_send_sges will now contain a
value with the same meaning as dev.attrs.max_sge, which makes
the code easier to understand, and enables rpcrdma_sendctx_create
to calculate the size of the SGE array correctly.
Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: 16f906d66c ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Highlights include:
Stable bugfixes:
- Fix breakages in the nfsstat utility due to the inclusion of the NFSv4
LOOKUPP operation.
- Fix a NULL pointer dereference in nfs_idmap_prepare_pipe_upcall() due to
nfs_idmap_legacy_upcall() being called without an 'aux' parameter.
- Fix a refcount leak in the standard O_DIRECT error path.
- Fix a refcount leak in the pNFS O_DIRECT fallback to MDS path.
- Fix CPU latency issues with nfs_commit_release_pages()
- Fix the LAYOUTUNAVAILABLE error case in the file layout type.
- NFS: Fix a race between mmap() and O_DIRECT
Features:
- Support the statx() mask and query flags to enable optimisations when
the user is requesting only attributes that are already up to date in
the inode cache, or is specifying the AT_STATX_DONT_SYNC flag.
- Add a module alias for the SCSI pNFS layout type.
Bugfixes:
- Automounting when resolving a NFSv4 referral should preserve the RDMA
transport protocol settings.
- Various other RDMA bugfixes from Chuck.
- pNFS block layout fixes.
- Always set NFS_LOCK_LOST when a lock is lost.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJacHcyAAoJEGcL54qWCgDy5WcP/Aw7GIAVZ+n5B+EIFVvEaWlC
C++eA8vej433ezKRj8IExeCX1C8OKrQZ4iQ3nqIg0mVVQS/Qk+469OhYP9jDagA+
tDZeOs3lbl1EUhUWT+GNxw8bZqKGn9fYzcWFjiYeFTjrcfLYfZTW50V0tofjgmlR
3nQmpPx/56rXE9ZO/EW66HRZWauw7a0hg3/5Ft+F5csqIb3yQOlW8Osp3WClzGBF
to4lS8/IwHvCn3qWAMuivRaMJDxeKrmoJNQh9Kw1Mw3+vurGAjmKo1a153qKPz4N
7wjeP+o3ujc/P7WsJLCIgQRimzSm9FZXMqEVmz07+cIhGbERt2yy0RbHev8bpa+U
3IMj70K9ciPuMZwrAtRAeZL+o9gxlUGUXvTaDUgo4DFgBw9Q5CnMnFn6a725l4h0
nSZsE+bR8d4l/yEjf77SbTrk7atMLfUG1XnKH20i1CUjtd4CaLLzjn81TlbQrfuI
XaFdJUUt63dPTIbhPEk7wHFcITkGZiyXhcepgbaXLiDH/3gyZmqTYzJ2EH14sOC5
NaTueE3ASTiFChvG7jvc89HJN5SN5W11PyzI+GHezx9VkFnPZM3/q2V7oEX2aSld
tkRlDMO4dVmpAA4LVAAarDr0a8ZsJQOdb+yLn21pbmKNAs1vol7tMfJe57ykEV6v
WNAgtKJLtZE0Lh1UyEq0
=5Vv2
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable bugfixes:
- Fix breakages in the nfsstat utility due to the inclusion of the
NFSv4 LOOKUPP operation
- Fix a NULL pointer dereference in nfs_idmap_prepare_pipe_upcall()
due to nfs_idmap_legacy_upcall() being called without an 'aux'
parameter
- Fix a refcount leak in the standard O_DIRECT error path
- Fix a refcount leak in the pNFS O_DIRECT fallback to MDS path
- Fix CPU latency issues with nfs_commit_release_pages()
- Fix the LAYOUTUNAVAILABLE error case in the file layout type
- NFS: Fix a race between mmap() and O_DIRECT
Features:
- Support the statx() mask and query flags to enable optimisations
when the user is requesting only attributes that are already up to
date in the inode cache, or is specifying the AT_STATX_DONT_SYNC
flag
- Add a module alias for the SCSI pNFS layout type
Bugfixes:
- Automounting when resolving a NFSv4 referral should preserve the
RDMA transport protocol settings
- Various other RDMA bugfixes from Chuck
- pNFS block layout fixes
- Always set NFS_LOCK_LOST when a lock is lost"
* tag 'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (69 commits)
NFS: Fix a race between mmap() and O_DIRECT
NFS: Remove a redundant call to unmap_mapping_range()
pnfs/blocklayout: Ensure disk address in block device map
pnfs/blocklayout: pnfs_block_dev_map uses bytes, not sectors
lockd: Fix server refcounting
SUNRPC: Fix null rpc_clnt dereference in rpc_task_queued tracepoint
SUNRPC: Micro-optimize __rpc_execute
SUNRPC: task_run_action should display tk_callback
sunrpc: Format RPC events consistently for display
SUNRPC: Trace xprt_timer events
xprtrdma: Correct some documenting comments
xprtrdma: Fix "bytes registered" accounting
xprtrdma: Instrument allocation/release of rpcrdma_req/rep objects
xprtrdma: Add trace points to instrument QP and CQ access upcalls
xprtrdma: Add trace points in the client-side backchannel code paths
xprtrdma: Add trace points for connect events
xprtrdma: Add trace points to instrument MR allocation and recovery
xprtrdma: Add trace points to instrument memory invalidation
xprtrdma: Add trace points in reply decoder path
xprtrdma: Add trace points to instrument memory registration
..
Pull kern_recvmsg reduction from Al Viro:
"kernel_recvmsg() is a set_fs()-using wrapper for sock_recvmsg(). In
all but one case that is not needed - use of ITER_KVEC for ->msg_iter
takes care of the data and does not care about set_fs(). The only
exception is svc_udp_recvfrom() where we want cmsg to be store into
kernel object; everything else can just use sock_recvmsg() and be done
with that.
A followup converting svc_udp_recvfrom() away from set_fs() (and
killing kernel_recvmsg() off) is *NOT* in here - I'd like to hear what
netdev folks think of the approach proposed in that followup)"
* 'work.sock_recvmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
tipc: switch to sock_recvmsg()
smc: switch to sock_recvmsg()
ipvs: switch to sock_recvmsg()
mISDN: switch to sock_recvmsg()
drbd: switch to sock_recvmsg()
lustre lnet_sock_read(): switch to sock_recvmsg()
cfs2: switch to sock_recvmsg()
ncpfs: switch to sock_recvmsg()
dlm: switch to sock_recvmsg()
svc_recvfrom(): switch to sock_recvmsg()
Pull poll annotations from Al Viro:
"This introduces a __bitwise type for POLL### bitmap, and propagates
the annotations through the tree. Most of that stuff is as simple as
'make ->poll() instances return __poll_t and do the same to local
variables used to hold the future return value'.
Some of the obvious brainos found in process are fixed (e.g. POLLIN
misspelled as POLL_IN). At that point the amount of sparse warnings is
low and most of them are for genuine bugs - e.g. ->poll() instance
deciding to return -EINVAL instead of a bitmap. I hadn't touched those
in this series - it's large enough as it is.
Another problem it has caught was eventpoll() ABI mess; select.c and
eventpoll.c assumed that corresponding POLL### and EPOLL### were
equal. That's true for some, but not all of them - EPOLL### are
arch-independent, but POLL### are not.
The last commit in this series separates userland POLL### values from
the (now arch-independent) kernel-side ones, converting between them
in the few places where they are copied to/from userland. AFAICS, this
is the least disruptive fix preserving poll(2) ABI and making epoll()
work on all architectures.
As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
it will trigger only on what would've triggered EPOLLWRBAND on other
architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
at all on sparc. With this patch they should work consistently on all
architectures"
* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
make kernel-side POLL... arch-independent
eventpoll: no need to mask the result of epi_item_poll() again
eventpoll: constify struct epoll_event pointers
debugging printk in sg_poll() uses %x to print POLL... bitmap
annotate poll(2) guts
9p: untangle ->poll() mess
->si_band gets POLL... bitmap stored into a user-visible long field
ring_buffer_poll_wait() return value used as return value of ->poll()
the rest of drivers/*: annotate ->poll() instances
media: annotate ->poll() instances
fs: annotate ->poll() instances
ipc, kernel, mm: annotate ->poll() instances
net: annotate ->poll() instances
apparmor: annotate ->poll() instances
tomoyo: annotate ->poll() instances
sound: annotate ->poll() instances
acpi: annotate ->poll() instances
crypto: annotate ->poll() instances
block: annotate ->poll() instances
x86: annotate ->poll() instances
...
New features:
- xprtrdma tracepoints
Bugfixes and cleanups:
- Fix memory leak if rpcrdma_buffer_create() fails
- Fix allocating extra rpcrdma_reps for the backchannel
- Remove various unused and redundant variables and lock cycles
- Fix IPv6 support in xprt_rdma_set_port()
- Fix memory leak by calling buf_free for callback replies
- Fix "bytes registered" accounting
- Fix kernel-doc comments
- SUNRPC tracepoint cleanups for consistent information
- Optimizations for __rpc_execute()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlpniEEACgkQ18tUv7Cl
QOsoGQ//ctg5iKTLlnjobnPdtS4zCQCgD8upx/FLawo/LCkDRiRD+lPNbSWy25db
Z/YcYslUuwli3J9PoF36DxBAoep+N5LvD2IIwifr72tbwaRR+gK/xRdxdbiFXYod
9HK+qBN5qRcgKpxPoHHGj9CoF/6ATcqyLhy7bP30UjNhydmlkaYEF1KXeO/l4NQu
CQUyumdy0M9T8QW4EkXY2vkjS9leC4byI8P4iMaJnq157sYYVxLMryDGayK4ZJNH
kmeAZoDhvNm+HKyD7vX90+w8/LPQutJ/A7m309KyMHiZZ23KxDW7r9ZclzcOswd1
NY8HjyUHjl6gvFN9DmtOz95eKmmh1SpCQstw6lKTtNBVzP3+Dw9Lp2tLQFwuCe9y
uzD/GYcFJECyBoHU4xsm9kV0PuzwAfvXahnylLjf7yV4MlDpjOZQT/vMvA96XHo6
oOpT3+zikwCHz2FZM0cj5fQ0uQrxsx2zxAcmr8v5StC17Ng94LQH5ByAiOleJ14p
Az+mAYATNgmSDD07+7UrEH7LD1b2h2xmdbaOa7UXeK2jJaYewGlUEKhKXIhjKJy9
t2yecGKj4Y7omkNf/0YHle2t6WhNyFqjpISiF1KwuGHuqU5krPCpgKBOnuABcJTx
D1BuSw4YQ5ZDfb2w0HAbPuwwRnU5FY1K79MIm7VkdE9ozWfmShA=
=67UR
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-4.16-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
NFS-over-RDMA client updates for Linux 4.16
New features:
- xprtrdma tracepoints
Bugfixes and cleanups:
- Fix memory leak if rpcrdma_buffer_create() fails
- Fix allocating extra rpcrdma_reps for the backchannel
- Remove various unused and redundant variables and lock cycles
- Fix IPv6 support in xprt_rdma_set_port()
- Fix memory leak by calling buf_free for callback replies
- Fix "bytes registered" accounting
- Fix kernel-doc comments
- SUNRPC tracepoint cleanups for consistent information
- Optimizations for __rpc_execute()
The common case: There are 13 to 14 actions per RPC, and tk_callback
is non-NULL in only one of them. There's no need to store a NULL in
the tk_callback field during each FSM step.
This slightly improves throughput results in dbench and other multi-
threaded benchmarks on my two-socket client on 56Gb InfiniBand, but
will probably be inconsequential on slower systems.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This shows up in every RPC:
kworker/4:1-19772 [004] 3467.373443: rpc_task_run_action: task:4711@2 flags=0e81 state=0005 status=0 action=call_status
kworker/4:1-19772 [004] 3467.373444: rpc_task_run_action: task:4711@2 flags=0e81 state=0005 status=0 action=call_status
What's actually going on is that the first iteration of the RPC
scheduler is invoking the function in tk_callback (in this case,
xprt_timer), then invoking call_status on the next iteration.
Feeding do_action, rather than tk_action, to the "task_run_action"
trace point will now always display the correct FSM step.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Track RPC timeouts: report the XID and the server address to match
the content of network capture.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fix kernel-doc warnings in net/sunrpc/xprtrdma/ .
net/sunrpc/xprtrdma/verbs.c:1575: warning: No description found for parameter 'count'
net/sunrpc/xprtrdma/verbs.c:1575: warning: Excess function parameter 'min_reqs' description in 'rpcrdma_ep_post_extra_recv'
net/sunrpc/xprtrdma/backchannel.c:288: warning: No description found for parameter 'r_xprt'
net/sunrpc/xprtrdma/backchannel.c:288: warning: Excess function parameter 'xprt' description in 'rpcrdma_bc_receive_call'
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The contents of seg->mr_len changed when ->ro_map stopped returning
the full chunk length in the first segment. Count the full length of
each Write chunk, not the length of the first segment (which now can
only be as large as a page).
Fixes: 9d6b040978 ("xprtrdma: Place registered MWs on a ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>