Clean up: After commit 1e091c3bbf ("svcrdma: Ignore source port
when computing DRC hash"), the IP address stored in xpt_remote
always has a port number of zero. Thus, there's no need to display
the port number when displaying the IP address of a remote NFS/RDMA
client.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: Use a consistent naming convention so that these trace
points can be enabled quickly via a glob.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Way back when I was writing the RPC/RDMA server-side backchannel
code, I misread the TCP backchannel reply handler logic. When
svc_tcp_recvfrom() successfully receives a backchannel reply, it
does not return -EAGAIN. It sets XPT_DATA and returns zero.
Update svc_rdma_recvfrom() to return zero. Here, XPT_DATA doesn't
need to be set again: it is set whenever a new message is received,
behind a spin lock in a single threaded context.
Also, if handling the cb reply is not successful, the message is
simply dropped. There's no special message framing to deal with as
there is in the TCP case.
Now that the handle_bc_reply() return value is ignored, I've removed
the dprintk call sites in the error exit of handle_bc_reply() in
favor of trace points in other areas that already report the error
cases.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: Replace a dprintk call site.
This is the last remaining dprintk call site in svc_rdma_rw.c, so
remove dprintk infrastructure as well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- De-duplicate code
- Rename the tracepoint with "_err" to allow enabling via glob
- Report the sg_cnt for the failing rw_ctx
- Fix a dumb signage issue
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It appears that the RPC/RDMA transport does not need serialization
of calls to its xpo_sendto method. Move the mutex into the socket
methods that still need that serialization.
Tail latencies are unambiguously better with this patch applied.
fio randrw 8KB 70/30 on NFSv3, smaller numbers are better:
clat percentiles (usec):
With xpt_mutex:
r | 99.99th=[ 8848]
w | 99.99th=[ 9634]
Without xpt_mutex:
r | 99.99th=[ 8586]
w | 99.99th=[ 8979]
Serializing the construction of RPC/RDMA transport headers is not
really necessary at this point, because the Linux NFS server
implementation never changes its credit grant on a connection. If
that should change, then svc_rdma_sendto will need to serialize
access to the transport's credit grant fields.
Reported-by: kbuild test robot <lkp@intel.com>
[ cel: fix uninitialized variable warning ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Highlights include:
Stable fixes
- fix handling of backchannel binding in BIND_CONN_TO_SESSION
Bugfixes
- Fix a credential use-after-free issue in pnfs_roc()
- Fix potential posix_acl refcnt leak in nfs3_set_acl
- defer slow parts of rpc_free_client() to a workqueue
- Fix an Oopsable race in __nfs_list_for_each_server()
- Fix trace point use-after-free race
- Regression: the RDMA client no longer responds to server disconnect requests
- Fix return values of xdr_stream_encode_item_{present, absent}
- _pnfs_return_layout() must always wait for layoutreturn completion
Cleanups
- Remove unreachable error conditions
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6tczsACgkQZwvnipYK
APKHWg//QGx2Tolj5dh2jBHa47A5/SYnJxCZAA0/fWdwRtFkW3HyyGne1jU86do2
SMAVpBpri1WJPt5d3DH66gu4l4UxG1h84s7QP4lGfSa85EmtLh+LoZQCZRqYoDOo
JAMzWctELu1TUpaa1N5Dhg/qMtMy6ulRMWgzTLqB9a/pQa3onugTK6W7xiut2prj
PBfFq7N9XXmPboSeGV9bR4L8XKSbTCLEt3U1F2zAGU7UUINvDfpjEXq7BHYCewKL
ObPW6EWZksyna16H8i/xGWoKgE4JFVjMwQAP7UdDBi+FW9RI6UpTBoR6z9N748j0
jEocDbI21wgnwmtrVTbzsYm6ttHl4D4egoNxn7m5zjxTU4Ba/RQG2aaHUGFOYpJj
1FI1f6V1Y5v4mJajdsEH+pGW/4vK/4YMR+7YHJ/hYU/WiXjLf7onIIifdWt4SQdo
lvZbGcx6IAHYUA4lI7hkcvrK4bbqAnPLFq28nlUWEID5q5D+nA1ZR9iN0FToviDy
FYyhQzyfD1kt98SV1DjWUqvDDd6IB64iDZTXGmtWvj6c2nbezGiFffvtzUL5LFxY
QfI8lkpmUyt1EiWlZWhtOh4zsiM5yMZkJB/3RJv3RMmswizSSAHdgCKWhdLpX0bl
TG1L8yEmcTc5ANS37EhlpcBNbfYw7oIF/OXuReTSRoMQl5hxjfY=
=w0zk
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- fix handling of backchannel binding in BIND_CONN_TO_SESSION
Bugfixes:
- Fix a credential use-after-free issue in pnfs_roc()
- Fix potential posix_acl refcnt leak in nfs3_set_acl
- defer slow parts of rpc_free_client() to a workqueue
- Fix an Oopsable race in __nfs_list_for_each_server()
- Fix trace point use-after-free race
- Regression: the RDMA client no longer responds to server disconnect
requests
- Fix return values of xdr_stream_encode_item_{present, absent}
- _pnfs_return_layout() must always wait for layoutreturn completion
Cleanups:
- Remove unreachable error conditions"
* tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix a race in __nfs_list_for_each_server()
NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION
SUNRPC: defer slow parts of rpc_free_client() to a workqueue.
NFSv4: Remove unreachable error condition due to rpc_run_task()
SUNRPC: Remove unreachable error condition
xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}
xprtrdma: Fix trace point use-after-free race
xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl
NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc()
NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.
As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
These new helpers do not return 0 on success, they return the
encoded size. Thus they are not a drop-in replacement for the
old helpers.
Fixes: 5c266df527 ("SUNRPC: Add encoders for list item discriminators")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
It's not safe to use resources pointed to by the @send_wr of
ib_post_send() _after_ that function returns. Those resources are
typically freed by the Send completion handler, which can run before
ib_post_send() returns.
Thus the trace points currently around ib_post_send() in the
client's RPC/RDMA transport are a hazard, even when they are
disabled. Rearrange them so that they touch the Work Request only
_before_ ib_post_send() is invoked.
Fixes: ab03eff58e ("xprtrdma: Add trace points in RPC Call transmit paths")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit e28ce90083 ("xprtrdma: kmalloc rpcrdma_ep separate from
rpcrdma_xprt") erroneously removed a xprt_force_disconnect()
call from the "transport disconnect" path. The result was that the
client no longer responded to server-side disconnect requests.
Restore that call.
Fixes: e28ce90083 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Utilize the xpo_release_rqst transport method to ensure that each
rqstp's svc_rdma_recv_ctxt object is released even when the server
cannot return a Reply for that rqstp.
Without this fix, each RPC whose Reply cannot be sent leaks one
svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped
Receive buffer, and any pages that might be part of the Reply
message.
The leak is infrequent unless the network fabric is unreliable or
Kerberos is in use, as GSS sequence window overruns, which result
in connection loss, are more common on fast transports.
Fixes: 3a88092ee3 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently, after the forward channel connection goes away,
backchannel operations are causing soft lockups on the server
because call_transmit_status's SOFTCONN logic ignores ENOTCONN.
Such backchannel Calls are aggressively retried until the client
reconnects.
Backchannel Calls should use RPC_TASK_NOCONNECT rather than
RPC_TASK_SOFTCONN. If there is no forward connection, the server is
not capable of establishing a connection back to the client, thus
that backchannel request should fail before the server attempts to
send it. Commit 58255a4e3c ("NFSD: NFSv4 callback client should
use RPC_TASK_SOFTCONN") was merged several years before
RPC_TASK_NOCONNECT was available.
Because setup_callback_client() explicitly sets NOPING, the NFSv4.0
callback connection depends on the first callback RPC to initiate
a connection to the client. Thus NFSv4.0 needs to continue to use
RPC_TASK_SOFTCONN.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org> # v4.20+
Highlights include:
Stable fixes:
- Fix a page leak in nfs_destroy_unlinked_subrequests()
- Fix use-after-free issues in nfs_pageio_add_request()
- Fix new mount code constant_table array definitions
- finish_automount() requires us to hold 2 refs to the mount record
Features:
- Improve the accuracy of telldir/seekdir by using 64-bit cookies when
possible.
- Allow one RDMA active connection and several zombie connections to
prevent blocking if the remote server is unresponsive.
- Limit the size of the NFS access cache by default
- Reduce the number of references to credentials that are taken by NFS
- pNFS files and flexfiles drivers now support per-layout segment
COMMIT lists.
- Enable partial-file layout segments in the pNFS/flexfiles driver.
- Add support for CB_RECALL_ANY to the pNFS flexfiles layout type
- pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
the DS using the layouterror mechanism.
Bugfixes and cleanups:
- SUNRPC: Fix krb5p regressions
- Don't specify NFS version in "UDP not supported" error
- nfsroot: set tcp as the default transport protocol
- pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
- alloc_nfs_open_context() must use the file cred when available
- Fix locking when dereferencing the delegation cred
- Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails
- Various clean ups of the NFS O_DIRECT commit code
- Clean up RDMA connect/disconnect
- Replace zero-length arrays with C99-style flexible arrays
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6LhhsACgkQZwvnipYK
APIOJxAAiQOgmIg1CV4mrlcVhkwy09N5JAia6AENtoTmwm08nAYg5Y8REb9uX46a
/MJsM2WG8hBCgI6eYmRY8LTr4Ft9rTQEJM9DRMuwQREXwMWwBhUv/QakCeqY1lHE
lyB1z4hj5XKeUoN/OcfALC/GXFFf56A0UyN05nMzeCkBTdd3+qu+hW8Ge1wkAXcr
f0pyLbzdFZlJuTmI4tr8F93g9p3ezuFBuEroT7XPIVJylAdZVumHqnOnz/Mvb99x
rNTsX2dc44GhSAfRnTzPumU3MT6BOLvUzNH1xzdiqKzJrbOnG8WjFodrGr3JWpfp
HkeyYQxJ+Hnfb2LiZBjvMQE8M7kVMZ1jVbrGJEbCxfSqgTly8lOHboqAeKsFaReK
LStnusizdA1LHQVZxPdvn+oL49RDxnzm9dY+DkrXK1qT0GE+icN1CyTyLLfkSCp8
tYvZSJ/qPk5BNZegqH1nBqXkMDkOJ4eEA7+luXDmajRkdRrZ3IWY2M1DpMEoueJ2
j/zoj/NFr1oErU4o7PV9oolA1Euhn1L3wIDuzsbVtjySmbXJNQTtaVVRFpGw3SsZ
7rbqi4BB0SzOooNhQ4q8mLNi4qT7bl/3D04eL8UVzEM73plexhQ8XiOEz/VrIRX7
L9viXH49g4DHQ0rZIaWefxFueqpgbNvQwnlLZl2uQotG9hwhTts=
=YUcP
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a page leak in nfs_destroy_unlinked_subrequests()
- Fix use-after-free issues in nfs_pageio_add_request()
- Fix new mount code constant_table array definitions
- finish_automount() requires us to hold 2 refs to the mount record
Features:
- Improve the accuracy of telldir/seekdir by using 64-bit cookies
when possible.
- Allow one RDMA active connection and several zombie connections to
prevent blocking if the remote server is unresponsive.
- Limit the size of the NFS access cache by default
- Reduce the number of references to credentials that are taken by
NFS
- pNFS files and flexfiles drivers now support per-layout segment
COMMIT lists.
- Enable partial-file layout segments in the pNFS/flexfiles driver.
- Add support for CB_RECALL_ANY to the pNFS flexfiles layout type
- pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
the DS using the layouterror mechanism.
Bugfixes and cleanups:
- SUNRPC: Fix krb5p regressions
- Don't specify NFS version in "UDP not supported" error
- nfsroot: set tcp as the default transport protocol
- pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
- alloc_nfs_open_context() must use the file cred when available
- Fix locking when dereferencing the delegation cred
- Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails
- Various clean ups of the NFS O_DIRECT commit code
- Clean up RDMA connect/disconnect
- Replace zero-length arrays with C99-style flexible arrays"
* tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits)
NFS: Clean up process of marking inode stale.
SUNRPC: Don't start a timer on an already queued rpc task
NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()
NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode()
NFS: Beware when dereferencing the delegation cred
NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout
NFS: finish_automount() requires us to hold 2 refs to the mount record
NFS: Fix a few constant_table array definitions
NFS: Try to join page groups before an O_DIRECT retransmission
NFS: Refactor nfs_lock_and_join_requests()
NFS: Reverse the submission order of requests in __nfs_pageio_add_request()
NFS: Clean up nfs_lock_and_join_requests()
NFS: Remove the redundant function nfs_pgio_has_mirroring()
NFS: Fix memory leaks in nfs_pageio_stop_mirroring()
NFS: Fix a request reference leak in nfs_direct_write_clear_reqs()
NFS: Fix use-after-free issues in nfs_pageio_add_request()
NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
NFS: Fix a page leak in nfs_destroy_unlinked_subrequests()
NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio()
pNFS/flexfiles: Specify the layout segment range in LAYOUTGET
...
Change the rpcrdma_xprt_disconnect() function so that it no longer
waits for the DISCONNECTED event. This prevents blocking if the
remote is unresponsive.
In rpcrdma_xprt_disconnect(), the transport's rpcrdma_ep is
detached. Upon return from rpcrdma_xprt_disconnect(), the transport
(r_xprt) is ready immediately for a new connection.
The RDMA_CM_DEVICE_REMOVAL and RDMA_CM_DISCONNECTED events are now
handled almost identically.
However, because the lifetimes of rpcrdma_xprt structures and
rpcrdma_ep structures are now independent, creating an rpcrdma_ep
needs to take a module ref count. The ep now owns most of the
hardware resources for a transport.
Also, a kref is needed to ensure that rpcrdma_ep sticks around
long enough for the cm_event_handler to finish.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcrdma_cm_event_handler() is always passed an @id pointer that is
valid. However, in a subsequent patch, we won't be able to extract
an r_xprt in every case. So instead of using the r_xprt's
presentation address strings, extract them from struct rdma_cm_id.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I eventually want to allocate rpcrdma_ep separately from struct
rpcrdma_xprt so that on occasion there can be more than one ep per
xprt.
The new struct rpcrdma_ep will contain all the fields currently in
rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings
for the connection, in addition to per-connection settings
negotiated with the remote.
Take this opportunity to rename the existing ep fields from rep_* to
re_* to disambiguate these from struct rpcrdma_rep.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Completion errors after a disconnect often occur much sooner than a
CM_DISCONNECT event. Use this to try to detect connection loss more
quickly.
Note that other kernel ULPs do take care to disconnect explicitly
when a WR is flushed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up:
The upper layer serializes calls to xprt_rdma_close, so there is no
need for an atomic bit operation, saving 8 bytes in rpcrdma_ia.
This enables merging rpcrdma_ia_remove directly into the disconnect
logic.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Move rdma_cm_id creation into rpcrdma_ep_create() so that it is now
responsible for allocating all per-connection hardware resources.
With this clean-up, all three arms of the switch statement in
rpcrdma_ep_connect are exactly the same now, thus the switch can be
removed.
Because device removal behaves a little differently than
disconnection, there is a little more work to be done before
rpcrdma_ep_destroy() can release the connection's rdma_cm_id. So
it is not quite symmetrical with rpcrdma_ep_create() yet.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make a Protection Domain (PD) a per-connection resource rather than
a per-transport resource. In other words, when the connection
terminates, the PD is destroyed.
Thus there is one less HW resource that remains allocated to a
transport after a connection is closed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Simplify the synopses of functions in the connect and
disconnect paths in preparation for combining the rpcrdma_ia and
struct rpcrdma_ep structures.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Simplify the synopses of functions in the post_send path
by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: prepare for combining the rpcrdma_ia and rpcrdma_ep
structures. Take the opportunity to rename the function to be
consistent with the "subsystem _ object _ verb" naming scheme.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Refactor rpcrdma_ep_create(), rpcrdma_ep_disconnect(), and
rpcrdma_ep_destroy().
rpcrdma_ep_create will be invoked at connect time instead of at
transport set-up time. It will be responsible for allocating per-
connection resources. In this patch it allocates the CQs and
creates a QP. More to come.
rpcrdma_ep_destroy() is the inverse functionality that is
invoked at disconnect time. It will be responsible for releasing
the CQs and QP.
These changes should be safe to do because both connect and
disconnect is guaranteed to be serialized by the transport send
lock.
This takes us another step closer to resolving the address and route
only at connect time so that connection failover to another device
will work correctly.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Two changes:
- Show the number of SG entries that were mapped. This helps debug
DMA-related problems.
- Record the MR's resource ID instead of its memory address. This
groups each MR with its associated rdma-tool output, and reduces
needless exposure of memory addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
On some platforms, DMA mapping part of a page is more costly than
copying bytes. Indeed, not involving the I/O MMU can help the
RPC/RDMA transport scale better for tiny I/Os across more RDMA
devices. This is because interaction with the I/O MMU is eliminated
for each of these small I/Os. Without the explicit unmapping, the
NIC no longer needs to do a costly internal TLB shoot down for
buffers that are just a handful of bytes.
Since pull-up is now a more a frequent operation, I've introduced a
trace point in the pull-up path. It can be used for debugging or
user-space tools that count pull-up frequency.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Performance optimization: Avoid syncing the transport buffer twice
when Reply buffer pull-up is necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Same idea as the receive-side changes I did a while back: use
xdr_stream helpers rather than open-coding the XDR chunk list
encoders. This builds the Reply transport header from beginning to
end without backtracking.
As additional clean-ups, fill in documenting comments for the XDR
encoders and sprinkle some trace points in the new encoding
functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up. These are taken from the client-side RPC/RDMA transport
to a more global header file so they can be used elsewhere.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
These trace points are misnamed:
trace_svcrdma_encode_wseg
trace_svcrdma_encode_write
trace_svcrdma_encode_reply
trace_svcrdma_encode_rseg
trace_svcrdma_encode_read
trace_svcrdma_encode_pzr
Because they actually trace posting on the Send Queue. Let's rename
them so that I can add trace points in the chunk list encoders that
actually do trace chunk list encoding events.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Preparing for subsequent patches, no behavior change expected.
Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into those lower-level functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Preparing for subsequent patches, no behavior change expected.
Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into those lower-level functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Preparing for subsequent patches, no behavior change expected.
Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into the lower-level send
functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cache the locations of the Requester-provided Write list and Reply
chunk so that the Send path doesn't need to parse the Call header
again.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The logic that checks incoming network headers has to be scrupulous.
De-duplicate: replace open-coded buffer overflow checks with the use
of xdr_stream helpers that are used most everywhere else XDR
decoding is done.
One minor change to the sanity checks: instead of checking the
length of individual segments, cap the length of the whole chunk
to be sure it can fit in the set of pages available in rq_pages.
This should be a better test of whether the server can handle the
chunks in each request.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up. This trace point is no longer needed because the RDMA/core
CMA code has an equivalent trace point that was added by commit
ed999f820a ("RDMA/cma: Add trace points in RDMA Connection
Manager").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Introduce a helper function to compute the XDR pad size of a
variable-length XDR object.
Clean up: Replace open-coded calculation of XDR pad sizes.
I'm sure I haven't found every instance of this calculation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This error path is almost never executed. Found by code inspection.
Fixes: 99722fe4d5 ("svcrdma: Persistently allocate and DMA-map Send buffers")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svcrdma expects that the payload falls precisely into the xdr_buf
page vector. This does not seem to be the case for
nfsd4_encode_readv().
This code is called only when fops->splice_read is missing or when
RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
common cases.
Add new transport method: ->xpo_read_payload so that when a READ
payload does not fit exactly in rq_res's page vector, the XDR
encoder can inform the RPC transport exactly where that payload is,
without the payload's XDR pad.
That way, when a Write chunk is present, the transport knows what
byte range in the Reply message is supposed to be matched with the
chunk.
Note that the Linux NFS server implementation of NFS/RDMA can
currently handle only one Write chunk per RPC-over-RDMA message.
This simplifies the implementation of this fix.
Fixes: b042098063 ("nfsd4: allow exotic read compounds")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The @nents value that was passed to ib_dma_map_sg() has to be passed
to the matching ib_dma_unmap_sg() call. If ib_dma_map_sg() choses to
concatenate sg entries, it will return a different nents value than
it was passed.
The bug was exposed by recent changes to the AMD IOMMU driver, which
enabled sg entry concatenation.
Looking all the way back to commit 4143f34e01 ("xprtrdma: Port to
new memory registration API") and reviewing other kernel ULPs, it's
not clear that the frwr_map() logic was ever correct for this case.
Reported-by: Andre Tomt <andre@tomt.net>
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: This simplifies the logic in rpcrdma_post_recvs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
To safely get rid of all rpcrdma_reps from a particular connection
instance, xprtrdma has to wait until each of those reps is finished
being used. A rep may be backing the rq_rcv_buf of an RPC that has
just completed, for example.
Since it is safe to invoke rpcrdma_rep_destroy() only in the Receive
completion handler, simply mark reps remaining in the rb_all_reps
list after the transport is drained. These will then be deleted as
rpcrdma_post_recvs pulls them off the rep free list.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This reduces the hardware and memory footprint of an unconnected
transport.
At some point in the future, transport reconnect will allow
resolving the destination IP address through a different device. The
current change enables reps for the new connection to be allocated
on whichever NUMA node the new device affines to after a reconnect.
Note that this does not destroy _all_ the transport's reps... there
will be a few that are still part of a running RPC completion.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently the underlying RDMA device is chosen at transport set-up
time. But it will soon be at connect time instead.
The maximum size of a transport header is based on device
capabilities. Thus transport header buffers have to be allocated
_after_ the underlying device has been chosen (via address and route
resolution); ie, in the connect worker.
Thus, move the allocation of transport header buffers to the connect
worker, after the point at which the underlying RDMA device has been
chosen.
This also means the RDMA device is available to do a DMA mapping of
these buffers at connect time, instead of in the hot I/O path. Make
that optimization as well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Refactor: Perform the "is supported" check in rpcrdma_ep_create()
instead of in rpcrdma_ia_open(). frwr_open() is where most of the
logic to query device attributes is already located.
The current code displays a redundant error message when the device
does not support FRWR. As an additional clean-up, this patch removes
the extra message.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
To support device hotplug and migrating a connection between devices
of different capabilities, we have to guarantee that all in-kernel
devices can support the same max NFS payload size (1 megabyte).
This means that possibly one or two in-tree devices are no longer
supported for NFS/RDMA because they cannot support 1MB rsize/wsize.
The only one I confirmed was cxgb3, but it has already been removed
from the kernel.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: there is no need to keep two copies of the same value.
Also, in subsequent patches, rpcrdma_ep_create() will be called in
the connect worker rather than at set-up time.
Minor fix: Initialize the transport's sendctx to the value based on
the capabilities of the underlying device, not the maximum setting.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The size of the sendctx queue depends on the value stored in
ia->ri_max_send_sges. This value is determined by querying the
underlying device.
Eventually, rpcrdma_ia_open() and rpcrdma_ep_create() will be called
in the connect worker rather than at transport set-up time. The
underlying device will not have been chosen device set-up time.
The sendctx queue will thus have to be created after the underlying
device has been chosen via address and route resolution; in other
words, in the connect worker.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean-up. The max_send_sge value also happens to be stored in
ep->rep_attr. Let's keep just a single copy.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Since v5.4, a device removal occasionally triggered this oops:
Dec 2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219
Dec 2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode
Dec 2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page
Dec 2 17:13:53 manet kernel: PGD 0 P4D 0
Dec 2 17:13:53 manet kernel: Oops: 0000 [#1] SMP
Dec 2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G W 5.4.0-00050-g53717e43af61 #883
Dec 2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Dec 2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
Dec 2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma]
Dec 2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 <48> 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0
Dec 2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246
Dec 2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048
Dec 2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001
Dec 2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000
Dec 2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428
Dec 2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010
Dec 2 17:13:53 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000
Dec 2 17:13:53 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0
Dec 2 17:13:53 manet kernel: Call Trace:
Dec 2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core]
Dec 2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core]
Dec 2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd
Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
Dec 2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a
Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
Dec 2 17:13:53 manet kernel: kthread+0xf4/0xf9
Dec 2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74
Dec 2 17:13:53 manet kernel: ret_from_fork+0x24/0x30
The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that
is still pointing to the old ib_device, which has been freed. The
only way that is possible is if this rpcrdma_rep was not destroyed
by rpcrdma_ia_remove.
Debugging showed that was indeed the case: this rpcrdma_rep was
still in use by a completing RPC at the time of the device removal,
and thus wasn't on the rep free list. So, it was not found by
rpcrdma_reps_destroy().
The fix is to introduce a list of all rpcrdma_reps so that they all
can be found when a device is removed. That list is used to perform
only regbuf DMA unmapping, replacing that call to
rpcrdma_reps_destroy().
Meanwhile, to prevent corruption of this list, I've moved the
destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs().
rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is
not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus
protecting the rb_all_reps list.
Fixes: b0b227f071 ("xprtrdma: Use an llist to manage free rpcrdma_reps")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I've found that on occasion, "rmmod <dev>" will hang while if an NFS
is under load.
Ensure that ri_remove_done is initialized only just before the
transport is woken up to force a close. This avoids the completion
possibly getting initialized again while the CM event handler is
waiting for a wake-up.
Fixes: bebd031866 ("xprtrdma: Support unplugging an HCA from under an NFS mount")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Possibly most interesting is Trond's fixes for some callback races that
were due to my incomplete understanding of rpc client shutdown.
Unfortunately at the last minute I've started noticing a new
intermittent failure to send callbacks. As the logic seems basically
correct, I'm leaving Trond's patches in for now, and hope to find a fix
in the next week so I don't have to revert those patches.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl3r3AAVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+rjkP/3L6DZs0Uv0BYbGq5Gmit0uoPSQk
8BT7oQhbagCh+ULRYWCnK6cz82wejR4Gzq4PLyl5x5Vcc5x+bLoPI9YgiRZlIbZu
ZvSg93E6SITLfq5xRlDC0MlIVZkI+HoIfyYgv1aYiWvQ3834bcx4DxVm9h7cNpT3
x37anEFi1lv3n9fct3obOrs3AvCS76XyA6VVhcSLJ77amKQ+O7LI0crqUc6cuX2i
CkTwTSDwyCrzkx3dZ2xDPDTbLecxw+Ce4adaby5v3GEQo3TOCmEWX92D3dvzfMmv
ICU07FsVOILnIT/fmC91b1+JWVRLjUUBw5EPmDduwSP/yw4YnIEODFEP/wAUAmMJ
vJ9hi9c1rThQ9n8h08RIwA2snhnpXRxKCWhpIRY6WM8DhHL9Y9AuVPYTKxhQOjPK
l3wbOGcMW63NrTOPHHN7hTB0vDLgPKIXYVIrMvZTd/P7CghDDEbhT1gDvx/IL3Uq
WrHKbJtK7rbx9i2bh5f6fH0DRrv7lxbD0ffunRRa3twPAe6zsG9WPjsbZZraZzEg
O7/o3wZu2N7MpL5bXPfzB+5ylOTxvNWew07NJjA4BIOfwin3bw/71YfB0Vnoairv
PhmbN2Dj4/t82ld0JU5GJWojpUfH4ARXM2Li9WO99wzx+KrxScsqGPnRMFe9dC7b
Q7ltP1p0gUbkJ88Z
=b2zA
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"This is a relatively quiet cycle for nfsd, mainly various bugfixes.
Possibly most interesting is Trond's fixes for some callback races
that were due to my incomplete understanding of rpc client shutdown.
Unfortunately at the last minute I've started noticing a new
intermittent failure to send callbacks. As the logic seems basically
correct, I'm leaving Trond's patches in for now, and hope to find a
fix in the next week so I don't have to revert those patches"
* tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux: (24 commits)
nfsd: depend on CRYPTO_MD5 for legacy client tracking
NFSD fixing possible null pointer derefering in copy offload
nfsd: check for EBUSY from vfs_rmdir/vfs_unink.
nfsd: Ensure CLONE persists data and metadata changes to the target file
SUNRPC: Fix backchannel latency metrics
nfsd: restore NFSv3 ACL support
nfsd: v4 support requires CRYPTO_SHA256
nfsd: Fix cld_net->cn_tfm initialization
lockd: remove __KERNEL__ ifdefs
sunrpc: remove __KERNEL__ ifdefs
race in exportfs_decode_fh()
nfsd: Drop LIST_HEAD where the variable it declares is never used.
nfsd: document callback_wq serialization of callback code
nfsd: mark cb path down on unknown errors
nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()
nfsd: minor 4.1 callback cleanup
SUNRPC: Fix svcauth_gss_proxy_init()
SUNRPC: Trace gssproxy upcall results
sunrpc: fix crash when cache_head become valid before update
nfsd: remove private bin2hex implementation
...
I noticed that for callback requests, the reported backlog latency
is always zero, and the rtt value is crazy big. The problem was that
rqst->rq_xtime is never set for backchannel requests.
Fixes: 78215759e2 ("SUNRPC: Make RTT measurement more ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
New Features:
- New tracepoints for congestion control and Local Invalidate WRs
Bugfixes and Cleanups:
- Eliminate log noise in call_reserveresult
- Fix unstable connections after a reconnect
- Clean up some code duplication
- Close race between waking a sender and posting a receive
- Fix MR list corruption, and clean up MR usage
- Remove unused rpcrdma_sendctx fields
- Try to avoid DMA mapping pages if it is too costly
- Wake pending tasks if connection fails
- Replace some dprintk()s with tracepoints
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl3MGSwACgkQ18tUv7Cl
QOt3LhAAz4T4DSGDb6QxUGxDRlusvHBpPFXA+GQOMRBVKiWHtrBcZT0UUybAm3Zp
i2B1gVZJSo2JeA5vg3rK0yJmGiNPs4QedTUHiRsISrKOvUo+ITAEdNXgIIB3tAM9
Pkwf0AMSqbzpUjKNHVGDGyhZ2WZM66zsDFI2CFh4Ul7VX/1NhM3xaUgTSEDJhl3z
tE+aULrwGTQvUq4JKXQ3vu4f8rsbxxNKfvaZyQIKPo79nEFdtniVn18u5p010HDP
ldJJAtY9qozhqWKwaSNEj6guW4U9wesLPrb7cBysHWjgivU17bwEbN/ZR3YrxoHI
trpBdr5994FmOCz9mcKxH+BlS0bO7QSPS2r2TpgIMjKCm8scuZlhlnMQxHV8mEpz
EpoC65qgcmqyeeOcIHnA/eN13ZAYgGKsRBIPEWRE/w+3Yz4bupsKZ/blSRzXdJpQ
forMrAGTYa64NqdnRRDxf6PMwk8fqIDeTHTybMSLghUAQi89zYK0tZpgFikwBYRJ
dqNGp4usCLtZ0c2nDnDg00arOZwqPQnxycNexHYNpHOACurCF9FhbaaZjsecZCoy
QsSFN98K6KI5ztPY1p7DL5N36IC3VDwbgi0COKtF+xB3P0pIZ/Pzwo0KbfXQF0KH
dmTrnpNY/Yq71i1ow6LTdYZ3hq7ZGztaXEEkl7udNvK97pP0UR0=
=Jlak
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-5.5-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
NFSoRDMA Client Updates for Linux 5.5
New Features:
- New tracepoints for congestion control and Local Invalidate WRs
Bugfixes and Cleanups:
- Eliminate log noise in call_reserveresult
- Fix unstable connections after a reconnect
- Clean up some code duplication
- Close race between waking a sender and posting a receive
- Fix MR list corruption, and clean up MR usage
- Remove unused rpcrdma_sendctx fields
- Try to avoid DMA mapping pages if it is too costly
- Wake pending tasks if connection fails
- Replace some dprintk()s with tracepoints
If there are RDMA back channel requests being processed by the
server threads, then we should hold a reference to the transport
to ensure it doesn't get freed from underneath us.
Reported-by: Neil Brown <neilb@suse.de>
Fixes: 63cae47005 ("xprtrdma: Handle incoming backward direction RPC calls")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Use a single trace point to record each connection's
negotiated inline thresholds and the computed maximum byte size
of transport headers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Slightly reduce overhead and display more useful information.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For debugging, the op_connect trace point should report the computed
connect delay. We can then ensure that the delay is computed at the
proper times, for example.
As a further clean-up, remove a few low-value "heartbeat" trace
points in the connect path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pending tasks are currently never awoken when the connect worker
fails. The reason is that XPRT_CONNECTED is always clear after a
failure return of rpcrdma_ep_connect, thus the
xprt_test_and_clear_connected() check in xprt_rdma_connect_worker()
always fails.
- xprt_rdma_close always clears XPRT_CONNECTED.
- rpcrdma_ep_connect always clears XPRT_CONNECTED.
After reviewing the TCP connect worker, it appears that there's no
need for extra test_and_set paranoia in xprt_rdma_connect_worker.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
On some platforms, DMA mapping part of a page is more costly than
copying bytes. Restore the pull-up code and use that when we
think it's going to be faster. The heuristic for now is to pull-up
when the size of the RPC message body fits in the buffer underlying
the head iovec.
Indeed, not involving the I/O MMU can help the RPC/RDMA transport
scale better for tiny I/Os across more RDMA devices. This is because
interaction with the I/O MMU is eliminated, as is handling a Send
completion, for each of these small I/Os. Without the explicit
unmapping, the NIC no longer needs to do a costly internal TLB shoot
down for buffers that are just a handful of bytes.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Refactor: Replace spaghetti with code that makes it plain what needs
to be done for each rtype. This makes it easier to add features and
optimizations.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: This field is not needed in the Send completion handler,
so it can be moved to struct rpcrdma_req to reduce the size of
struct rpcrdma_sendctx, and to reduce the amount of memory that
is sloshed between the sending process and the Send completion
process.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Micro-optimization: Save eight bytes in a frequently allocated
structure.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Micro-optimization: Save eight bytes in a frequently allocated
structure.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
ia->ri_id is replaced during a reconnect. The connect_worker runs
with the transport send lock held to prevent ri_id from being
dereferenced by the send_request path during this process.
Currently, however, there is no guarantee that ia->ri_id is stable
in the MR recycling worker, which operates in the background and is
not serialized with the connect_worker in any way.
But now that Local_Inv completions are being done in process
context, we can handle the recycling operation there instead of
deferring the recycling work to another process. Because the
disconnect path drains all work before allowing tear down to
proceed, it is guaranteed that Local Invalidations complete only
while the ri_id pointer is stable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
MRs are now allocated on demand so we can safely throw them away on
disconnect. This way an idle transport can disconnect and it won't
pin hardware MR resources.
Two additional changes:
- Now that all MRs are destroyed on disconnect, there's no need to
check during header marshaling if a req has MRs to recycle. Each
req is sent only once per connection, and now rl_registered is
guaranteed to be empty when rpcrdma_marshal_req is invoked.
- Because MRs are now destroyed in a WQ_MEM_RECLAIM context, they
also must be allocated in a WQ_MEM_RECLAIM context. This reduces
the likelihood that device driver memory allocation will trigger
memory reclaim during NFS writeback.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Close some holes introduced by commit 6dc6ec9e04 ("xprtrdma: Cache
free MRs in each rpcrdma_req") that could result in list corruption.
In addition, the result that is tabulated in @count is no longer
used, so @count is removed.
Fixes: 6dc6ec9e04 ("xprtrdma: Cache free MRs in each rpcrdma_req")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
A recent clean up attempted to separate Receive handling and RPC
Reply processing, in the name of clean layering.
Unfortunately, we can't do this because the Receive Queue has to be
refilled _after_ the most recent credit update from the responder
is parsed from the transport header, but _before_ we wake up the
next RPC sender. That is right in the middle of
rpcrdma_reply_handler().
Usually this isn't a problem because current responder
implementations don't vary their credit grant. The one exception is
when a connection is established: the grant goes from one to a much
larger number on the first Receive. The requester MUST post enough
Receives right then so that any outstanding requests can be sent
without risking RNR and connection loss.
Fixes: 6ceea36890 ("xprtrdma: Refactor Receive accounting")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up/code de-duplication.
Nit: RPC_CWNDSHIFT is incorrect as the initial value for xprt->cwnd.
This mistake does not appear to have operational consequences, since
the cwnd value is replaced with a valid value upon the first Receive
completion.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This is because xprt_request_get_cong() is allowing more than one
RPC Call to be transmitted before the first Receive on the new
connection. The first Receive fills the Receive Queue based on the
server's credit grant. Before that Receive, there is only a single
Receive WR posted because the client doesn't know the server's
credit grant.
Solution is to clear rq_cong on all outstanding rpc_rqsts when the
the cwnd is reset. This is because an RPC/RDMA credit is good for
one connection instance only.
Fixes: 75891f502f ("SUNRPC: Support for congestion control ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When adding frwr_unmap_async way back when, I re-used the existing
trace_xprtrdma_post_send() trace point to record the return code
of ib_post_send.
Unfortunately there are some cases where re-using that trace point
causes a crash. Instead, construct a trace point specific to posting
Local Invalidate WRs that will always be safe to use in that context,
and will act as a trace log eye-catcher for Local Invalidation.
Fixes: 847568942f ("xprtrdma: Remove fr_state")
Fixes: d8099feda4 ("xprtrdma: Reduce context switching due ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Bill Baker <bill.baker@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Capture the total size of Sends, the size of DMA map and the
matching DMA unmap to ensure operation is correct.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
- add a new knfsd file cache, so that we don't have to open and
close on each (NFSv2/v3) READ or WRITE. This can speed up
read and write in some cases. It also replaces our readahead
cache.
- Prevent silent data loss on write errors, by treating write
errors like server reboots for the purposes of write caching,
thus forcing clients to resend their writes.
- Tweak the code that allocates sessions to be more forgiving,
so that NFSv4.1 mounts are less likely to hang when a server
already has a lot of clients.
- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should
now be limited only by the backend filesystem and the
maximum RPC size.
- Allow the server to enforce use of the correct kerberos
credentials when a client reclaims state after a reboot.
And some miscellaneous smaller bugfixes and cleanup.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl2OoFcVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+dRoP/3OW1NxPjpjbCQWZL0M+O3AYJJla
W8E+uoZKMosFEe/ymokMD0Vn5s47jPaMCifMjHZa2GygW8zHN9X2v0HURx/lob+o
/rJXwMn78N/8kdbfDz2FvaCPeT0IuNzRIFBV8/sSXofqwCBwvPo+cl0QGrd4/xLp
X35qlupx62TRk+kbdRjvv8kpS5SJ7BvR+FSA1WubNYWw2hpdEsr2OCFdGq2Wvthy
DK6AfGBXfJGsOE+HGCSj6ejRV6i0UOJ17P8gRSsx+YT0DOe5E7ROjt+qvvRwk489
wmR8Vjuqr1e40eGAUq3xuLfk5F5NgycY4ekVxk/cTVFNwWcz2DfdjXQUlyPAbrSD
SqIyxN1qdKT24gtr7AHOXUWJzBYPWDgObCVBXUGzyL81RiDdhf38HRNjL2TcSDld
tzCjQ0wbPw+iT74v6qQRY05oS+h3JOtDjU4pxsBnxVtNn4WhGJtaLfWW8o1C1QwU
bc4aX3TlYhDmzU7n7Zjt4rFXGJfyokM+o6tPao1Z60Pmsv1gOk4KQlzLtW/jPHx4
ZwYTwVQUKRDBfC62nmgqDyGI3/Qu11FuIxL2KXUCgkwDxNWN4YkwYjOGw9Lb5qKM
wFpxq6CDNZB/IWLEu8Yg85kDPPUJMoI657lZb7Osr/MfBpU0YljcMOIzLBy8uV1u
9COUbPaQipiWGu/0
=diBo
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Add a new knfsd file cache, so that we don't have to open and close
on each (NFSv2/v3) READ or WRITE. This can speed up read and write
in some cases. It also replaces our readahead cache.
- Prevent silent data loss on write errors, by treating write errors
like server reboots for the purposes of write caching, thus forcing
clients to resend their writes.
- Tweak the code that allocates sessions to be more forgiving, so
that NFSv4.1 mounts are less likely to hang when a server already
has a lot of clients.
- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be
limited only by the backend filesystem and the maximum RPC size.
- Allow the server to enforce use of the correct kerberos credentials
when a client reclaims state after a reboot.
And some miscellaneous smaller bugfixes and cleanup"
* tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits)
sunrpc: clean up indentation issue
nfsd: fix nfs read eof detection
nfsd: Make nfsd_reset_boot_verifier_locked static
nfsd: degraded slot-count more gracefully as allocation nears exhaustion.
nfsd: handle drc over-allocation gracefully.
nfsd: add support for upcall version 2
nfsd: add a "GetVersion" upcall for nfsdcld
nfsd: Reset the boot verifier on all write I/O errors
nfsd: Don't garbage collect files that might contain write errors
nfsd: Support the server resetting the boot verifier
nfsd: nfsd_file cache entries should be per net namespace
nfsd: eliminate an unnecessary acl size limit
Deprecate nfsd fault injection
nfsd: remove duplicated include from filecache.c
nfsd: Fix the documentation for svcxdr_tmpalloc()
nfsd: Fix up some unused variable warnings
nfsd: close cached files prior to a REMOVE or RENAME that would replace target
nfsd: rip out the raparms cache
nfsd: have nfsd_test_lock use the nfsd_file cache
nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
...
Stable bugfixes:
- Dequeue the request from the receive queue while we're re-encoding # v4.20+
- Fix buffer handling of GSS MIC without slack # 5.1
Features:
- Increase xprtrdma maximum transport header and slot table sizes
- Add support for nfs4_call_sync() calls using a custom rpc_task_struct
- Optimize the default readahead size
- Enable pNFS filelayout LAYOUTGET on OPEN
Other bugfixes and cleanups:
- Fix possible null-pointer dereferences and memory leaks
- Various NFS over RDMA cleanups
- Various NFS over RDMA comment updates
- Don't receive TCP data into a reset request buffer
- Don't try to parse incomplete RPC messages
- Fix congestion window race with disconnect
- Clean up pNFS return-on-close error handling
- Fixes for NFS4ERR_OLD_STATEID handling
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl2NC04ACgkQ18tUv7Cl
QOs4Tg//bAlGs+dIKixAmeMKmTd6I34laUnuyV/12yPQDgo6bryLrTngfe2BYvmG
2l+8H7yHfR4/gQE4vhR0c15xFgu6pvjBGR0/nNRaXienIPXO4xsQkcaxVA7SFRY2
HjffZwyoBfjyRps0jL+2sTsKbRtSkf9Dn+BONRgesg51jK1jyWkXqXpmgi4uMO4i
ojpTrW81dwo7Yhv08U2A/Q1ifMJ8F9dVYuL5sm+fEbVI/Nxoz766qyB8rs8+b4Xj
3gkfyh/Y1zoMmu6c+r2Q67rhj9WYbDKpa6HH9yX1zM/RLTiU7czMX+kjuQuOHWxY
YiEk73NjJ48WJEep3odess1q/6WiAXX7UiJM1SnDFgAa9NZMdfhqMm6XduNO1m60
sy0i8AdxdQciWYexOXMsBuDUCzlcoj4WYs1QGpY3uqO1MznQS/QUfu65fx8CzaT5
snm6ki5ivqXH/js/0Z4MX2n/sd1PGJ5ynMkekxJ8G3gw+GC/oeSeGNawfedifLKK
OdzyDdeiel5Me1p4I28j1WYVLHvtFmEWEU9oytdG0D/rjC/pgYgW/NYvAao8lQ4Z
06wdcyAM66ViAPrbYeE7Bx4jy8zYRkiw6Y3kIbLgrlMugu3BhIW5Mi3BsgL4f4am
KsqkzUqPZMCOVwDuUILSuPp4uHaR+JTJttywiLniTL6reF5kTiA=
=4Ey6
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Stable bugfixes:
- Dequeue the request from the receive queue while we're re-encoding
# v4.20+
- Fix buffer handling of GSS MIC without slack # 5.1
Features:
- Increase xprtrdma maximum transport header and slot table sizes
- Add support for nfs4_call_sync() calls using a custom
rpc_task_struct
- Optimize the default readahead size
- Enable pNFS filelayout LAYOUTGET on OPEN
Other bugfixes and cleanups:
- Fix possible null-pointer dereferences and memory leaks
- Various NFS over RDMA cleanups
- Various NFS over RDMA comment updates
- Don't receive TCP data into a reset request buffer
- Don't try to parse incomplete RPC messages
- Fix congestion window race with disconnect
- Clean up pNFS return-on-close error handling
- Fixes for NFS4ERR_OLD_STATEID handling"
* tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (53 commits)
pNFS/filelayout: enable LAYOUTGET on OPEN
NFS: Optimise the default readahead size
NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
NFSv4: Fix OPEN_DOWNGRADE error handling
pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state seqid
NFSv4: Add a helper to increment stateid seqids
NFSv4: Handle RPC level errors in LAYOUTRETURN
NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
NFSv4: Clean up pNFS return-on-close error handling
pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors
NFS: remove unused check for negative dentry
NFSv3: use nfs_add_or_obtain() to create and reference inodes
NFS: Refactor nfs_instantiate() for dentry referencing callers
SUNRPC: Fix congestion window race with disconnect
SUNRPC: Don't try to parse incomplete RPC messages
SUNRPC: Rename xdr_buf_read_netobj to xdr_buf_read_mic
SUNRPC: Fix buffer handling of GSS MIC without slack
SUNRPC: RPC level errors should always set task->tk_rpc_status
SUNRPC: Don't receive TCP data into a request buffer that has been reset
...
Eli Dorfman reports that after a series of idle disconnects, an
RPC/RDMA transport becomes unusable (rdma_create_qp returns
-ENOMEM). Problem was tracked down to increasing Send Queue size
after each reconnect.
The rdma_create_qp() API does not promise to leave its @qp_init_attr
parameter unaltered. In fact, some drivers do modify one or more of
its fields. Thus our calls to rdma_create_qp must use a fresh copy
of ib_qp_init_attr each time.
This fix is appropriate for kernels dating back to late 2007, though
it will have to be adapted, as the connect code has changed over the
years.
Reported-by: Eli Dorfman <eli@vastdata.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that the re-establishment delay does not grow exponentially
on each good reconnect. This probably should have been part of
commit 675dd90ad0 ("xprtrdma: Modernize ops->connect").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The optimization done in "xprtrdma: Simplify rpcrdma_mr_pop" was a
bit too optimistic. MRs left over after a reconnect still need to
be recycled, not added back to the free list, since they could be
in flight or actually fully registered.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Micro-optimization: In rpcrdma_post_recvs, since commit e340c2d6ef
("xprtrdma: Reduce the doorbell rate (Receive)"), the common case is
to return without doing anything. Found with perf.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Micro-optimization: Save the cost of three function calls during
transport header encoding.
These were "noinline" before to generate more meaningful call stacks
during debugging, but this code is now pretty stable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For the moment the returned value just happens to be correct because
the current backchannel server implementation does not vary the
number of credits it offers. The spec does permit this value to
change during the lifetime of a connection, however.
The actual maximum is fixed for all RPC/RDMA transports, because
each transport instance has to pre-allocate the resources for
processing BC requests. That's the value that should be returned.
Fixes: 7402a4fedc ("SUNRPC: Fix up backchannel slot table ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: The function name should match the documenting comment.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcrdma_rep objects are removed from their free list by only a
single thread: the Receive completion handler. Thus that free list
can be converted to an llist, where a single-threaded consumer and
a multi-threaded producer (rpcrdma_buffer_put) can both access the
llist without the need for any serialization.
This eliminates spin lock contention between the Receive completion
handler and rpcrdma_buffer_get, and makes the rep consumer wait-
free.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Now that the free list is used sparingly, get rid of the
separate spin lock protecting it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Instead of a globally-contended MR free list, cache MRs in each
rpcrdma_req as they are released. This means acquiring and releasing
an MR will be lock-free in the common case, even outside the
transport send lock.
The original idea of per-rpcrdma_req MR free lists was suggested by
Shirley Ma <shirley.ma@oracle.com> several years ago. I just now
figured out how to make that idea work with on-demand MR allocation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Probably would be good to also pass GFP flags to ib_alloc_mr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Refactor: Retrieve an MR and handle error recovery entirely in
rpc_rdma.c, as this is not a device-specific function.
Note that since commit 89f90fe1ad ("SUNRPC: Allow calls to
xprt_transmit() to drain the entire transmit queue"), the
xprt_transmit function handles the cond_resched. The transport no
longer has to do this itself.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up. There is only one remaining rpcrdma_mr_put call site, and
it can be directly replaced with unmap_and_put because mr->mr_dir is
set to DMA_NONE just before the call.
Now all the call sites do a DMA unmap, and we can just rename
mr_unmap_and_put to mr_put, which nicely matches mr_get.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: rpcrdma_mr_pop call sites check if the list is empty
first. Let's replace the list_empty with less costly logic.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 48be539dd4 ("xprtrdma: Introduce ->alloc_slot call-out for
xprtrdma") added a separate alloc_slot and free_slot to the RPC/RDMA
transport. Later, commit 75891f502f ("SUNRPC: Support for
congestion control when queuing is enabled") modified the generic
alloc/free_slot methods, but neglected the methods in xprtrdma.
Found via code review.
Fixes: 75891f502f ("SUNRPC: Support for congestion control ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: There are other "all" list heads. For code clarity
distinguish this one as for use only for MRs by renaming it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make the field name the same for all trace points that handle
pointers to struct rpcrdma_rep. That makes it easy to grep for
matching rep points in trace output.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Although I haven't seen any performance results that justify it,
I've received several complaints that NFS/RDMA no longer supports
a maximum rsize and wsize of 1MB. These days it is somewhat smaller.
To simplify the logic that determines whether a chunk list is
necessary, the implementation uses a fixed maximum size of the
transport header. Currently that maximum size is 256 bytes, one
quarter of the default inline threshold size for RPC/RDMA v1.
Since commit a78868497c ("xprtrdma: Reduce max_frwr_depth"), the
size of chunks is also smaller to take advantage of inline page
lists in device internal MR data structures.
The combination of these two design choices has reduced the maximum
NFS rsize and wsize that can be used for most RNIC/HCAs. Increasing
the maximum transport header size and the maximum number of RDMA
segments it can contain increases the negotiated maximum rsize/wsize
on common RNIC/HCAs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 302d3deb20 ("xprtrdma: Prevent inline overflow") added this
calculation back in 2016, but got it wrong. I tested only the lower
bound, which is why there is a max_t there. The upper bound should be
rounded up too.
Now, when using DIV_ROUND_UP, that takes care of the lower bound as
well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Comment was made obsolete by commit 8cec3dba76 ("xprtrdma:
rpcrdma_regbuf alignment").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Things have changed since this comment was written. In particular,
the reworking of connection closing, on-demand creation of MRs, and
the removal of fr_state all mean that deferring MR recovery to
frwr_map is no longer needed. The description is obsolete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Use a wait-free mechanism for managing the svc_rdma_recv_ctxts free
list. Subsequently, sc_recv_lock can be eliminated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: the system workqueue will work just as well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Send and Receive completion is handled on a single CPU selected at
the time each Completion Queue is allocated. Typically this is when
an initiator instantiates an RDMA transport, or when a target
accepts an RDMA connection.
Some ULPs cannot open a connection per CPU to spread completion
workload across available CPUs and MSI vectors. For such ULPs,
provide an API that allows the RDMA core to select a completion
vector based on the device's complement of available comp_vecs.
ULPs that invoke ib_alloc_cq() with only comp_vector 0 are converted
to use the new API so that their completion workloads interfere less
with each other.
Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Cc: <linux-cifs@vger.kernel.org>
Cc: <v9fs-developer@lists.sourceforge.net>
Link: https://lore.kernel.org/r/20190729171923.13428.52555.stgit@manet.1015granger.net
Signed-off-by: Doug Ledford <dledford@redhat.com>
Merge yet more updates from Andrew Morton:
"The rest of MM and a kernel-wide procfs cleanup.
Summary of the more significant patches:
- Patch series "mm/memory_hotplug: Factor out memory block
devicehandling", v3. David Hildenbrand.
Some spring-cleaning of the memory hotplug code, notably in
drivers/base/memory.c
- "mm: thp: fix false negative of shmem vma's THP eligibility". Yang
Shi.
Fix /proc/pid/smaps output for THP pages used in shmem.
- "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit.
Bugfix and speedup for kernel/resource.c
- Patch series "mm: Further memory block device cleanups", David
Hildenbrand.
More spring-cleaning of the memory hotplug code.
- Patch series "mm: Sub-section memory hotplug support". Dan
Williams.
Generalise the memory hotplug code so that pmem can use it more
completely. Then remove the hacks from the libnvdimm code which
were there to work around the memory-hotplug code's constraints.
- "proc/sysctl: add shared variables for range check", Matteo Croce.
We have about 250 instances of
int zero;
...
.extra1 = &zero,
in the tree. This is a tree-wide sweep to make all those private
"zero"s and "one"s use global variables.
Alas, it isn't practical to make those two global integers const"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits)
proc/sysctl: add shared variables for range check
mm: migrate: remove unused mode argument
mm/sparsemem: cleanup 'section number' data types
libnvdimm/pfn: stop padding pmem namespaces to section alignment
libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
mm/devm_memremap_pages: enable sub-section remap
mm: document ZONE_DEVICE memory-model implications
mm/sparsemem: support sub-section hotplug
mm/sparsemem: prepare for sub-section ranges
mm: kill is_dev_zone() helper
mm/hotplug: kill is_dev_zone() usage in __remove_pages()
mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
mm/sparsemem: add helpers track active portions of a section at boot
mm/sparsemem: introduce a SECTION_IS_EARLY flag
mm/sparsemem: introduce struct mem_section_usage
drivers/base/memory.c: get rid of find_memory_block_hinted()
mm/memory_hotplug: move and simplify walk_memory_blocks()
mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
mm: make register_mem_sect_under_node() static
...
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.
On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.
The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:
$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248
Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.
This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:
# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%
[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Highlights include:
Stable fixes:
- SUNRPC: Ensure bvecs are re-synced when we re-encode the RPC request
- Fix an Oops in ff_layout_track_ds_error due to a PTR_ERR() dereference
- Revert buggy NFS readdirplus optimisation
- NFSv4: Handle the special Linux file open access mode
- pnfs: Fix a problem where we gratuitously start doing I/O through the MDS
Features:
- Allow NFS client to set up multiple TCP connections to the server using
a new 'nconnect=X' mount option. Queue length is used to balance load.
- Enhance statistics reporting to report on all transports when using
multiple connections.
- Speed up SUNRPC by removing bh-safe spinlocks
- Add a mechanism to allow NFSv4 to request that containers set a
unique per-host identifier for when the hostname is not set.
- Ensure NFSv4 updates the lease_time after a clientid update
Bugfixes and cleanup:
- Fix use-after-free in rpcrdma_post_recvs
- Fix a memory leak when nfs_match_client() is interrupted
- Fix buggy file access checking in NFSv4 open for execute
- disable unsupported client side deduplication
- Fix spurious client disconnections
- Fix occasional RDMA transport deadlock
- Various RDMA cleanups
- Various tracepoint fixes
- Fix the TCP callback channel to guarantee the server can actually send
the number of callback requests that was negotiated at mount time.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl0wz/EACgkQZwvnipYK
APJUrQ/9FpbWTaI/H0BkBf+/iIDAwuGozldDDRkHpHvykvcbNSFPNk5CA4fV/lCq
rI0tD9ijj0KJhAjS1RWLb9oXMUm/MDIIcZNvsbOiIfGftzGnxAz022KS2JyV+Xk6
ul7sCQ8QmbQJzfZk+aFZxl5slS9i5nM+Sa//k5SK3JghOlLuYkW50OlmE6ME8rKP
sLWD2quLh1J5mjG+B9ml4YWozNi0elFN6ZqdzO/SjIhIG7rn2Hkr0glWcEprdiS7
qHoF6dIn80oR37YNrcbBtEGqBnMnWkpGAGq2Hgf5I2YtO+xfEgMqqs9FNIcahtQG
SGa1bnpcys9+fG4fnwzgOgYkvGzKdi/bIYc8lhBowl+76z+6VgjwlwT3wRtnWfO9
288LZDMMfOSYoDkbA9iGtoQRviIsoGMoeH/r5tZe7U1W35h60NGPVhH9sP9/Gc2s
Va1xBYK6PaIg+UxeghJmCnNK3zANyEI7AtDVvn0wsJuDcgXO5/PAANenc6/Fl6nf
U9XB4wmgaM2jN5uqLYT3bJAaAPAXTOayVnnd9oDYb87/31sYM6iwzSPkmm5+qIsi
0Ftks4PSjpjTTrmUvs3CL+ll3Y9bWUd58qlooGwi6u1DnXDQDzbIoRQBezNFJ+Lc
MahUVY7w9ATyJFJRvbscO7mRS8qTmr4qP91bAOsYQ0P2Yo02L7k=
=eR7d
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- SUNRPC: Ensure bvecs are re-synced when we re-encode the RPC
request
- Fix an Oops in ff_layout_track_ds_error due to a PTR_ERR()
dereference
- Revert buggy NFS readdirplus optimisation
- NFSv4: Handle the special Linux file open access mode
- pnfs: Fix a problem where we gratuitously start doing I/O through
the MDS
Features:
- Allow NFS client to set up multiple TCP connections to the server
using a new 'nconnect=X' mount option. Queue length is used to
balance load.
- Enhance statistics reporting to report on all transports when using
multiple connections.
- Speed up SUNRPC by removing bh-safe spinlocks
- Add a mechanism to allow NFSv4 to request that containers set a
unique per-host identifier for when the hostname is not set.
- Ensure NFSv4 updates the lease_time after a clientid update
Bugfixes and cleanup:
- Fix use-after-free in rpcrdma_post_recvs
- Fix a memory leak when nfs_match_client() is interrupted
- Fix buggy file access checking in NFSv4 open for execute
- disable unsupported client side deduplication
- Fix spurious client disconnections
- Fix occasional RDMA transport deadlock
- Various RDMA cleanups
- Various tracepoint fixes
- Fix the TCP callback channel to guarantee the server can actually
send the number of callback requests that was negotiated at mount
time"
* tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
pnfs/flexfiles: Add tracepoints for detecting pnfs fallback to MDS
pnfs: Fix a problem where we gratuitously start doing I/O through the MDS
SUNRPC: Optimise transport balancing code
SUNRPC: Ensure the bvecs are reset when we re-encode the RPC request
pnfs/flexfiles: Fix PTR_ERR() dereferences in ff_layout_track_ds_error
NFSv4: Don't use the zero stateid with layoutget
SUNRPC: Fix up backchannel slot table accounting
SUNRPC: Fix initialisation of struct rpc_xprt_switch
SUNRPC: Skip zero-refcount transports
SUNRPC: Replace division by multiplication in calculation of queue length
NFSv4: Validate the stateid before applying it to state recovery
nfs4.0: Refetch lease_time after clientid update
nfs4: Rename nfs41_setup_state_renewal
nfs4: Make nfs4_proc_get_lease_time available for nfs4.0
nfs: Fix copy-and-paste error in debug message
NFS: Replace 16 seq_printf() calls by seq_puts()
NFS: Use seq_putc() in nfs_show_stats()
Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
SUNRPC: Fix transport accounting when caller specifies an rpc_xprt
NFS: Record task, client ID, and XID in xdr_status trace points
...
Add a per-transport maximum limit in the socket case, and add
helpers to allow the NFSv4 code to discover that limit.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
NFSoRDMA client updates for 5.3
New features:
- Add a way to place MRs back on the free list
- Reduce context switching
- Add new trace events
Bugfixes and cleanups:
- Fix a BUG when tracing is enabled with NFSv4.1
- Fix a use-after-free in rpcrdma_post_recvs
- Replace use of xdr_stream_pos in rpcrdma_marshal_req
- Fix occasional transport deadlock
- Fix show_nfs_errors macros, other tracing improvements
- Remove RPCRDMA_REQ_F_PENDING and fr_state
- Various simplifications and refactors
This topic branch covers a fundamental change in how our sg lists are
allocated to make mq more efficient by reducing the size of the
preallocated sg list. This necessitates a large number of driver
changes because the previous guarantee that if a driver specified
SG_ALL as the size of its scatter list, it would get a non-chained
list and didn't need to bother with scatterlist iterators is now
broken and every driver *must* use scatterlist iterators.
This was broken out as a separate topic because we need to convert all
the drivers before pulling the trigger and unconverted drivers kept
being found, necessitating a rebase.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXSTzzCYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishZB+AP9I8j/s
wWfg0Z3WNuf4D5I3rH4x1J3cQTqPJed+RjwgcQEA1gZvtOTg1ZEn/CYMVnaB92x0
t6MZSchIaFXeqfD+E7U=
=cv8o
-----END PGP SIGNATURE-----
Merge tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI scatter-gather list updates from James Bottomley:
"This topic branch covers a fundamental change in how our sg lists are
allocated to make mq more efficient by reducing the size of the
preallocated sg list.
This necessitates a large number of driver changes because the
previous guarantee that if a driver specified SG_ALL as the size of
its scatter list, it would get a non-chained list and didn't need to
bother with scatterlist iterators is now broken and every driver
*must* use scatterlist iterators.
This was broken out as a separate topic because we need to convert all
the drivers before pulling the trigger and unconverted drivers kept
being found, necessitating a rebase"
* tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (21 commits)
scsi: core: don't preallocate small SGL in case of NO_SG_CHAIN
scsi: lib/sg_pool.c: clear 'first_chunk' in case of no preallocation
scsi: core: avoid preallocating big SGL for data
scsi: core: avoid preallocating big SGL for protection information
scsi: lib/sg_pool.c: improve APIs for allocating sg pool
scsi: esp: use sg helper to iterate over scatterlist
scsi: NCR5380: use sg helper to iterate over scatterlist
scsi: wd33c93: use sg helper to iterate over scatterlist
scsi: ppa: use sg helper to iterate over scatterlist
scsi: pcmcia: nsp_cs: use sg helper to iterate over scatterlist
scsi: imm: use sg helper to iterate over scatterlist
scsi: aha152x: use sg helper to iterate over scatterlist
scsi: s390: zfcp_fc: use sg helper to iterate over scatterlist
scsi: staging: unisys: visorhba: use sg helper to iterate over scatterlist
scsi: usb: image: microtek: use sg helper to iterate over scatterlist
scsi: pmcraid: use sg helper to iterate over scatterlist
scsi: ipr: use sg helper to iterate over scatterlist
scsi: mvumi: use sg helper to iterate over scatterlist
scsi: lpfc: use sg helper to iterate over scatterlist
scsi: advansys: use sg helper to iterate over scatterlist
...
Adapt and apply changes that were made to the TCP socket connect
code. See the following commits for details on the purpose of
these changes:
Commit 7196dbb02e ("SUNRPC: Allow changing of the TCP timeout parameters on the fly")
Commit 3851f1cdb2 ("SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout")
Commit 02910177ae ("SUNRPC: Fix reconnection timeouts")
Some common transport code is moved to xprt.c to satisfy the code
duplication police.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
There is only one remaining function, rpcrdma_buffer_put(), that
uses this field. Its caller can supply a pointer to the correct
rpcrdma_buffer, enabling the removal of an 8-byte pointer field
from a frequently-allocated shared data structure.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
Move the "not present" case into the individual chunk encoders. This
improves code organization and readability.
The reason for the original organization was to optimize for the
case where there there are no chunks. The optimization turned out to
be inconsequential, so let's err on the side of code readability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rb_lock is contended between rpcrdma_buffer_create,
rpcrdma_buffer_put, and rpcrdma_post_recvs.
Commit e340c2d6ef ("xprtrdma: Reduce the doorbell rate (Receive)")
causes rpcrdma_post_recvs to take the rb_lock repeatedly when it
determines more Receives are needed. Streamline this code path so
it takes the lock just once in most cases to build the Receive
chain that is about to be posted.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
Commit 7c8d9e7c88 ("xprtrdma: Move Receive posting to Receive
handler") reduced the number of rpcrdma_rep_create call sites to
one. After that commit, the backchannel code no longer invokes it.
Therefore the free list logic added by commit d698c4a02e
("xprtrdma: Fix backchannel allocation of extra rpcrdma_reps") is
no longer necessary, and in fact adds some extra overhead that we
can do without.
Simply post any newly created reps. They will get added back to
the rb_recv_bufs list when they subsequently complete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Eliminate a context switch in the path that handles RPC wake-ups
when a Receive completion has to wait for a Send completion.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Since commit ba69cd122e ("xprtrdma: Remove support for FMR memory
registration"), FRWR is the only supported memory registration mode.
We can take advantage of the asynchronous nature of FRWR's LOCAL_INV
Work Requests to get rid of the completion wait by having the
LOCAL_INV completion handler take care of DMA unmapping MRs and
waking the upper layer RPC waiter.
This eliminates two context switches when local invalidation is
necessary. As a side benefit, we will no longer need the per-xprt
deferred completion work queue.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When a marshal operation fails, any MRs that were already set up for
that request are recycled. Recycling releases MRs and creates new
ones, which is expensive.
Since commit f287762308 ("xprtrdma: Chain Send to FastReg WRs")
was merged, recycling FRWRs is unnecessary. This is because before
that commit, frwr_map had already posted FAST_REG Work Requests,
so ownership of the MRs had already been passed to the NIC and thus
dealing with them had to be delayed until they completed.
Since that commit, however, FAST_REG WRs are posted at the same time
as the Send WR. This means that if marshaling fails, we are certain
the MRs are safe to simply unmap and place back on the free list
because neither the Send nor the FAST_REG WRs have been posted yet.
The kernel still has ownership of the MRs at this point.
This reduces the total number of MRs that the xprt has to create
under heavy workloads and makes the marshaling logic less brittle.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Now that both the Send and Receive completions are handled in
process context, it is safe to DMA unmap and return MRs to the
free or recycle lists directly in the completion handlers.
Doing this means rpcrdma_frwr no longer needs to track the state of
each MR, meaning that a VALID or FLUSHED MR can no longer appear on
an xprt's MR free list. Thus there is no longer a need to track the
MR's registration state in rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 9590d083c1 ("xprtrdma: Use xprt_pin_rqst in
rpcrdma_reply_handler") pins incoming RPC/RDMA replies so they
can be left in the pending requests queue while they are being
processed without introducing a race between ->buf_free and the
transport's reply handler. Therefore RPCRDMA_REQ_F_PENDING is no
longer necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Under high I/O workloads, I've noticed that an RPC/RDMA transport
occasionally deadlocks (IOPS goes to zero, and doesn't recover).
Diagnosis shows that the sendctx queue is empty, but when sendctxs
are returned to the queue, the xprt_write_space wake-up never
occurs. The wake-up logic in rpcrdma_sendctx_put_locked is racy.
I noticed that both EMPTY_SCQ and XPRT_WRITE_SPACE are implemented
via an atomic bit. Just one of those is sufficient. Removing
EMPTY_SCQ in favor of the generic bit mechanism makes the deadlock
un-reproducible.
Without EMPTY_SCQ, rpcrdma_buffer::rb_flags is no longer used and
is therefore removed.
Unfortunately this patch does not apply cleanly to stable. If
needed, someone will have to port it and test it.
Fixes: 2fad659209 ("xprtrdma: Wait on empty sendctx queue")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This is a latent bug. xdr_stream_pos works by subtracting
xdr_stream::nwords from xdr_buf::len. But xdr_stream::nwords is not
initialized by xdr_init_encode().
It works today only because all fields in rpcrdma_req::rl_stream
are initialized to zero by rpcrdma_req_create, making the
subtraction in xdr_stream_pos always a no-op.
I found this issue via code inspection. It was introduced by commit
39f4cd9e99 ("xprtrdma: Harden chunk list encoding against send
buffer overflow"), but the code has changed enough since then that
this fix can't be automatically applied to stable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
failures on high-memory machines and fixing the DRC over RDMA.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl0fiP4VHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG++dwP/RkuKV6sjmopr5/SNK334UDGbpAk
h+VWAJSrnrLDqk2Ezwz4cO8XrPxMEiQQdKtiyIM51oshq9EN0u0gdOS7ycdz9mAm
qm1WgekRH3vNiNYU+im2r7CXXrBUYeggi1clbOoJnEwsKxV0qFG74OxO8fB5gNMP
Jeq46nofwSAjxLQBwTkHJhs0cV7rmJhq6mVWJ4lzD6JTzMH7FqO0zoJJHfyP5xb+
SXSheqWK4eTKhvPJR0JwyiUBUXYzNZyDoNlyRCSVylfxha3cTxF0GeG1Pm2uS+sm
V8QOnqud+4cF5Qa2zAZ2T/w7dgsfouAODQOi/gzDwE0EM+FojbOnk0CJwL7wuzB3
flkmWOMER0RV92z7gWqh5JDQFoHeVkldZfFTrYfVWIcPV+pGLiRayzt+dlSbpaYj
09jMlBLHXxHwCqPT5u8GFKveMNluYyIoKc3s38eojX2u3eg9HNsXoCfmbC2RGaZ4
iT2E5isl7donclTHDKEU7RkWnaboSQoB+oodMWH8TN7p9FfgpsaObWTebrsbvvBN
DMrdc+nJ+x78krI9XKpSQOmPpJH9siaIFn6nVq5oLNjytaH+UrrArvkLMKhuSW6L
8u5fU3SvL0Eriuz7EtgZwosy+VPvFpXgQVMdJ+z/cm32mOeFgcz4BNEMaQuFJVaN
fbY6fM46Xngo0A3x
=f2gw
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Two more quick bugfixes for nfsd: fixing a regression causing mount
failures on high-memory machines and fixing the DRC over RDMA"
* tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux:
nfsd: Fix overflow causing non-working mounts on 1 TB machines
svcrdma: Ignore source port when computing DRC hash
Dereference wr->next /before/ the memory backing wr has been
released. This issue was found by code inspection. It is not
expected to be a significant problem because it is in an error
path that is almost never executed.
Fixes: 7c8d9e7c88 ("xprtrdma: Move Receive posting to ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
sg_alloc_table_chained() currently allows the caller to provide one
preallocated SGL and returns if the requested number isn't bigger than
size of that SGL. This is used to inline an SGL for an IO request.
However, scattergather code only allows that size of the 1st preallocated
SGL to be SG_CHUNK_SIZE(128). This means a substantial amount of memory
(4KB) is claimed for the SGL for each IO request. If the I/O is small, it
would be prudent to allocate a smaller SGL.
Introduce an extra parameter to sg_alloc_table_chained() and
sg_free_table_chained() for specifying size of the preallocated SGL.
Both __sg_free_table() and __sg_alloc_table() assume that each SGL has the
same size except for the last one. Change the code to allow both functions
to accept a variable size for the 1st preallocated SGL.
[mkp: attempted to clarify commit desc]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: netdev@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The DRC appears to be effectively empty after an RPC/RDMA transport
reconnect. The problem is that each connection uses a different
source port, which defeats the DRC hash.
Clients always have to disconnect before they send retransmissions
to reset the connection's credit accounting, thus every retransmit
on NFS/RDMA will miss the DRC.
An NFS/RDMA client's IP source port is meaningless for RDMA
transports. The transport layer typically sets the source port value
on the connection to a random ephemeral port. The server already
ignores it for the "secure port" check. See commit 16e4d93f6d
("NFSD: Ignore client's source port on RDMA transports").
The Linux NFS server's DRC resolves XID collisions from the same
source IP address by using the checksum of the first 200 bytes of
the RPC call header.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The comment hasn't been accurate for several years.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit e1ede312f1 ("xprtrdma: Fix helper that drains the
transport") replaced the ib_drain_qp() call, so update documenting
comments to reflect current operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: rely on the trace points instead.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
Move the remaining field in rpcrdma_create_data_internal so the
structure can be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
The inline settings are actually a characteristic of the endpoint,
and not related to the device. They are also modified after the
transport instance is created, so they do not belong in the cdata
structure either.
Lastly, let's use names that are more natural to RDMA than to NFS:
inline_write -> inline_send and inline_read -> inline_recv. The
/proc files retain their names to avoid breaking user space.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
xprt_rdma_max_inline_{read,write} cannot be set to large values
by virtue of proc_dointvec_minmax. The current maximum is
RPCRDMA_MAX_INLINE, which is much smaller than RPCRDMA_MAX_SEGS *
PAGE_SIZE.
The .rsize and .wsize fields are otherwise unused in the current
code base, and thus can be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up.
Since commit 54cbd6b0c6 ("xprtrdma: Delay DMA mapping Send and
Receive buffers"), a pointer to the device is now saved in each
regbuf when it is DMA mapped.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Instead of using a fixed number, allow the amount of Send completion
batching to vary based on the client's maximum credit limit.
- A larger default gives a small boost to IOPS throughput
- Reducing it based on max_requests gives a safe result when the
max credit limit is cranked down (eg. when the device has a small
max_qp_wr).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Minor clean-ups I've stumbled on since sendctx was merged last year.
In particular, making Send completion processing more efficient
appears to have a measurable impact on IOPS throughput.
Note: test_and_clear_bit() returns a value, thus an explicit memory
barrier is not necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Record an event when rpcrdma_marshal_req returns a non-zero return
value to help track down why an xprt close might have occurred.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reflects the change introduced in commit 067c469671 ("NFSv4.1:
Bump the default callback session slot count to 16").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The Receive handler runs in process context, thus can use on-demand
GFP_KERNEL allocations instead of pre-allocation.
This makes the xprtrdma backchannel independent of the number of
backchannel session slots provisioned by the Upper Layer protocol.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For code legibility, clean up the function names to be consistent
with the pattern: "rpcrdma" _ object-type _ action
Also rpcrdma_regbuf_alloc and rpcrdma_regbuf_free no longer have any
callers outside of verbs.c, and can thus be made static.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up by providing an API to do this common task.
At this point, the difference between rpcrdma_get_sendbuf and
rpcrdma_get_recvbuf has become tiny. These can be collapsed into a
single helper.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Allocating an rpcrdma_req's regbufs at xprt create time enables
a pair of micro-optimizations:
First, if these regbufs are always there, we can eliminate two
conditional branches from the hot xprt_rdma_allocate path.
Second, by allocating a 1KB buffer, it places a lower bound on the
size of these buffers, without adding yet another conditional
branch. The lower bound reduces the number of hardway re-
allocations. In fact, for some workloads it completely eliminates
hardway allocations.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Allocate the struct rpcrdma_regbuf separately from the I/O buffer
to better guarantee the alignment of the I/O buffer and eliminate
the wasted space between the rpcrdma_regbuf metadata and the buffer
itself.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For code legibility, clean up the function names to be consistent
with the pattern: "rpcrdma" _ object-type _ action
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>