Currently svc_rdma_sendto() migrates xdr_buf pages into a separate
page list and NULLs out a bunch of entries in rq_pages while the
pages are under I/O. The Send completion handler then frees those
pages later.
Instead, let's wait for the Send completion, then handle page
releasing in the nfsd thread. I'd like to avoid the cost of 250+
put_page() calls in the Send completion handler, which is single-
threaded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This brings it in line with the regular tcp backchannel, which also has
all those timeouts disabled.
Prevents the backchannel from timing out, getting some async operations
like server side copying getting stuck indefinitely on the client side.
Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org>
Fixes: 5d252f90a8 ("svcrdma: Add class for RDMA backwards direction transport")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
During the final stages of publication of RFC 8167, reviewers
requested that we use the term "reverse direction" rather than
"backwards direction". Update comments to reflect this preference.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When counting the number of SGEs needed to construct a Send request,
do not count result payloads. And, when copying the Reply message
into the pull-up buffer, result payloads are not to be copied to the
Send buffer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
First, refactor: Dereference the svc_rdma_send_ctxt inside
svc_rdma_send() instead of at every call site.
Then, it can be passed into trace_svcrdma_post_send() to get the
proper completion ID.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Way back when I was writing the RPC/RDMA server-side backchannel
code, I misread the TCP backchannel reply handler logic. When
svc_tcp_recvfrom() successfully receives a backchannel reply, it
does not return -EAGAIN. It sets XPT_DATA and returns zero.
Update svc_rdma_recvfrom() to return zero. Here, XPT_DATA doesn't
need to be set again: it is set whenever a new message is received,
behind a spin lock in a single threaded context.
Also, if handling the cb reply is not successful, the message is
simply dropped. There's no special message framing to deal with as
there is in the TCP case.
Now that the handle_bc_reply() return value is ignored, I've removed
the dprintk call sites in the error exit of handle_bc_reply() in
favor of trace points in other areas that already report the error
cases.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It appears that the RPC/RDMA transport does not need serialization
of calls to its xpo_sendto method. Move the mutex into the socket
methods that still need that serialization.
Tail latencies are unambiguously better with this patch applied.
fio randrw 8KB 70/30 on NFSv3, smaller numbers are better:
clat percentiles (usec):
With xpt_mutex:
r | 99.99th=[ 8848]
w | 99.99th=[ 9634]
Without xpt_mutex:
r | 99.99th=[ 8586]
w | 99.99th=[ 8979]
Serializing the construction of RPC/RDMA transport headers is not
really necessary at this point, because the Linux NFS server
implementation never changes its credit grant on a connection. If
that should change, then svc_rdma_sendto will need to serialize
access to the transport's credit grant fields.
Reported-by: kbuild test robot <lkp@intel.com>
[ cel: fix uninitialized variable warning ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently, after the forward channel connection goes away,
backchannel operations are causing soft lockups on the server
because call_transmit_status's SOFTCONN logic ignores ENOTCONN.
Such backchannel Calls are aggressively retried until the client
reconnects.
Backchannel Calls should use RPC_TASK_NOCONNECT rather than
RPC_TASK_SOFTCONN. If there is no forward connection, the server is
not capable of establishing a connection back to the client, thus
that backchannel request should fail before the server attempts to
send it. Commit 58255a4e3c ("NFSD: NFSv4 callback client should
use RPC_TASK_SOFTCONN") was merged several years before
RPC_TASK_NOCONNECT was available.
Because setup_callback_client() explicitly sets NOPING, the NFSv4.0
callback connection depends on the first callback RPC to initiate
a connection to the client. Thus NFSv4.0 needs to continue to use
RPC_TASK_SOFTCONN.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org> # v4.20+
Performance optimization: Avoid syncing the transport buffer twice
when Reply buffer pull-up is necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Same idea as the receive-side changes I did a while back: use
xdr_stream helpers rather than open-coding the XDR chunk list
encoders. This builds the Reply transport header from beginning to
end without backtracking.
As additional clean-ups, fill in documenting comments for the XDR
encoders and sprinkle some trace points in the new encoding
functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Preparing for subsequent patches, no behavior change expected.
Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into those lower-level functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I noticed that for callback requests, the reported backlog latency
is always zero, and the rtt value is crazy big. The problem was that
rqst->rq_xtime is never set for backchannel requests.
Fixes: 78215759e2 ("SUNRPC: Make RTT measurement more ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Convert the transport callback to actually put the request to sleep
instead of just setting a timeout. This is in preparation for
rpc_sleep_on_timeout().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
tsh_size was added to accommodate transports that send a pre-amble
before each RPC message. However, this assumes the pre-amble is
fixed in size, which isn't true for some transports. That makes
tsh_size not very generic.
Also I'd like to make the estimation of RPC send and receive
buffer sizes more precise. tsh_size doesn't currently appear to be
accounted for at all by call_allocate.
Therefore let's just remove the tsh_size concept, and make the only
transports that have a non-zero tsh_size employ a direct approach.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Transport disconnect processing does a "wake pending tasks" at
various points.
Suppose an RPC Reply is being processed. The RPC task that Reply
goes with is waiting on the pending queue. If a disconnect wake-up
happens before reply processing is done, that reply, even if it is
good, is thrown away, and the RPC has to be sent again.
This window apparently does not exist for socket transports because
there is a lock held while a reply is being received which prevents
the wake-up call until after reply processing is done.
To resolve this, all RPC replies being processed on an RPC-over-RDMA
transport have to complete before pending tasks are awoken due to a
transport disconnect.
Callers that already hold the transport write lock may invoke
->ops->close directly. Others use a generic helper that schedules
a close when the write lock can be taken safely.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
already supported COPY, by copying a limited amount of data and then
returning a short result, letting the client resend. The asynchronous
protocol should offer better performance at the expense of some
complexity.
The other highlight is Trond's work to convert the duplicate reply cache
to a red-black tree, and to move it and some other server caches to RCU.
(Previously these have meant taking global spinlocks on every RPC.)
Otherwise, some RDMA work and miscellaneous bugfixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJb2KWzAAoJECebzXlCjuG+gcQP/3DldB86CFxgSFx0t+h+s+TV
CdYJDPyLyRkEMiD+4dCPPuhueve+j5BPHVsDbn98FTWrEn131NMIs6uhU/VGTtAU
6a8f/ExtZ5U7s39MJCzlk2ozVElBc3QPp7p3p9NKn0Wi0PXbVgjuIqR5o2vwa8Si
KOVdLm6ylfav/HTH8DO6zFPJRsTgTwcJOivXXshjpglMKAcw8AuqSsGgBrDeGpgU
u91Vi0EM1vt96+CA6a01mTgC/sFX7EqGvxUUHOrKWf5cIjnpT3FDvouYPxi+GH8Z
SIDlaMQyXF5m4m6MhELNTP4v97XAHyPJtvLkEe5lggTyABPiA2heo9e8onysWkzV
1v8OZHCVFa1UL34mDlnFxbFCYVr7FFKMGjTBR/ntinobPfAbWRCO1Hdd+bBGPDD4
byf7ctDVp7KQ2bSatIdlYavikuGDHWFDZHzPHlqkD3gpIZSNvhe26sV3NZqIFlXO
cMUega2Y5mXmULauHhxAcNGtDK7dF5hHoMWKJy0DNxiyDiDLylwDOIfwt1De3Q7V
ycd/wUytUS2LkAhyS2mvoDK6eXTBAeQwzmXAqveh6rewwO83HC/t9mtKBBDomvKG
xRpRPmmbj9ijbwkilEBmijjR47wrihmEVIFahznEerZ+//QOfVVOB0MNtzIyU9/k
CnP1ZNvOs3LR1pxxwFa8
=TTo0
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Olga added support for the NFSv4.2 asynchronous copy protocol. We
already supported COPY, by copying a limited amount of data and then
returning a short result, letting the client resend. The asynchronous
protocol should offer better performance at the expense of some
complexity.
The other highlight is Trond's work to convert the duplicate reply
cache to a red-black tree, and to move it and some other server caches
to RCU. (Previously these have meant taking global spinlocks on every
RPC)
Otherwise, some RDMA work and miscellaneous bugfixes"
* tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linux: (30 commits)
lockd: fix access beyond unterminated strings in prints
nfsd: Fix an Oops in free_session()
nfsd: correctly decrement odstate refcount in error path
svcrdma: Increase the default connection credit limit
svcrdma: Remove try_module_get from backchannel
svcrdma: Remove ->release_rqst call in bc reply handler
svcrdma: Reduce max_send_sges
nfsd: fix fall-through annotations
knfsd: Improve lookup performance in the duplicate reply cache using an rbtree
knfsd: Further simplify the cache lookup
knfsd: Simplify NFS duplicate replay cache
knfsd: Remove dead code from nfsd_cache_lookup
SUNRPC: Simplify TCP receive code
SUNRPC: Replace the cache_detail->hash_lock with a regular spinlock
SUNRPC: Remove non-RCU protected lookup
NFS: Fix up a typo in nfs_dns_ent_put
NFS: Lockless DNS lookups
knfsd: Lockless lookup of NFSv4 identities.
SUNRPC: Lockless server RPCSEC_GSS context lookup
knfsd: Allow lockless lookups of the exports
...
Since commit ffe1f0df58 ("rpcrdma: Merge svcrdma and xprtrdma
modules into one"), the forward and backchannel components are part
of the same kernel module. A separate try_module_get() call in the
backchannel code is no longer necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Similar to a change made in the client's forward channel reply
handler: The xprt_release_rqst_cong() call is not necessary.
Also, release xprt->recv_lock when taking xprt->transport_lock
to avoid disabling and enabling BH's while holding another
spin lock.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Stable bugfixes:
- Reset credit grant properly after a disconnect
Other bugfixes and cleanups:
- xprt_release_rqst_cong is called outside of transport_lock
- Create more MRs at a time and toss out old ones during recovery
- Various improvements to the RDMA connection and disconnection code:
- Improve naming of trace events, functions, and variables
- Add documenting comments
- Fix metrics and stats reporting
- Fix a tracepoint sparse warning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlvHmcUACgkQ18tUv7Cl
QOv5Mg//ZIL92L6WqW2C+Tddr4UcPg1YphBEwGo3+TrswSRg3ncDiTQ8ycOOrmoy
7m5Oe5I1uEM0Ejqu0lh0uoxJlxRtMF0pwpnTA2Mx6bb4GLSTXjJQomKBhZ3v6owo
RQaQZTnAT+T5w1jZMuImdZ+c1zNNiFonSdPO7Er5jbdczvY6N7bg84goLoXLZkjk
cuYFbBl3DAyoUJ1usgiuCZLbMcEe0isJEtFU45dLkxxFkvNk+gO8UtA48qe0rFNg
8LQMHqhXDhHbdqLFpIRdvaanRpi8VjhCukE+Af9z/y0XNPYItWKTm0clkZ1bu/D4
/Q6gUnCU8KeSzlPqrT7nATO6L5sHlqlE9vSbJpWguvgBg9JbMDZquh0gejVqGr4t
1YbJyNh/anl5Xm56CIADEbQK3QocyDRwk9tQhlOUEwBu7rgQrU7NO+1CHgXRD94c
iNI892W9FZZQxKOkWnb3DtgpnmuQ4k9tLND/SSnqOllADpztDag+czOxtOHOxlK0
sVh4U82JtYtP9ubhKzFvTDlFv3rcjE86Nn55mAgCFk/XBDLF3kMjjZcS527YLkWY
OqDVeKin7nKv1ZfeV9msWbaKp4w2sNhgtROGQr4g1/FktRXS6b8XsTxn7anyVYzM
l8SJx66q8XFaWcp0Kwp55oOyhh16dhhel34lWi+wAlF79ypMGlE=
=SxSD
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-4.20-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
NFS RDMA client updates for Linux 4.20
Stable bugfixes:
- Reset credit grant properly after a disconnect
Other bugfixes and cleanups:
- xprt_release_rqst_cong is called outside of transport_lock
- Create more MRs at a time and toss out old ones during recovery
- Various improvements to the RDMA connection and disconnection code:
- Improve naming of trace events, functions, and variables
- Add documenting comments
- Fix metrics and stats reporting
- Fix a tracepoint sparse warning
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
On a fresh connection, an RPC/RDMA client is supposed to send only
one RPC Call until it gets a credit grant in the first RPC Reply
from the server [RFC 8166, Section 3.3.3].
There is a bug in the Linux client's credit accounting mechanism
introduced by commit e7ce710a88 ("xprtrdma: Avoid deadlock when
credit window is reset"). On connect, it simply dumps all pending
RPC Calls onto the new connection.
Servers have been tolerant of this bad behavior. Currently no server
implementation ever changes its credit grant over reconnects, and
servers always repost enough Receives before connections are fully
established.
To correct this issue, ensure that the client resets both the credit
grant _and_ the congestion window when handling a reconnect.
Fixes: e7ce710a88 ("xprtrdma: Avoid deadlock when credit ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Treat socket write space handling in the same way we now treat transport
congestion: by denying the XPRT_LOCK until the transport signals that it
has free buffer space.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we shift to using the transmit queue, then the task that holds the
write lock will not necessarily be the same as the one being transmitted.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Highlights include:
Stable fixes:
- Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message
- Fix a hang due to incorrect error returns in rpcrdma_convert_iovs()
- Revert an incorrect change to the NFSv4.1 callback channel
- Fix a bug in the NFSv4.1 sequence error handling
Features and optimisations:
- Support for piggybacking a LAYOUTGET operation to the OPEN compound
- RDMA performance enhancements to deal with transport congestion
- Add proper SPDX tags for NetApp-contributed RDMA source
- Do not request delegated file attributes (size+change) from the server
- Optimise away a GETATTR in the lookup revalidate code when doing NFSv4 OPEN
- Optimise away unnecessary lookups for rename targets
- Misc performance improvements when freeing NFSv4 delegations
Bugfixes and cleanups:
- Try to fail quickly if proto=rdma
- Clean up RDMA receive trace points
- Fix sillyrename to return the delegation when appropriate
- Misc attribute revalidation fixes
- Immediately clear the pNFS layout on a file when the server returns ESTALE
- Return NFS4ERR_DELAY when delegation/layout recalls fail due to igrab()
- Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbH8gIAAoJEA4mA3inWBJcpzYQAJYY3ykt9oLQgm/2b/D/weDe
6890M9W5nIeuZq5soWSpYsZTxqIFbGV4laG/eCTW1gUN1TitSZsoOp7kqhRHXOjq
Rv3ZvjlZsP2qv2SnzsEmhJsynfyB46d19smSTJhgQ8dnXhaZv04Wsd4krLHx0z6p
uUUis5Q1m+vL7HsFPp3iUareO/DFKeSkw2cQ2V5ksTIEiAzX7GC+Ex/KKWf82nrJ
hm7+Nq7rLf1QHJkQvsc3fYCMR4gIzEwUu6F8RyxCoAVgD6O90Hx6NbxnINaHDD4N
U0nRP5LwCyN9hbPWvwcH7Sn4ePDTos2yj2tFO5NP9btTLDVLFSGYZ2a74d9PRdAf
9jn6f6juSDwI7T6NXvkHzzkJG6Or9ABAUZo+yX5JoD6lmgOcPUJpLRy6fu7UxAuN
a5OZ7d9edYpOi0Kys8sDSIlLlxZtFkvybOMVuI3dSHsI+c0g39w8oarpqT2wXWMs
/ZtFz0FCreHhKkNtz7Z49z1UQHDv/XYM0WkcO+eaeK58RLIEE0pZHoMvPKP63lkI
nbbgHvBRAu38Jtvvu65Hpb/VpBcqNGM5hjN1cfW/BOqAPKW23s4vWVj+/1silfW/
uw0MkNrDC9endoALp/YMCcMwPvEw9Awt9y4KjMgfVgSnKwXd0HaSZ2zE6aJU3Wry
Fy2Tv0e0OH3z9Bi/LNuJ
=YWSl
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message
- Fix a hang due to incorrect error returns in rpcrdma_convert_iovs()
- Revert an incorrect change to the NFSv4.1 callback channel
- Fix a bug in the NFSv4.1 sequence error handling
Features and optimisations:
- Support for piggybacking a LAYOUTGET operation to the OPEN compound
- RDMA performance enhancements to deal with transport congestion
- Add proper SPDX tags for NetApp-contributed RDMA source
- Do not request delegated file attributes (size+change) from the
server
- Optimise away a GETATTR in the lookup revalidate code when doing
NFSv4 OPEN
- Optimise away unnecessary lookups for rename targets
- Misc performance improvements when freeing NFSv4 delegations
Bugfixes and cleanups:
- Try to fail quickly if proto=rdma
- Clean up RDMA receive trace points
- Fix sillyrename to return the delegation when appropriate
- Misc attribute revalidation fixes
- Immediately clear the pNFS layout on a file when the server returns
ESTALE
- Return NFS4ERR_DELAY when delegation/layout recalls fail due to
igrab()
- Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY"
* tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (80 commits)
skip LAYOUTRETURN if layout is invalid
NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
NFSv4: Fix a typo in nfs41_sequence_process
NFSv4: Revert commit 5f83d86cf5 ("NFSv4.x: Fix wraparound issues..")
NFSv4: Return NFS4ERR_DELAY when a layout recall fails due to igrab()
NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab()
NFSv4.0: Remove transport protocol name from non-UCS client ID
NFSv4.0: Remove cl_ipaddr from non-UCS client ID
NFSv4: Fix a compiler warning when CONFIG_NFS_V4_1 is undefined
NFS: Filter cache invalidation when holding a delegation
NFS: Ignore NFS_INO_REVAL_FORCED in nfs_check_inode_attributes()
NFS: Improve caching while holding a delegation
NFS: Fix attribute revalidation
NFS: fix up nfs_setattr_update_inode
NFSv4: Ensure the inode is clean when we set a delegation
NFSv4: Ignore NFS_INO_REVAL_FORCED in nfs4_proc_access
NFSv4: Don't ask for delegated attributes when adding a hard link
NFSv4: Don't ask for delegated attributes when revalidating the inode
NFS: Pass the inode down to the getattr() callback
NFSv4: Don't request size+change attribute if they are delegated to us
...
While sending each RPC Reply, svc_rdma_sendto allocates and DMA-
maps a separate buffer where the RPC/RDMA transport header is
constructed. The buffer is unmapped and released in the Send
completion handler. This is significant per-RPC overhead,
especially for small RPCs.
Instead, allocate and DMA-map a buffer, and cache it in each
svc_rdma_send_ctxt. This buffer and its mapping can be re-used
for each RPC, saving the cost of memory allocation and DMA
mapping.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Now that the send_wr is part of the svc_rdma_send_ctxt,
svc_rdma_post_send_wr is nearly empty.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
free list. This eliminates the overhead of calling kmalloc / kfree,
both of which grab a globally shared lock that disables interrupts.
Introduce a replacement to svc_rdma_op_ctxt's that is built
especially for the svcrdma Send path.
Subsequent patches will take advantage of this new structure by
allocating real resources which are then cached in these objects.
The allocations are freed when the transport is torn down.
I've renamed the structure so that static type checking can be used
to ensure that uses of op_ctxt and send_ctxt are not confused. As an
additional clean up, structure fields are renamed to conform with
kernel coding conventions.
Additional clean ups:
- Handle svc_rdma_send_ctxt_get allocation failure at each call
site, rather than pre-allocating and hoping we guessed correctly
- All send_ctxt_put call-sites request page freeing, so remove
the @free_pages argument
- All send_ctxt_put call-sites unmap SGEs, so fold that into
svc_rdma_send_ctxt_put
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Since there's already a svc_rdma_op_ctxt being passed
around with the running count of mapped SGEs, drop unneeded
parameters to svc_rdma_post_send_wr().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This includes:
* Posting on the Send and Receive queues
* Send, Receive, Read, and Write completion
* Connect upcalls
* QP errors
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Refactor: xprtrdma needs to have better control over when RPCs are
awoken from the backlog queue, so replace xprt_free_slot with a
transport op callout.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This change improves Receive efficiency by posting Receives only
on the same CPU that handles Receive completion. Improved latency
and throughput has been noted with this change.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
- fix a number of races in the NFSv4+ state code.
- fix some shutdown crashes in multiple-network-namespace cases.
- relax our 4.1 session limits; if you've an artificially low limit
to the number of 4.1 clients that can mount simultaneously, try
upgrading.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaEH3oAAoJECebzXlCjuG++t0P/2t7RvRUunQa4pngCmg5QbOA
rldfEd1HM1F6+4fXzN0wcxWjphUNxs19VjEaWNjThYoGGTEdSOuFhBHgK18xmHjp
Cjz5IYJ0yS7PClCxMTmz5u3gfyExPR83whmNaNK69CGvn5xu97gDntOv/06Llw4Y
nCUJrEmVcMAOHek3tOD0Rlv8eYFyfLhF6zacp+qWFIlymU118iK1Or83M7pi6j51
yVVOvxktDLzkyDq5gQD/Py3rKHikOWFMCoseOPfMnOiGF/Bp7YDzWt6HT17mwyU4
xDeICbnfqve2SwT9NChpJOYtUAPuZDiQR6G2ZtnI8/JN7ob/wls/4CbDVlzYFN4r
dLsRlEC5spQmg34j6dscOKkt1vRK9vKXTC46wEMfXZLtiDLA/uZ/J0gNh3EXqpbt
LQQZI4B2MomYPcp64i4UHHO8BqSIX+lC5otVlAW105TQvZflJ8Mhtawmpu1O3nXZ
DSUhkZrImlBmb7/ulhjyXpmNAxQLXsqb0lP5tUYR5Re+A2lyea/pMJmtBLu3fv6h
tzHqq2JL13kblqJY+Frc1zqQGI5AAyKmdTTjmljBIGHxbVwAMzk1qO+VOI/f+J21
MWNmFkEqw+Tnvwy6sIm1eUGtTWIGc6ejvMxXguAfa+QjT4iHAL3F4PkpSihzIZnm
bzHDeJ87HRWWj/ICPQ1j
=PBs+
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.15' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Lots of good bugfixes, including:
- fix a number of races in the NFSv4+ state code
- fix some shutdown crashes in multiple-network-namespace cases
- relax our 4.1 session limits; if you've an artificially low limit
to the number of 4.1 clients that can mount simultaneously, try
upgrading"
* tag 'nfsd-4.15' of git://linux-nfs.org/~bfields/linux: (22 commits)
SUNRPC: Improve ordering of transport processing
nfsd: deal with revoked delegations appropriately
svcrdma: Enqueue after setting XPT_CLOSE in completion handlers
nfsd: use nfs->ns.inum as net ID
rpc: remove some BUG()s
svcrdma: Preserve CB send buffer across retransmits
nfds: avoid gettimeofday for nfssvc_boot time
fs, nfsd: convert nfs4_file.fi_ref from atomic_t to refcount_t
fs, nfsd: convert nfs4_cntl_odstate.co_odcount from atomic_t to refcount_t
fs, nfsd: convert nfs4_stid.sc_count from atomic_t to refcount_t
lockd: double unregister of inetaddr notifiers
nfsd4: catch some false session retries
nfsd4: fix cached replies to solo SEQUENCE compounds
sunrcp: make function _svc_create_xprt static
SUNRPC: Fix tracepoint storage issues with svc_recv and svc_rqst_status
nfsd: use ARRAY_SIZE
nfsd: give out fewer session slots as limit approaches
nfsd: increase DRC cache limit
nfsd: remove unnecessary nofilehandle checks
nfs_common: convert int to bool
...
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This further reduces contention with the transport_lock, and allows us
to convert to using a non-bh-safe spinlock, since the list is now never
accessed from a bh context.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
After transport instance creation, these function pointers never
change. Mark them as constant to prevent their use as an attack
vector for code injections.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The current svcrdma sendto code path posts one RDMA Write WR at a
time. Each of these Writes typically carries a small number of pages
(for instance, up to 30 pages for mlx4 devices). That means a 1MB
NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
and one for the Send WR carrying the actual RPC Reply message.
Instead, use the new rdma_rw API. The details of Write WR chain
construction and memory registration are taken care of in the RDMA
core. svcrdma can focus on the details of the RPC-over-RDMA
protocol. This gives three main benefits:
1. All Write WRs for one RDMA segment are posted in a single chain.
As few as one ib_post_send() for each Write chunk.
2. The Write path can now use FRWR to register the Write buffers.
If the device's maximum page list depth is large, this means a
single Write WR is needed for each RPC's Write chunk data.
3. The new code introduces support for RPCs that carry both a Write
list and a Reply chunk. This combination can be used for an NFSv4
READ where the data payload is large, and thus is removed from the
Payload Stream, but the Payload Stream is still larger than the
inline threshold.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Introduce a helper to DMA-map a reply's transport header before
sending it. This will in part replace the map vector cache.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: Move the ib_send_wr off the stack, and move common code
to post a Send Work Request into a helper.
This is a refactoring change only.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
bugfixes.
A couple changes could theoretically break working setups on upgrade. I
don't expect complaints in practice, but they seem worth calling out
just in case:
- NFS security labels are now off by default; a new
security_label export flag reenables it per export. But,
having them on by default is a disaster, as it generally only
makes sense if all your clients and servers have similar
enough selinux policies. Thanks to Jason Tibbitts for
pointing this out.
- NFSv4/UDP support is off. It was never really supported, and
the spec explicitly forbids it. We only ever left it on out
of laziness; thanks to Jeff Layton for finally fixing that.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYtejbAAoJECebzXlCjuG+JhEQAK3YTYYNrPY26Pfiu0FghLFV
4qOHK4DOkJzrWIom5uWyBo7yOwH6WnQtTe/gCx/voOEW3lsJO7F3IfTnTVp+Smp6
GJeVtsr1vI9EBnwhMlyoJ5hZ2Ju5kX3MBVnew6+momt6620ZO7a+EtT+74ePaY8Y
jxLzWVA1UqbWYoMabNQpqgKypKvNrhwst72iYyBhNuL/qtGeBDQWwcrA+TFeE9tv
Ad7qB53xL1mr0Wn1CNIOR/IzVAj4o2H0vdjqrPjAdvbfwf8YYLNpJXt5k591wx/j
1TpiWIPqnwLjMT3X5NkQN3agZKeD+2ZWPrClr35TgRRe62CK6JblK9/Wwc5BRSzV
paMP3hOm6/dQOBA5C+mqPaHdEI8VqcHyZpxU4VC/ttsVEGTgaLhGwHGzSn+5lYiM
Qx9Sh50yFV3oiBW/sb/y8lBDwYm/Cq0OyqAU277idbdzjcFerMg1qt06tjEQhYMY
K2V7rS8NuADUF6F1BwOONZzvg7Rr7iWHmLh+iSM9TeQoEmz2jIHSaLIFaYOET5Jr
PIZS3rOYoa0FaKOYVYnZMC74n/LqP/Aou8B+1rRcLy5YEdUIIIVpFvqpg1nGv6PI
sA3zx/f13IRte3g0CuQiY0/2cx7uXk/gXJ7s5+ejEzljF/aYWiomx3mr6HqPQETn
CWEtXlfyJCyX+A8hbO+U
=iLLz
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"The nfsd update this round is mainly a lot of miscellaneous cleanups
and bugfixes.
A couple changes could theoretically break working setups on upgrade.
I don't expect complaints in practice, but they seem worth calling out
just in case:
- NFS security labels are now off by default; a new security_label
export flag reenables it per export. But, having them on by default
is a disaster, as it generally only makes sense if all your clients
and servers have similar enough selinux policies. Thanks to Jason
Tibbitts for pointing this out.
- NFSv4/UDP support is off. It was never really supported, and the
spec explicitly forbids it. We only ever left it on out of
laziness; thanks to Jeff Layton for finally fixing that"
* tag 'nfsd-4.11' of git://linux-nfs.org/~bfields/linux: (34 commits)
nfsd: Fix display of the version string
nfsd: fix configuration of supported minor versions
sunrpc: don't register UDP port with rpcbind when version needs congestion control
nfs/nfsd/sunrpc: enforce transport requirements for NFSv4
sunrpc: flag transports as having congestion control
sunrpc: turn bitfield flags in svc_version into bools
nfsd: remove superfluous KERN_INFO
nfsd: special case truncates some more
nfsd: minor nfsd_setattr cleanup
NFSD: Reserve adequate space for LOCKT operation
NFSD: Get response size before operation for all RPCs
nfsd/callback: Drop a useless data copy when comparing sessionid
nfsd/callback: skip the callback tag
nfsd/callback: Cleanup callback cred on shutdown
nfsd/idmap: return nfserr_inval for 0-length names
SUNRPC/Cache: Always treat the invalid cache as unexpired
SUNRPC: Drop all entries from cache_detail when cache_purge()
svcrdma: Poll CQs in "workqueue" mode
svcrdma: Combine list fields in struct svc_rdma_op_ctxt
svcrdma: Remove unused sc_dto_q field
...
Removing linux/phy.h from net/dsa.h reveals a build error in the sunrpc
code:
net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function 'xprt_rdma_bc_put':
net/sunrpc/xprtrdma/svc_rdma_backchannel.c:277:2: error: implicit declaration of function 'module_put' [-Werror=implicit-function-declaration]
net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function 'xprt_setup_rdma_bc':
net/sunrpc/xprtrdma/svc_rdma_backchannel.c:348:7: error: implicit declaration of function 'try_module_get' [-Werror=implicit-function-declaration]
Fix this by adding linux/module.h to svc_rdma_backchannel.c
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace C structure-based XDR decoding with pointer arithmetic.
Pointer arithmetic is considered more portable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up.
/linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function
‘xprt_rdma_bc_allocate’:
linux-2.6/net/sunrpc/xprtrdma/svc_rdma_backchannel.c:169:23: warning:
variable ‘rdma’ set but not used [-Wunused-but-set-variable]
struct svcxprt_rdma *rdma;
^
Fixes: 5d252f90a8 ("svcrdma: Add class for RDMA backwards ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Logic copied from xs_setup_bc_tcp().
Fixes: 39a9beab5a ('rpc: share one xps between all backchannels')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>