Highlights include:
Stable fixes:
- pNFS: Don't return layout segments that are being used for I/O
- pNFS: Don't move layout segments off the active list when being used for I/O
Features:
- NFS: Add support for user xattrs through the NFSv4.2 protocol
- NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC
- NFSv4.0 allow nconnect for v4.0
Bugfixes and cleanups:
- nfs: ensure correct writeback errors are returned on close()
- nfs: nfs_file_write() should check for writeback errors
- nfs: Fix getxattr kernel panic and memory overflow
- NFS: Fix the pNFS/flexfiles mirrored read failover code
- SUNRPC: dont update timeout value on connection reset
- freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
- sunrpc: destroy rpc_inode_cachep after unregister_filesystem
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl83CYEACgkQZwvnipYK
APLx4g/9EQNTG5VkUToElRHs4b3vFxbmh9Odnk/JwPHxY5GQ8/AyGqwWHBgMZc7/
2AV/C83pk7pJsDNsKbVAaFCT1cwjmItHM63vKJYGBYbE8LhkZ/1sYkdtPBYwHoVl
7CWfpVY/4NjYw5GJfrVA5Y0m7lrQInRtIMzfaENw2tpw+/cKUpadxgEJltzFNvpa
Ploinr1ZRBl1tvfeHNRP5ZMPk2AfgGWtQKQ/b2UWUk5tXALoQm2Eu+/oku39uqhy
hW5tCbU2BzR91gg5JwF9n7VowkCHXfe7lMzDBTVfwZOELOmoyys/1wKv550FWcWi
yymljWiPGZOnXGT1vfKptPESQjdtElMfanvEZ0BzS+yNR0HZGnIupaxGlYlG9ZGU
2sXHQPp3mk2Q+L1IgbTSCnSju0YlZo32JQpYCZiROjIXnPWPQ50YNhr8GL18M1FW
hTeShg2avWH+59GB6moEBmsuvui7Dy1jkimblToLEoGJ4kbvEl72FYSqTCkAXXbB
rVUzhmJFgfk/EOS4d+QKJoBqNzw3aw79wyT7PLkoCYBqPZBHQexlmI+ktMbgUEdw
c/fM7l5/Vcb9weIHKzul2Jbk5q6bFME/xPnIkr3v/oKIFlLFwQ04BX6R/42AMJHw
V5Q9Wp81Vy6RXQMn8P0ZMeY0WQC/rhpijOEVMUC+Ni+spz44AdM=
=k4pE
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Stable fixes:
- pNFS: Don't return layout segments that are being used for I/O
- pNFS: Don't move layout segments off the active list when being used for I/O
Features:
- NFS: Add support for user xattrs through the NFSv4.2 protocol
- NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC
- NFSv4.0 allow nconnect for v4.0
Bugfixes and cleanups:
- nfs: ensure correct writeback errors are returned on close()
- nfs: nfs_file_write() should check for writeback errors
- nfs: Fix getxattr kernel panic and memory overflow
- NFS: Fix the pNFS/flexfiles mirrored read failover code
- SUNRPC: dont update timeout value on connection reset
- freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
- sunrpc: destroy rpc_inode_cachep after unregister_filesystem"
* tag 'nfs-for-5.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (32 commits)
NFS: Fix flexfiles read failover
fs: nfs: delete repeated words in comments
rpc_pipefs: convert comma to semicolon
nfs: Fix getxattr kernel panic and memory overflow
NFS: Don't return layout segments that are in use
NFS: Don't move layouts to plh_return_segs list while in use
NFS: Add layout segment info to pnfs read/write/commit tracepoints
NFS: Add tracepoints for layouterror and layoutstats.
NFS: Report the stateid + status in trace_nfs4_layoutreturn_on_close()
SUNRPC dont update timeout value on connection reset
nfs: nfs_file_write() should check for writeback errors
nfs: ensure correct writeback errors are returned on close()
NFSv4.2: xattr cache: get rid of cache discard work queue
NFS: remove redundant initialization of variable result
NFSv4.0 allow nconnect for v4.0
freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
sunrpc: destroy rpc_inode_cachep after unregister_filesystem
NFSv4.2: add client side xattr caching.
NFSv4.2: hook in the user extended attribute handlers
NFSv4.2: add the extended attribute proc functions.
...
Current behaviour: every time a v3 operation is re-sent to the server
we update (double) the timeout. There is no distinction between whether
or not the previous timer had expired before the re-sent happened.
Here's the scenario:
1. Client sends a v3 operation
2. Server RST-s the connection (prior to the timeout) (eg., connection
is immediately reset)
3. Client re-sends a v3 operation but the timeout is now 120sec.
As a result, an application sees 2mins pause before a retry in case
server again does not reply.
Instead, this patch proposes to keep track off when the minor timeout
should happen and if it didn't, then don't update the new timeout.
Value is updated based on the previous value to make timeouts
predictable.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
First, refactor: Dereference the svc_rdma_send_ctxt inside
svc_rdma_send() instead of at every call site.
Then, it can be passed into trace_svcrdma_post_send() to get the
proper completion ID.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Set up a completion ID in each svc_rdma_send_ctxt. The ID is used
to match an incoming Send completion to a transport and to a
previous ib_post_send().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Set up a completion ID in each svc_rdma_recv_ctxt. The ID is used
to match an incoming Receive completion to a transport and to a
previous ib_post_recv().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The goal is to replace CQE kernel memory addresses in completion-
related tracepoints.
Each completion ID matches an incoming Send or Receive completion
to a Completion Queue and to a previous ib_post_*(). The ID can
then be displayed in an error message or recorded in a trace
record.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Use these helpers in a few spots to demonstrate their use.
The remaining open-coded discriminator checks in rpcrdma will be
addressed in subsequent patches.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
New features and improvements:
- Sunrpc receive buffer sizes only change when establishing a GSS credentials
- Add more sunrpc tracepoints
- Improve on tracepoints to capture internal NFS I/O errors
Other bugfixes and cleanups:
- Move a dprintk() to after a call to nfs_alloc_fattr()
- Fix off-by-one issues in rpc_ntop6
- Fix a few coccicheck warnings
- Use the correct SPDX license identifiers
- Fix rpc_call_done assignment for BIND_CONN_TO_SESSION
- Replace zero-length array with flexible array
- Remove duplicate headers
- Set invalid blocks after NFSv4 writes to update space_used attribute
- Fix direct WRITE throughput regression
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl7ibyIACgkQ18tUv7Cl
QOsOHBAA1A1stYld0gOhKZtMqxRJi3fnJ5mgroLGtyVQe8uAjpD8Ib1oRleC4MJq
ifpYPozIhMZQCvDiGTAKJ8629OYiXGrN8D5nV6Y2tEGpu5wYv98MyZlU9Y8rVzCP
5vsIMUp5XH8y2wYO8k7fDPPxWNH9Ax89wz5OI16mZxgY/LDm4ojZq+pGbYnWZa4w
oK6Efa66z7yQkPV8oIWuvLe1zZYWGAPibBEwJbrvUWyfygB3owI36sc6nuiEQM+4
hD3h5UtVn8BnudUqvLLa21rnQROMFpgYf4Q/2A1UaNfyRAPoPXMztECBSEYXO0L4
saiMc5o/yTTBCC0ZjV1F+xuGQzMgSQ83KOdbr+a+upvBeFpBynJxccdvMTDEam+q
rl7Ypdc42CsTZ1aVWG/AoIk6GENzR0tXqNR6BcDjYG/yRWvnt/RIZlp6G67IbtRH
b9we+3MbI/lTBoCFGahkkBYO3elTNwilxH3pWcRi8ehNn0GPjlLqHePR17Tmq1tL
QycDlm7QB1m5xNsOOLaBoB4SyguPV0SBprZJ4yYU1B3KC3bGurZVK3+TSLXQrO9V
12RLDt4AOGr0TlctBIhNbkGp8xHY6Dg7HgbdjdrVq8Y9YCfg0C37789BnZA5nVxF
4L101lsTI0puymh+MwmhiyOvCldn30f+MjuWJSm17Id+eRIxYj4=
=a84h
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New features and improvements:
- Sunrpc receive buffer sizes only change when establishing a GSS credentials
- Add more sunrpc tracepoints
- Improve on tracepoints to capture internal NFS I/O errors
Other bugfixes and cleanups:
- Move a dprintk() to after a call to nfs_alloc_fattr()
- Fix off-by-one issues in rpc_ntop6
- Fix a few coccicheck warnings
- Use the correct SPDX license identifiers
- Fix rpc_call_done assignment for BIND_CONN_TO_SESSION
- Replace zero-length array with flexible array
- Remove duplicate headers
- Set invalid blocks after NFSv4 writes to update space_used attribute
- Fix direct WRITE throughput regression"
* tag 'nfs-for-5.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
NFS: Fix direct WRITE throughput regression
SUNRPC: rpc_xprt lifetime events should record xprt->state
xprtrdma: Make xprt_rdma_slot_table_entries static
nfs: set invalid blocks after NFSv4 writes
NFS: remove redundant initialization of variable result
sunrpc: add missing newline when printing parameter 'auth_hashtable_size' by sysfs
NFS: Add a tracepoint in nfs_set_pgio_error()
NFS: Trace short NFS READs
NFS: nfs_xdr_status should record the procedure name
SUNRPC: Set SOFTCONN when destroying GSS contexts
SUNRPC: rpc_call_null_helper() should set RPC_TASK_SOFT
SUNRPC: rpc_call_null_helper() already sets RPC_TASK_NULLCREDS
SUNRPC: trace RPC client lifetime events
SUNRPC: Trace transport lifetime events
SUNRPC: Split the xdr_buf event class
SUNRPC: Add tracepoint to rpc_call_rpcerror()
SUNRPC: Update the RPC_SHOW_SOCKET() macro
SUNRPC: Update the rpc_show_task_flags() macro
SUNRPC: Trace GSS context lifetimes
SUNRPC: receive buffer size estimation values almost never change
...
Avoid unnecessary cache sloshing by placing the buffer size
estimation update logic behind an atomic bit flag.
The size of GSS information included in each wrapped Reply does
not change during the lifetime of a GSS context. Therefore, the
au_rslack and au_ralign fields need to be updated only once after
establishing a fresh GSS credential.
Thus a slack size update must occur after a cred is created,
duplicated, renewed, or expires. I'm not sure I have this exactly
right. A trace point is introduced to track updates to these
variables to enable troubleshooting the problem if I missed a spot.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
gss_mech_register() calls svcauth_gss_register_pseudoflavor() for each
flavour, but gss_mech_unregister() does not call auth_domain_put().
This is unbalanced and makes it impossible to reload the module.
Change svcauth_gss_register_pseudoflavor() to return the registered
auth_domain, and save it for later release.
Cc: stable@vger.kernel.org (v2.6.12+)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206651
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Highlights of this series:
* Remove serialization of sending RPC/RDMA Replies
* Convert the TCP socket send path to use xdr_buf::bvecs (pre-requisite for
RPC-on-TLS)
* Fix svcrdma backchannel sendto return code
* Convert a number of dprintk call sites to use tracepoints
* Fix the "suggest braces around empty body in an 'else' statement" warning
This function is not currently "generic" so remove the documenting
comment and rename it appropriately. Its internals are converted to
use bio_vecs for reading from the transport socket.
In existing typical sunrpc uses of bio_vecs, the bio_vec array is
allocated dynamically. Here, instead, an array of bio_vecs is added
to svc_rqst. The lifetime of this array can be greater than one call
to xpo_recvfrom():
- Multiple calls to xpo_recvfrom() might be needed to read an RPC
message completely.
- At some later point, rq_arg.bvecs will point to this array and it
will carry the received message into svc_process().
I also expect that a future optimization will remove either the
rq_vec or rq_pages array in favor of rq_bvec, thus conserving the
size of struct svc_rqst.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up. I find the name of the svc_sock::sk_reclen field
confusing, so I've changed it to better reflect its function. This
field is not read directly to get the record length. Rather, it is
a buffer containing a record marker that needs to be decoded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: Commit d21b05f101 ("rdma: SVCRMDA Header File")
introduced the SVCRDMA_DEBUG macro, but it doesn't seem to have been
used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Way back when I was writing the RPC/RDMA server-side backchannel
code, I misread the TCP backchannel reply handler logic. When
svc_tcp_recvfrom() successfully receives a backchannel reply, it
does not return -EAGAIN. It sets XPT_DATA and returns zero.
Update svc_rdma_recvfrom() to return zero. Here, XPT_DATA doesn't
need to be set again: it is set whenever a new message is received,
behind a spin lock in a single threaded context.
Also, if handling the cb reply is not successful, the message is
simply dropped. There's no special message framing to deal with as
there is in the TCP case.
Now that the handle_bc_reply() return value is ignored, I've removed
the dprintk call sites in the error exit of handle_bc_reply() in
favor of trace points in other areas that already report the error
cases.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It appears that the RPC/RDMA transport does not need serialization
of calls to its xpo_sendto method. Move the mutex into the socket
methods that still need that serialization.
Tail latencies are unambiguously better with this patch applied.
fio randrw 8KB 70/30 on NFSv3, smaller numbers are better:
clat percentiles (usec):
With xpt_mutex:
r | 99.99th=[ 8848]
w | 99.99th=[ 9634]
Without xpt_mutex:
r | 99.99th=[ 8586]
w | 99.99th=[ 8979]
Serializing the construction of RPC/RDMA transport headers is not
really necessary at this point, because the Linux NFS server
implementation never changes its credit grant on a connection. If
that should change, then svc_rdma_sendto will need to serialize
access to the transport's credit grant fields.
Reported-by: kbuild test robot <lkp@intel.com>
[ cel: fix uninitialized variable warning ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- Resolve a data integrity problem with NFSD that I inadvertently
introduced last year. The change I made makes the NFS server's
duplicate reply cache ineffective when krb5i or krb5p are in use,
thus allowing the replay of non-idempotent NFS requests such as
RENAME, SETATTR, or even WRITEs.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJerCuyAAoJEDNqszNvZn+XxvAQAJmUW5412OO7mkI2IW5PDP71
ZnBAuTs4UpLBgp1VpS3ai0LYnOX9o8WLqolzGuFxGfK69ZZdh7U7fzX2aEytoTSP
KkW3dNo+NzRppWOhMBEfMBLnAu22YF+F689RvwEqd0C1AgGugaFfzlF1ECrJVpA7
g1WVhTi0ihfArhzSWTWO4LiuwjRd5TNF8gEci2j3DuHn1Hp6BagbKOv0rFdgK99X
BbK8IaEalBUjtpGAPgRU/WY/WznzhgARVeOX7Rh/P/zFdFB1G1M4kycaadBk6uaU
SHbdWBwDsYatDNuhZUI3Wv2g+DQ5LJRrjNNesLRot+kC3XD12sBCMsSI3owoz7Jt
u0s48YmOJO8uWi4kDenR9XV8bAaDmX7R/+XGZm1lethNrpBKat9EIrqSHNvqAXZ4
b3cC8/A/aCcOrWXtZnWqvJdqjx2EgL6DbcpaFheaPEekRofuiyOaAbXdlJQvzcwY
Sv4EC4ymABpQRg0si+Sya5Int7bZ9ryLZTSCMiLA+L1TnoW26XjMlGAaRqYi7Tx7
Qg4Bt400IIDE0FlE/76vE7b7YWQj7GfErA6moIyDio5AInRU9sHDFyB8iCfdpKxh
ajNl1NuEO/FSoXOGQvOo1uHD0vKvNVK21T6vQsRCT1f6JXtpiwTn6eLX4Wn9YLdI
iKqg2YXfdCbJnAuoxzGi
=hT3x
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.7-rc-2' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull nfsd fixes from Chuck Lever:
"Resolve a data integrity problem with NFSD that I inadvertently
introduced last year.
The change I made makes the NFS server's duplicate reply cache
ineffective when krb5i or krb5p are in use, thus allowing the replay
of non-idempotent NFS requests such as RENAME, SETATTR, or even
WRITEs"
* tag 'nfsd-5.7-rc-2' of git://git.linux-nfs.org/projects/cel/cel-2.6:
SUNRPC: Revert 241b1f419f ("SUNRPC: Remove xdr_buf_trim()")
SUNRPC: Fix GSS privacy computation of auth->au_ralign
SUNRPC: Add "@len" parameter to gss_unwrap()
We currently revoke read delegations on any write open or any operation
that modifies file data or metadata (including rename, link, and
unlink). But if the delegation in question is the only read delegation
and is held by the client performing the operation, that's not really
necessary.
It's not always possible to prevent this in the NFSv4.0 case, because
there's not always a way to determine which client an NFSv4.0 delegation
came from. (In theory we could try to guess this from the transport
layer, e.g., by assuming all traffic on a given TCP connection comes
from the same client. But that's not really correct.)
In the NFSv4.1 case the session layer always tells us the client.
This patch should remove such self-conflicts in all cases where we can
reliably determine the client from the compound.
To do that we need to track "who" is performing a given (possibly
lease-breaking) file operation. We're doing that by storing the
information in the svc_rqst and using kthread_data() to map the current
task back to a svc_rqst.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Highlights include:
Stable fixes
- fix handling of backchannel binding in BIND_CONN_TO_SESSION
Bugfixes
- Fix a credential use-after-free issue in pnfs_roc()
- Fix potential posix_acl refcnt leak in nfs3_set_acl
- defer slow parts of rpc_free_client() to a workqueue
- Fix an Oopsable race in __nfs_list_for_each_server()
- Fix trace point use-after-free race
- Regression: the RDMA client no longer responds to server disconnect requests
- Fix return values of xdr_stream_encode_item_{present, absent}
- _pnfs_return_layout() must always wait for layoutreturn completion
Cleanups
- Remove unreachable error conditions
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6tczsACgkQZwvnipYK
APKHWg//QGx2Tolj5dh2jBHa47A5/SYnJxCZAA0/fWdwRtFkW3HyyGne1jU86do2
SMAVpBpri1WJPt5d3DH66gu4l4UxG1h84s7QP4lGfSa85EmtLh+LoZQCZRqYoDOo
JAMzWctELu1TUpaa1N5Dhg/qMtMy6ulRMWgzTLqB9a/pQa3onugTK6W7xiut2prj
PBfFq7N9XXmPboSeGV9bR4L8XKSbTCLEt3U1F2zAGU7UUINvDfpjEXq7BHYCewKL
ObPW6EWZksyna16H8i/xGWoKgE4JFVjMwQAP7UdDBi+FW9RI6UpTBoR6z9N748j0
jEocDbI21wgnwmtrVTbzsYm6ttHl4D4egoNxn7m5zjxTU4Ba/RQG2aaHUGFOYpJj
1FI1f6V1Y5v4mJajdsEH+pGW/4vK/4YMR+7YHJ/hYU/WiXjLf7onIIifdWt4SQdo
lvZbGcx6IAHYUA4lI7hkcvrK4bbqAnPLFq28nlUWEID5q5D+nA1ZR9iN0FToviDy
FYyhQzyfD1kt98SV1DjWUqvDDd6IB64iDZTXGmtWvj6c2nbezGiFffvtzUL5LFxY
QfI8lkpmUyt1EiWlZWhtOh4zsiM5yMZkJB/3RJv3RMmswizSSAHdgCKWhdLpX0bl
TG1L8yEmcTc5ANS37EhlpcBNbfYw7oIF/OXuReTSRoMQl5hxjfY=
=w0zk
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- fix handling of backchannel binding in BIND_CONN_TO_SESSION
Bugfixes:
- Fix a credential use-after-free issue in pnfs_roc()
- Fix potential posix_acl refcnt leak in nfs3_set_acl
- defer slow parts of rpc_free_client() to a workqueue
- Fix an Oopsable race in __nfs_list_for_each_server()
- Fix trace point use-after-free race
- Regression: the RDMA client no longer responds to server disconnect
requests
- Fix return values of xdr_stream_encode_item_{present, absent}
- _pnfs_return_layout() must always wait for layoutreturn completion
Cleanups:
- Remove unreachable error conditions"
* tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix a race in __nfs_list_for_each_server()
NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION
SUNRPC: defer slow parts of rpc_free_client() to a workqueue.
NFSv4: Remove unreachable error condition due to rpc_run_task()
SUNRPC: Remove unreachable error condition
xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}
xprtrdma: Fix trace point use-after-free race
xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl
NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc()
NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion
Currently, if the client sends BIND_CONN_TO_SESSION with
NFS4_CDFC4_FORE_OR_BOTH but only gets NFS4_CDFS4_FORE back it ignores
that it wasn't able to enable a backchannel.
To make sure, the client sends BIND_CONN_TO_SESSION as the first
operation on the connections (ie., no other session compounds haven't
been sent before), and if the client's request to bind the backchannel
is not satisfied, then reset the connection and retry.
Cc: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The rpciod workqueue is on the write-out path for freeing dirty memory,
so it is important that it never block waiting for memory to be
allocated - this can lead to a deadlock.
rpc_execute() - which is often called by an rpciod work item - calls
rcp_task_release_client() which can lead to rpc_free_client().
rpc_free_client() makes two calls which could potentially block wating
for memory allocation.
rpc_clnt_debugfs_unregister() calls into debugfs and will block while
any of the debugfs files are being accessed. In particular it can block
while any of the 'open' methods are being called and all of these use
malloc for one thing or another. So this can deadlock if the memory
allocation waits for NFS to complete some writes via rpciod.
rpc_clnt_remove_pipedir() can take the inode_lock() and while it isn't
obvious that memory allocations can happen while the lock it held, it is
safer to assume they might and to not let rpciod call
rpc_clnt_remove_pipedir().
So this patch moves these two calls (together with the final kfree() and
rpciod_down()) into a work-item to be run from the system work-queue.
rpciod can continue its important work, and the final stages of the free
can happen whenever they happen.
I have seen this deadlock on a 4.12 based kernel where debugfs used
synchronize_srcu() when removing objects. synchronize_srcu() requires a
workqueue and there were no free workther threads and none could be
allocated. While debugsfs no longer uses SRCU, I believe the deadlock
is still possible.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
I've noticed that when krb5i or krb5p security is in use,
retransmitted requests are missing the server's duplicate reply
cache. The computed checksum on the retransmitted request does not
match the cached checksum, resulting in the server performing the
retransmitted request again instead of returning the cached reply.
The assumptions made when removing xdr_buf_trim() were not correct.
In the send paths, the upper layer has already set the segment
lengths correctly, and shorting the buffer's content is simply a
matter of reducing buf->len.
xdr_buf_trim() is the right answer in the receive/unwrap path on
both the client and the server. The buffer segment lengths have to
be shortened one-by-one.
On the server side in particular, head.iov_len needs to be updated
correctly to enable nfsd_cache_csum() to work correctly. The simple
buf->len computation doesn't do that, and that results in
checksumming stale data in the buffer.
The problem isn't noticed until there's significant instability of
the RPC transport. At that point, the reliability of retransmit
detection on the server becomes crucial.
Fixes: 241b1f419f ("SUNRPC: Remove xdr_buf_trim()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When the au_ralign field was added to gss_unwrap_resp_priv, the
wrong calculation was used. Setting au_rslack == au_ralign is
probably correct for kerberos_v1 privacy, but kerberos_v2 privacy
adds additional GSS data after the clear text RPC message.
au_ralign needs to be smaller than au_rslack in that fairly common
case.
When xdr_buf_trim() is restored to gss_unwrap_kerberos_v2(), it does
exactly what I feared it would: it trims off part of the clear text
RPC message. However, that's because rpc_prepare_reply_pages() does
not set up the rq_rcv_buf's tail correctly because au_ralign is too
large.
Fixing the au_ralign computation also corrects the alignment of
rq_rcv_buf->pages so that the client does not have to shift reply
data payloads after they are received.
Fixes: 35e77d21ba ("SUNRPC: Add rpc_auth::au_ralign field")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: This is a pre-requisite to fixing the client-side ralign
computation in gss_unwrap_resp_priv().
The length value is passed in explicitly rather that as the value
of buf->len. This will subsequently allow gss_unwrap_kerberos_v1()
to compute a slack and align value, instead of computing it in
gss_unwrap_resp_priv().
Fixes: 35e77d21ba ("SUNRPC: Add rpc_auth::au_ralign field")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Utilize the xpo_release_rqst transport method to ensure that each
rqstp's svc_rdma_recv_ctxt object is released even when the server
cannot return a Reply for that rqstp.
Without this fix, each RPC whose Reply cannot be sent leaks one
svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped
Receive buffer, and any pages that might be part of the Reply
message.
The leak is infrequent unless the network fabric is unreliable or
Kerberos is in use, as GSS sequence window overruns, which result
in connection loss, are more common on fast transports.
Fixes: 3a88092ee3 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Highlights include:
Stable fixes:
- Fix a page leak in nfs_destroy_unlinked_subrequests()
- Fix use-after-free issues in nfs_pageio_add_request()
- Fix new mount code constant_table array definitions
- finish_automount() requires us to hold 2 refs to the mount record
Features:
- Improve the accuracy of telldir/seekdir by using 64-bit cookies when
possible.
- Allow one RDMA active connection and several zombie connections to
prevent blocking if the remote server is unresponsive.
- Limit the size of the NFS access cache by default
- Reduce the number of references to credentials that are taken by NFS
- pNFS files and flexfiles drivers now support per-layout segment
COMMIT lists.
- Enable partial-file layout segments in the pNFS/flexfiles driver.
- Add support for CB_RECALL_ANY to the pNFS flexfiles layout type
- pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
the DS using the layouterror mechanism.
Bugfixes and cleanups:
- SUNRPC: Fix krb5p regressions
- Don't specify NFS version in "UDP not supported" error
- nfsroot: set tcp as the default transport protocol
- pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
- alloc_nfs_open_context() must use the file cred when available
- Fix locking when dereferencing the delegation cred
- Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails
- Various clean ups of the NFS O_DIRECT commit code
- Clean up RDMA connect/disconnect
- Replace zero-length arrays with C99-style flexible arrays
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6LhhsACgkQZwvnipYK
APIOJxAAiQOgmIg1CV4mrlcVhkwy09N5JAia6AENtoTmwm08nAYg5Y8REb9uX46a
/MJsM2WG8hBCgI6eYmRY8LTr4Ft9rTQEJM9DRMuwQREXwMWwBhUv/QakCeqY1lHE
lyB1z4hj5XKeUoN/OcfALC/GXFFf56A0UyN05nMzeCkBTdd3+qu+hW8Ge1wkAXcr
f0pyLbzdFZlJuTmI4tr8F93g9p3ezuFBuEroT7XPIVJylAdZVumHqnOnz/Mvb99x
rNTsX2dc44GhSAfRnTzPumU3MT6BOLvUzNH1xzdiqKzJrbOnG8WjFodrGr3JWpfp
HkeyYQxJ+Hnfb2LiZBjvMQE8M7kVMZ1jVbrGJEbCxfSqgTly8lOHboqAeKsFaReK
LStnusizdA1LHQVZxPdvn+oL49RDxnzm9dY+DkrXK1qT0GE+icN1CyTyLLfkSCp8
tYvZSJ/qPk5BNZegqH1nBqXkMDkOJ4eEA7+luXDmajRkdRrZ3IWY2M1DpMEoueJ2
j/zoj/NFr1oErU4o7PV9oolA1Euhn1L3wIDuzsbVtjySmbXJNQTtaVVRFpGw3SsZ
7rbqi4BB0SzOooNhQ4q8mLNi4qT7bl/3D04eL8UVzEM73plexhQ8XiOEz/VrIRX7
L9viXH49g4DHQ0rZIaWefxFueqpgbNvQwnlLZl2uQotG9hwhTts=
=YUcP
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a page leak in nfs_destroy_unlinked_subrequests()
- Fix use-after-free issues in nfs_pageio_add_request()
- Fix new mount code constant_table array definitions
- finish_automount() requires us to hold 2 refs to the mount record
Features:
- Improve the accuracy of telldir/seekdir by using 64-bit cookies
when possible.
- Allow one RDMA active connection and several zombie connections to
prevent blocking if the remote server is unresponsive.
- Limit the size of the NFS access cache by default
- Reduce the number of references to credentials that are taken by
NFS
- pNFS files and flexfiles drivers now support per-layout segment
COMMIT lists.
- Enable partial-file layout segments in the pNFS/flexfiles driver.
- Add support for CB_RECALL_ANY to the pNFS flexfiles layout type
- pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
the DS using the layouterror mechanism.
Bugfixes and cleanups:
- SUNRPC: Fix krb5p regressions
- Don't specify NFS version in "UDP not supported" error
- nfsroot: set tcp as the default transport protocol
- pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
- alloc_nfs_open_context() must use the file cred when available
- Fix locking when dereferencing the delegation cred
- Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails
- Various clean ups of the NFS O_DIRECT commit code
- Clean up RDMA connect/disconnect
- Replace zero-length arrays with C99-style flexible arrays"
* tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits)
NFS: Clean up process of marking inode stale.
SUNRPC: Don't start a timer on an already queued rpc task
NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()
NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode()
NFS: Beware when dereferencing the delegation cred
NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout
NFS: finish_automount() requires us to hold 2 refs to the mount record
NFS: Fix a few constant_table array definitions
NFS: Try to join page groups before an O_DIRECT retransmission
NFS: Refactor nfs_lock_and_join_requests()
NFS: Reverse the submission order of requests in __nfs_pageio_add_request()
NFS: Clean up nfs_lock_and_join_requests()
NFS: Remove the redundant function nfs_pgio_has_mirroring()
NFS: Fix memory leaks in nfs_pageio_stop_mirroring()
NFS: Fix a request reference leak in nfs_direct_write_clear_reqs()
NFS: Fix use-after-free issues in nfs_pageio_add_request()
NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
NFS: Fix a page leak in nfs_destroy_unlinked_subrequests()
NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio()
pNFS/flexfiles: Specify the layout segment range in LAYOUTGET
...
Trond points out in commit 277f27e2f2 ("SUNRPC/cache: Allow
garbage collection of invalid cache entries") that we allow invalid
cache entries to persist indefinitely. That fix, however,
reintroduces the problem fixed by Kinglong Mee's commit d6fc8821c2
("SUNRPC/Cache: Always treat the invalid cache as unexpired"), where
an invalid cache entry is immediately removed by a flush before
mountd responds to it. The result is that the server thread that
should be waiting for mountd to fill in that entry instead gets an
-ETIMEDOUT return from cache_check(). Symptoms are the server
becoming unresponsive after a restart, reproduceable by running
pynfs 4.1 test REBT5.
Instead, take a compromise approach: allow invalid cache entries to
be removed after they expire, but not to be removed by a cache
flush.
Fixes: 277f27e2f2 ("SUNRPC/cache: Allow garbage collection ... ")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the cache entry never gets initialised, we want the garbage
collector to be able to evict it. Otherwise if the upcall daemon
fails to initialise the entry, we end up never expiring it.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[ cel: resolved a merge conflict ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the rpc.mountd daemon goes down, then that should not cause all
exports to start failing with ESTALE errors. Let's explicitly
distinguish between the cache upcall cases that need to time out,
and those that do not.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
On some platforms, DMA mapping part of a page is more costly than
copying bytes. Indeed, not involving the I/O MMU can help the
RPC/RDMA transport scale better for tiny I/Os across more RDMA
devices. This is because interaction with the I/O MMU is eliminated
for each of these small I/Os. Without the explicit unmapping, the
NIC no longer needs to do a costly internal TLB shoot down for
buffers that are just a handful of bytes.
Since pull-up is now a more a frequent operation, I've introduced a
trace point in the pull-up path. It can be used for debugging or
user-space tools that count pull-up frequency.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Performance optimization: Avoid syncing the transport buffer twice
when Reply buffer pull-up is necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Same idea as the receive-side changes I did a while back: use
xdr_stream helpers rather than open-coding the XDR chunk list
encoders. This builds the Reply transport header from beginning to
end without backtracking.
As additional clean-ups, fill in documenting comments for the XDR
encoders and sprinkle some trace points in the new encoding
functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up. These are taken from the client-side RPC/RDMA transport
to a more global header file so they can be used elsewhere.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Preparing for subsequent patches, no behavior change expected.
Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into those lower-level functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Preparing for subsequent patches, no behavior change expected.
Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into the lower-level send
functions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cache the locations of the Requester-provided Write list and Reply
chunk so that the Send path doesn't need to parse the Call header
again.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The logic that checks incoming network headers has to be scrupulous.
De-duplicate: replace open-coded buffer overflow checks with the use
of xdr_stream helpers that are used most everywhere else XDR
decoding is done.
One minor change to the sanity checks: instead of checking the
length of individual segments, cap the length of the whole chunk
to be sure it can fit in the set of pages available in rq_pages.
This should be a better test of whether the server can handle the
chunks in each request.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Introduce a helper function to compute the XDR pad size of a
variable-length XDR object.
Clean up: Replace open-coded calculation of XDR pad sizes.
I'm sure I haven't found every instance of this calculation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svcrdma expects that the payload falls precisely into the xdr_buf
page vector. This does not seem to be the case for
nfsd4_encode_readv().
This code is called only when fops->splice_read is missing or when
RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
common cases.
Add new transport method: ->xpo_read_payload so that when a READ
payload does not fit exactly in rq_res's page vector, the XDR
encoder can inform the RPC transport exactly where that payload is,
without the payload's XDR pad.
That way, when a Write chunk is present, the transport knows what
byte range in the Reply message is supposed to be matched with the
chunk.
Note that the Linux NFS server implementation of NFS/RDMA can
currently handle only one Write chunk per RPC-over-RDMA message.
This simplifies the implementation of this fix.
Fixes: b042098063 ("nfsd4: allow exotic read compounds")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Clean up: this function is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add a flag to signal to the RPC layer that the credential is already
pinned for the duration of the RPC call.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Stable bugfixes:
- Fix memory leaks and corruption in readdir # v2.6.37+
- Directory page cache needs to be locked when read # v2.6.37+
New features:
- Convert NFS to use the new mount API
- Add "softreval" mount option to let clients use cache if server goes down
- Add a config option to compile without UDP support
- Limit the number of inactive delegations the client can cache at once
- Improved readdir concurrency using iterate_shared()
Other bugfixes and cleanups:
- More 64-bit time conversions
- Add additional diagnostic tracepoints
- Check for holes in swapfiles, and add dependency on CONFIG_SWAP
- Various xprtrdma cleanups to prepare for 5.7's changes
- Several fixes for NFS writeback and commit handling
- Fix acls over krb5i/krb5p mounts
- Recover from premature loss of openstateids
- Fix NFS v3 chacl and chmod bug
- Compare creds using cred_fscmp()
- Use kmemdup_nul() in more places
- Optimize readdir cache page invalidation
- Lease renewal and recovery fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl48kMUACgkQ18tUv7Cl
QOs/bA/+KAHaee+1jWdgRS88CnNDfeokU2sGWuyXWrVTmiKZ+IjnIUIWqmeKhVyg
RTbaG4PGTIwiLDFibgzdnc3cTOQEgLnVGWWZ50Xh3b7ubock7+/4JHxqZS+/f3vf
yqwM0dZaXi5Kcx1kEJ+niBxuzkc9mFI+nHh+wLIlin/kaaUdLKu7mP3NXj2cmWxN
NoRaKc2gEvkPHhPSH4Z1DVXTHxvH2REFvt9APPUgfLfqcUVHV9b7V/wI/roiGWMn
53h6f38IdqoNQIpzMog/k/va67NLmEvUZOlpCYPyanPOjuxTrmi8iC2S6gLEOjtc
GGnQnc5skVL31seFR1NbOJiiN3hTLTncnoXza0cKtYxmo7a/FjXApw4jCu3Rkrav
UXpCI4O6+2AVVG+pEPbjQy3/GEImeoGvp+xr57jBSZBHoDZU9LDwag65qvZ1btIq
KOBx2gweQz0aB2heXmfee7qzxFdftHmtMWhIMnJASKNuAWGL23Scqem+d97i2T6H
7y9OJ3aOXiYxFMLYJCsLWjUJxYiaIANNBmHMjf27mZzcdDuxGFms277CMpNPr3SU
WZk6/oKw9jaRSzHzaKgVDXiULLXQE1/xZ/mvgR/zk1QAusyeXPvVnMdxoRdxFdXb
QGZHgUqvFvYi8Lufvs+ZLGS4sAp7oD/Q+lNPXn7cniSwfY4uJiw=
=b6+F
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Puyll NFS client updates from Anna Schumaker:
"Stable bugfixes:
- Fix memory leaks and corruption in readdir # v2.6.37+
- Directory page cache needs to be locked when read # v2.6.37+
New features:
- Convert NFS to use the new mount API
- Add "softreval" mount option to let clients use cache if server goes down
- Add a config option to compile without UDP support
- Limit the number of inactive delegations the client can cache at once
- Improved readdir concurrency using iterate_shared()
Other bugfixes and cleanups:
- More 64-bit time conversions
- Add additional diagnostic tracepoints
- Check for holes in swapfiles, and add dependency on CONFIG_SWAP
- Various xprtrdma cleanups to prepare for 5.7's changes
- Several fixes for NFS writeback and commit handling
- Fix acls over krb5i/krb5p mounts
- Recover from premature loss of openstateids
- Fix NFS v3 chacl and chmod bug
- Compare creds using cred_fscmp()
- Use kmemdup_nul() in more places
- Optimize readdir cache page invalidation
- Lease renewal and recovery fixes"
* tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (93 commits)
NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals
NFSv4: try lease recovery on NFS4ERR_EXPIRED
NFS: Fix memory leaks
nfs: optimise readdir cache page invalidation
NFS: Switch readdir to using iterate_shared()
NFS: Use kmemdup_nul() in nfs_readdir_make_qstr()
NFS: Directory page cache pages need to be locked when read
NFS: Fix memory leaks and corruption in readdir
SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id()
NFS: Replace various occurrences of kstrndup() with kmemdup_nul()
NFSv4: Limit the total number of cached delegations
NFSv4: Add accounting for the number of active delegations held
NFSv4: Try to return the delegation immediately when marked for return on close
NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
NFS: nfs_find_open_context() should use cred_fscmp()
NFS: nfs_access_get_cached_rcu() should use cred_fscmp()
NFSv4: pnfs_roc() must use cred_fscmp() to compare creds
NFS: remove unused macros
nfs: Return EINVAL rather than ERANGE for mount parse errors
...
Remove gss_mech_list_pseudoflavors() and its callers. This is part of
an unused API, and could leak an RCU reference if it were ever called.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>