If we just set the mirror count to 1 without first clearing out
the mirrors, we can leak queued up requests.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
nfs_direct_write_scan_commit_list() will lock the request and bump
the reference count, but we also need to account for the reference
that was taken when we initially added the request to the commit list.
Fixes: fb5f7f20cd ("NFS: commit errors should be fatal")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We need to ensure that we create the mirror requests before calling
nfs_pageio_add_request_mirror() on the request we are adding.
Otherwise, we can end up with a use-after-free if the call to
nfs_pageio_add_request_mirror() triggers I/O.
Fixes: c917cfaf9b ("NFS: Fix up NFS I/O subrequest creation")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When a subrequest is being detached from the subgroup, we want to
ensure that it is not holding the group lock, or in the process
of waiting for the group lock.
Fixes: 5b2b5187fa ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we detach a subrequest from the list, we must also release the
reference it holds to the parent.
Fixes: 5b2b5187fa ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
New Features:
- Allow one active connection and several zombie connections to prevent
blocking if the remote server is unresponsive.
Bugfixes and Cleanups:
- Enhance MR-related trace points
- Refactor connection set-up and disconnect functions
- Make Protection Domains per-connection instead of per-transport
- Merge struct rpcrdma_ia into rpcrdma_ep
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl5+GKEACgkQ18tUv7Cl
QOvJVBAA4MNSvzo0856X414uUMe1YK1HXblqYbNWSe0bRsfaakp8mscMFmd4/IuG
YSOlQPYcTErdfeEelS1ayMZA3qy5Ke62hKFFfv0fk11lN3VoT4XPiIqY6Ry2OJaR
7XoYyH3YKJxgqOYhQwSo4rvFRNtPKbvEtcQn8uLXnMclomlZ1mli4F9rYs23SvLL
Jbq1q/SKx0w+GAOUj418dYLjMnQU7L+TotplTBGzE7mkhBR0JarBiKZfLxbg2/md
Fva1EwY4Tgef+4ofVr5KP8raGVGl6zy8TIZezlijeUv4rmk9JudrYjeUNqyCWufw
ffkrM+wPr8kQViQoei5ByEvVG2pQK9nfxccx6rxiIOoG8Mb1aK8Wh75e2kjfZZeY
bsEx8wSfRvI54ukk/NfGpP21OP1DDKU4ET6hsW+b6onqmDfYPZ/LinUvrvIHietQ
IsMOG+QE+G1mXFiQR1P9dZcIwTNvE/SyUoaykbeslcYCkQt5yEk25UK+38TornbV
hBxrm9KBd25RUu8prFJadmLBjPx3tLLd451xxppCyrIlpw0b+Q4yHlF7Jr75DYBT
xjW5Xpr6tJZuoybbSyT9B/tiV4gIfyJZCU5S95nFlluKZ0gQd2jIGj/JagNWppke
XSEqRR/Ja0OnB7yDz60gQrrZnjw/Em2jx7xbHIXFK+rs45dkdFc=
=piIM
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-5.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
NFSoRDMA Client Updates for Linux 5.7
New Features:
- Allow one active connection and several zombie connections to prevent
blocking if the remote server is unresponsive.
Bugfixes and Cleanups:
- Enhance MR-related trace points
- Refactor connection set-up and disconnect functions
- Make Protection Domains per-connection instead of per-transport
- Merge struct rpcrdma_ia into rpcrdma_ep
If the FLUSH_SYNC flag is set, nfs_initiate_pgio() will currently
wait for completion, and then return the status of the I/O operation.
What we actually want to report in nfs_pageio_doio() is whether or
not the RPC call was launched successfully, whereas actual I/O
status is intended handled in the reply callbacks.
Since FLUSH_SYNC is never set by any of the callers anyway, let's
just remove that code altogether.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Move from requesting only full file layout segments, to requesting
layout segments that match our I/O size. This means the server is
still free to return a full file layout, but we will no longer
error out if it does not.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When starting to read or write with a layout segment, check that the
range matches our request.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Check that the number of mirrors, and the mirror information matches
before deciding to merge layout segments in pNFS/flexfiles.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fix up pnfs_layout_mark_request_commit() to alway reschedule the write
if the layout segment is invalid. Also minor cleanup.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Move the pNFS commit related operations into a separate structure
that can be carried by the pnfs_ds_commit_info.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Lift filelayout_search_commit_reqs() into the generic pnfs/nfs code,
and add support for commit arrays.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Enable adding and lookup of per-layout segment commits in filelayout
and flexfilelayout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ensure that both the file and flexfiles layout types clean up when
freeing the layout segments.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add support for scanning the full list of per-layout segment commit
arrays to nfs_clear_pnfs_ds_commit_verifiers().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Instead of trying to save the commit verifiers and checking them against
previous writes, adopt the same strategy as for buffered writes, of
just checking the verifiers at commit time.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add a pNFS callback to allow the O_DIRECT code to release the DS
commitinfo when freeing the dreq.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add support for scanning the full list of per-layout segment commit
arrays to pnfs_generic_commit_pagelist().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add support for scanning the full list of per-layout segment commit
arrays to pnfs_generic_recover_commit_reqs().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add support for scanning the full list of per-layout segment commit
arrays to pnfs_generic_scan_commit_lists()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When we have multiple layout segments with different lists of mirrored
data, we need to track the commits on a per layout segment basis.
This patch adds a list to support this tracking in struct
pnfs_ds_commit_info.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Change the rpcrdma_xprt_disconnect() function so that it no longer
waits for the DISCONNECTED event. This prevents blocking if the
remote is unresponsive.
In rpcrdma_xprt_disconnect(), the transport's rpcrdma_ep is
detached. Upon return from rpcrdma_xprt_disconnect(), the transport
(r_xprt) is ready immediately for a new connection.
The RDMA_CM_DEVICE_REMOVAL and RDMA_CM_DISCONNECTED events are now
handled almost identically.
However, because the lifetimes of rpcrdma_xprt structures and
rpcrdma_ep structures are now independent, creating an rpcrdma_ep
needs to take a module ref count. The ep now owns most of the
hardware resources for a transport.
Also, a kref is needed to ensure that rpcrdma_ep sticks around
long enough for the cm_event_handler to finish.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcrdma_cm_event_handler() is always passed an @id pointer that is
valid. However, in a subsequent patch, we won't be able to extract
an r_xprt in every case. So instead of using the r_xprt's
presentation address strings, extract them from struct rdma_cm_id.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I eventually want to allocate rpcrdma_ep separately from struct
rpcrdma_xprt so that on occasion there can be more than one ep per
xprt.
The new struct rpcrdma_ep will contain all the fields currently in
rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings
for the connection, in addition to per-connection settings
negotiated with the remote.
Take this opportunity to rename the existing ep fields from rep_* to
re_* to disambiguate these from struct rpcrdma_rep.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Completion errors after a disconnect often occur much sooner than a
CM_DISCONNECT event. Use this to try to detect connection loss more
quickly.
Note that other kernel ULPs do take care to disconnect explicitly
when a WR is flushed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up:
The upper layer serializes calls to xprt_rdma_close, so there is no
need for an atomic bit operation, saving 8 bytes in rpcrdma_ia.
This enables merging rpcrdma_ia_remove directly into the disconnect
logic.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Move rdma_cm_id creation into rpcrdma_ep_create() so that it is now
responsible for allocating all per-connection hardware resources.
With this clean-up, all three arms of the switch statement in
rpcrdma_ep_connect are exactly the same now, thus the switch can be
removed.
Because device removal behaves a little differently than
disconnection, there is a little more work to be done before
rpcrdma_ep_destroy() can release the connection's rdma_cm_id. So
it is not quite symmetrical with rpcrdma_ep_create() yet.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make a Protection Domain (PD) a per-connection resource rather than
a per-transport resource. In other words, when the connection
terminates, the PD is destroyed.
Thus there is one less HW resource that remains allocated to a
transport after a connection is closed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Simplify the synopses of functions in the connect and
disconnect paths in preparation for combining the rpcrdma_ia and
struct rpcrdma_ep structures.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Simplify the synopses of functions in the post_send path
by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: prepare for combining the rpcrdma_ia and rpcrdma_ep
structures. Take the opportunity to rename the function to be
consistent with the "subsystem _ object _ verb" naming scheme.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Refactor rpcrdma_ep_create(), rpcrdma_ep_disconnect(), and
rpcrdma_ep_destroy().
rpcrdma_ep_create will be invoked at connect time instead of at
transport set-up time. It will be responsible for allocating per-
connection resources. In this patch it allocates the CQs and
creates a QP. More to come.
rpcrdma_ep_destroy() is the inverse functionality that is
invoked at disconnect time. It will be responsible for releasing
the CQs and QP.
These changes should be safe to do because both connect and
disconnect is guaranteed to be serialized by the transport send
lock.
This takes us another step closer to resolving the address and route
only at connect time so that connection failover to another device
will work correctly.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Two changes:
- Show the number of SG entries that were mapped. This helps debug
DMA-related problems.
- Record the MR's resource ID instead of its memory address. This
groups each MR with its associated rdma-tool output, and reduces
needless exposure of memory addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Refactor pnfs_generic_commit_pagelist() to simplify the conversion
to layout segment based commit lists.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Just allocate the array at the end of the layout segment structure,
instead of allocating it as a separate array of pointers.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Ever since commit 2c94b8eca1 ("SUNRPC: Use au_rslack when computing
reply buffer size"). It changed how "req->rq_rcvsize" is calculated. It
used to use au_cslack value which was nice and large and changed it to
au_rslack value which turns out to be too small.
Since 5.1, v3 mount with sec=krb5p fails against an Ontap server
because client's receive buffer it too small.
For gss krb5p, we need to account for the mic token in the verifier,
and the wrap token in the wrap token.
RFC 4121 defines:
mic token
Octet no Name Description
--------------------------------------------------------------
0..1 TOK_ID Identification field. Tokens emitted by
GSS_GetMIC() contain the hex value 04 04
expressed in big-endian order in this
field.
2 Flags Attributes field, as described in section
4.2.2.
3..7 Filler Contains five octets of hex value FF.
8..15 SND_SEQ Sequence number field in clear text,
expressed in big-endian order.
16..last SGN_CKSUM Checksum of the "to-be-signed" data and
octet 0..15, as described in section 4.2.4.
that's 16bytes (GSS_KRB5_TOK_HDR_LEN) + chksum
wrap token
Octet no Name Description
--------------------------------------------------------------
0..1 TOK_ID Identification field. Tokens emitted by
GSS_Wrap() contain the hex value 05 04
expressed in big-endian order in this
field.
2 Flags Attributes field, as described in section
4.2.2.
3 Filler Contains the hex value FF.
4..5 EC Contains the "extra count" field, in big-
endian order as described in section 4.2.3.
6..7 RRC Contains the "right rotation count" in big-
endian order, as described in section
4.2.5.
8..15 SND_SEQ Sequence number field in clear text,
expressed in big-endian order.
16..last Data Encrypted data for Wrap tokens with
confidentiality, or plaintext data followed
by the checksum for Wrap tokens without
confidentiality, as described in section
4.2.4.
Also 16bytes of header (GSS_KRB5_TOK_HDR_LEN), encrypted data, and cksum
(other things like padding)
RFC 3961 defines known cksum sizes:
Checksum type sumtype checksum section or
value size reference
---------------------------------------------------------------------
CRC32 1 4 6.1.3
rsa-md4 2 16 6.1.2
rsa-md4-des 3 24 6.2.5
des-mac 4 16 6.2.7
des-mac-k 5 8 6.2.8
rsa-md4-des-k 6 16 6.2.6
rsa-md5 7 16 6.1.1
rsa-md5-des 8 24 6.2.4
rsa-md5-des3 9 24 ??
sha1 (unkeyed) 10 20 ??
hmac-sha1-des3-kd 12 20 6.3
hmac-sha1-des3 13 20 ??
sha1 (unkeyed) 14 20 ??
hmac-sha1-96-aes128 15 20 [KRB5-AES]
hmac-sha1-96-aes256 16 20 [KRB5-AES]
[reserved] 0x8003 ? [GSS-KRB5]
Linux kernel now mainly supports type 15,16 so max cksum size is 20bytes.
(GSS_KRB5_MAX_CKSUM_LEN)
Re-use already existing define of GSS_KRB5_MAX_SLACK_NEEDED that's used
for encoding the gss_wrap tokens (same tokens are used in reply).
Fixes: 2c94b8eca1 ("SUNRPC: Use au_rslack when computing reply buffer size")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
UDP was originally disabled in 6da1a03436 for NFSv4. Later in
b24ee6c64c UDP is by default disabled by NFS_DISABLE_UDP_SUPPORT=y for
all NFS versions. Therefore remove v4 from error message.
Fixes: b24ee6c64c ("NFS: allow deprecation of NFS UDP protocol")
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
UDP is disabled by default in commit b24ee6c64c ("NFS: allow
deprecation of NFS UDP protocol"), but the default mount options
is still udp, change it to tcp to avoid the "Unsupported transport
protocol udp" error if no protocol is specified when mount nfs.
Fixes: b24ee6c64c ("NFS: allow deprecation of NFS UDP protocol")
Signed-off-by: Liwei Song <liwei.song@windriver.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When dreq is allocated by nfs_direct_req_alloc(), dreq->kref is
initialized to 2. Therefore we need to call nfs_direct_req_release()
twice to release the allocated dreq. Usually it is called in
nfs_file_direct_{read, write}() and nfs_direct_complete().
However, current code only calls nfs_direct_req_relese() once if
nfs_get_lock_context() fails in nfs_file_direct_{read, write}().
So, that case would result in memory leak.
Fix this by adding the missing call.
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
By preventing compiler inlining of the integrity and privacy
helpers, stack utilization for the common case (authentication only)
goes way down.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Clean up: this function is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
xdr_buf_read_mic() tries to find unused contiguous space in a
received xdr_buf in order to linearize the checksum for the call
to gss_verify_mic. However, the corner cases in this code are
numerous and we seem to keep missing them. I've just hit yet
another buffer overrun related to it.
This overrun is at the end of xdr_buf_read_mic():
1284 if (buf->tail[0].iov_len != 0)
1285 mic->data = buf->tail[0].iov_base + buf->tail[0].iov_len;
1286 else
1287 mic->data = buf->head[0].iov_base + buf->head[0].iov_len;
1288 __read_bytes_from_xdr_buf(&subbuf, mic->data, mic->len);
1289 return 0;
This logic assumes the transport has set the length of the tail
based on the size of the received message. base + len is then
supposed to be off the end of the message but still within the
actual buffer.
In fact, the length of the tail is set by the upper layer when the
Call is encoded so that the end of the tail is actually the end of
the allocated buffer itself. This causes the logic above to set
mic->data to point past the end of the receive buffer.
The "mic->data = head" arm of this if statement is no less fragile.
As near as I can tell, this has been a problem forever. I'm not sure
that minimizing au_rslack recently changed this pathology much.
So instead, let's use a more straightforward approach: kmalloc a
separate buffer to linearize the checksum. This is similar to
how gss_validate() currently works.
Coming back to this code, I had some trouble understanding what
was going on. So I've cleaned up the variable naming and added
a few comments that point back to the XDR definition in RFC 2203
to help guide future spelunkers, including myself.
As an added clean up, the functionality that was in
xdr_buf_read_mic() is folded directly into gss_unwrap_resp_integ(),
as that is its only caller.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>