2007-09-11 01:50:12 +08:00
|
|
|
/*
|
2017-10-31 04:22:14 +08:00
|
|
|
* Copyright (c) 2014-2017 Oracle. All rights reserved.
|
2007-09-11 01:50:12 +08:00
|
|
|
* Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
|
|
|
|
*
|
|
|
|
* This software is available to you under a choice of one of two
|
|
|
|
* licenses. You may choose to be licensed under the terms of the GNU
|
|
|
|
* General Public License (GPL) Version 2, available from the file
|
|
|
|
* COPYING in the main directory of this source tree, or the BSD-type
|
|
|
|
* license below:
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
*
|
|
|
|
* Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
*
|
|
|
|
* Redistributions in binary form must reproduce the above
|
|
|
|
* copyright notice, this list of conditions and the following
|
|
|
|
* disclaimer in the documentation and/or other materials provided
|
|
|
|
* with the distribution.
|
|
|
|
*
|
|
|
|
* Neither the name of the Network Appliance, Inc. nor the names of
|
|
|
|
* its contributors may be used to endorse or promote products
|
|
|
|
* derived from this software without specific prior written
|
|
|
|
* permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
|
|
|
|
* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
|
|
|
|
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
|
|
|
|
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
|
|
|
|
* OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
|
|
|
|
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
|
|
|
|
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
|
|
|
|
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
|
|
|
|
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
|
|
|
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
|
|
|
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* transport.c
|
|
|
|
*
|
|
|
|
* This file contains the top-level implementation of an RPC RDMA
|
|
|
|
* transport.
|
|
|
|
*
|
|
|
|
* Naming convention: functions beginning with xprt_ are part of the
|
|
|
|
* transport switch. All others are RPC RDMA internal.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/module.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
|
|
|
#include <linux/slab.h>
|
2007-09-11 01:50:12 +08:00
|
|
|
#include <linux/seq_file.h>
|
2013-02-05 01:50:00 +08:00
|
|
|
#include <linux/sunrpc/addr.h>
|
2007-09-11 01:50:12 +08:00
|
|
|
|
|
|
|
#include "xprt_rdma.h"
|
|
|
|
|
2014-11-18 05:58:04 +08:00
|
|
|
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
|
2007-09-11 01:50:12 +08:00
|
|
|
# define RPCDBG_FACILITY RPCDBG_TRANS
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* tunables
|
|
|
|
*/
|
|
|
|
|
|
|
|
static unsigned int xprt_rdma_slot_table_entries = RPCRDMA_DEF_SLOT_TABLE;
|
2016-01-08 03:50:10 +08:00
|
|
|
unsigned int xprt_rdma_max_inline_read = RPCRDMA_DEF_INLINE;
|
2007-09-11 01:50:12 +08:00
|
|
|
static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
|
|
|
|
static unsigned int xprt_rdma_inline_write_padding;
|
2017-04-12 01:22:54 +08:00
|
|
|
unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
|
|
|
|
int xprt_rdma_pad_optimize;
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2014-11-18 05:58:04 +08:00
|
|
|
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
|
2007-09-11 01:50:12 +08:00
|
|
|
|
|
|
|
static unsigned int min_slot_table_size = RPCRDMA_MIN_SLOT_TABLE;
|
|
|
|
static unsigned int max_slot_table_size = RPCRDMA_MAX_SLOT_TABLE;
|
2016-05-03 02:40:48 +08:00
|
|
|
static unsigned int min_inline_size = RPCRDMA_MIN_INLINE;
|
|
|
|
static unsigned int max_inline_size = RPCRDMA_MAX_INLINE;
|
2007-09-11 01:50:12 +08:00
|
|
|
static unsigned int zero;
|
|
|
|
static unsigned int max_padding = PAGE_SIZE;
|
|
|
|
static unsigned int min_memreg = RPCRDMA_BOUNCEBUFFERS;
|
|
|
|
static unsigned int max_memreg = RPCRDMA_LAST - 1;
|
|
|
|
|
|
|
|
static struct ctl_table_header *sunrpc_table_header;
|
|
|
|
|
2013-06-12 14:04:25 +08:00
|
|
|
static struct ctl_table xr_tunables_table[] = {
|
2007-09-11 01:50:12 +08:00
|
|
|
{
|
|
|
|
.procname = "rdma_slot_table_entries",
|
|
|
|
.data = &xprt_rdma_slot_table_entries,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 19:11:48 +08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2007-09-11 01:50:12 +08:00
|
|
|
.extra1 = &min_slot_table_size,
|
|
|
|
.extra2 = &max_slot_table_size
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "rdma_max_inline_read",
|
|
|
|
.data = &xprt_rdma_max_inline_read,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2016-09-15 22:57:32 +08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2016-05-03 02:40:48 +08:00
|
|
|
.extra1 = &min_inline_size,
|
|
|
|
.extra2 = &max_inline_size,
|
2007-09-11 01:50:12 +08:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "rdma_max_inline_write",
|
|
|
|
.data = &xprt_rdma_max_inline_write,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2016-09-15 22:57:32 +08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2016-05-03 02:40:48 +08:00
|
|
|
.extra1 = &min_inline_size,
|
|
|
|
.extra2 = &max_inline_size,
|
2007-09-11 01:50:12 +08:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "rdma_inline_write_padding",
|
|
|
|
.data = &xprt_rdma_inline_write_padding,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 19:11:48 +08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2007-09-11 01:50:12 +08:00
|
|
|
.extra1 = &zero,
|
|
|
|
.extra2 = &max_padding,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.procname = "rdma_memreg_strategy",
|
|
|
|
.data = &xprt_rdma_memreg_strategy,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 19:11:48 +08:00
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2007-09-11 01:50:12 +08:00
|
|
|
.extra1 = &min_memreg,
|
|
|
|
.extra2 = &max_memreg,
|
|
|
|
},
|
2008-10-10 03:01:11 +08:00
|
|
|
{
|
|
|
|
.procname = "rdma_pad_optimize",
|
|
|
|
.data = &xprt_rdma_pad_optimize,
|
|
|
|
.maxlen = sizeof(unsigned int),
|
|
|
|
.mode = 0644,
|
2009-11-16 19:11:48 +08:00
|
|
|
.proc_handler = proc_dointvec,
|
2008-10-10 03:01:11 +08:00
|
|
|
},
|
2009-11-06 05:32:03 +08:00
|
|
|
{ },
|
2007-09-11 01:50:12 +08:00
|
|
|
};
|
|
|
|
|
2013-06-12 14:04:25 +08:00
|
|
|
static struct ctl_table sunrpc_table[] = {
|
2007-09-11 01:50:12 +08:00
|
|
|
{
|
|
|
|
.procname = "sunrpc",
|
|
|
|
.mode = 0555,
|
|
|
|
.child = xr_tunables_table
|
|
|
|
},
|
2009-11-06 05:32:03 +08:00
|
|
|
{ },
|
2007-09-11 01:50:12 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
#endif
|
|
|
|
|
2017-08-02 00:00:39 +08:00
|
|
|
static const struct rpc_xprt_ops xprt_rdma_procs;
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2015-03-31 02:33:43 +08:00
|
|
|
static void
|
|
|
|
xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
|
|
|
|
{
|
|
|
|
struct sockaddr_in *sin = (struct sockaddr_in *)sap;
|
|
|
|
char buf[20];
|
|
|
|
|
|
|
|
snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
|
|
|
|
xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
|
|
|
|
|
|
|
|
xprt->address_strings[RPC_DISPLAY_NETID] = RPCBIND_NETID_RDMA;
|
|
|
|
}
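As a user-space illustration of the hexadecimal form built by xprt_rdma_format_addresses4 above, the sketch below formats a dotted-quad string the same way ("%08x" applied to the host-order address). The helper name sketch_hex_addr is hypothetical, and it relies on POSIX inet_pton() rather than on any kernel interface.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the 8-digit hex form of an IPv4 address
 * into buf. Returns 0 on success, -1 if "dotted" is not a valid
 * IPv4 address string. */
static int sketch_hex_addr(const char *dotted, char *buf, size_t len)
{
	struct in_addr addr;

	assert(dotted != NULL && buf != NULL && len > 0);
	if (inet_pton(AF_INET, dotted, &addr) != 1)
		return -1;
	/* s_addr is in network byte order; convert before printing */
	snprintf(buf, len, "%08x", (unsigned int)ntohl(addr.s_addr));
	return 0;
}
```

A 20-byte buffer, as used in the kernel function, is more than enough for the eight hex digits plus the terminating NUL.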
|
|
|
|
|
|
|
|
static void
|
|
|
|
xprt_rdma_format_addresses6(struct rpc_xprt *xprt, struct sockaddr *sap)
|
|
|
|
{
|
|
|
|
struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sap;
|
|
|
|
char buf[40];
|
|
|
|
|
|
|
|
snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
|
|
|
|
xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
|
|
|
|
|
|
|
|
xprt->address_strings[RPC_DISPLAY_NETID] = RPCBIND_NETID_RDMA6;
|
|
|
|
}
|
|
|
|
|
2016-01-08 03:50:10 +08:00
|
|
|
void
|
2015-08-04 01:02:41 +08:00
|
|
|
xprt_rdma_format_addresses(struct rpc_xprt *xprt, struct sockaddr *sap)
|
2007-09-11 01:50:12 +08:00
|
|
|
{
|
2015-03-31 02:33:43 +08:00
|
|
|
char buf[128];
|
|
|
|
|
|
|
|
switch (sap->sa_family) {
|
|
|
|
case AF_INET:
|
|
|
|
xprt_rdma_format_addresses4(xprt, sap);
|
|
|
|
break;
|
|
|
|
case AF_INET6:
|
|
|
|
xprt_rdma_format_addresses6(xprt, sap);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
pr_err("rpcrdma: Unrecognized address family\n");
|
|
|
|
return;
|
|
|
|
}
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2009-08-10 03:09:36 +08:00
|
|
|
(void)rpc_ntop(sap, buf, sizeof(buf));
|
|
|
|
xprt->address_strings[RPC_DISPLAY_ADDR] = kstrdup(buf, GFP_KERNEL);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2010-03-09 04:15:59 +08:00
|
|
|
snprintf(buf, sizeof(buf), "%u", rpc_get_port(sap));
|
2009-08-10 03:09:36 +08:00
|
|
|
xprt->address_strings[RPC_DISPLAY_PORT] = kstrdup(buf, GFP_KERNEL);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2010-03-09 04:15:59 +08:00
|
|
|
snprintf(buf, sizeof(buf), "%4hx", rpc_get_port(sap));
|
2009-08-10 03:09:36 +08:00
|
|
|
xprt->address_strings[RPC_DISPLAY_HEX_PORT] = kstrdup(buf, GFP_KERNEL);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2015-03-31 02:33:43 +08:00
|
|
|
xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
|
2007-09-11 01:50:12 +08:00
|
|
|
}
|
|
|
|
|
2016-01-08 03:50:10 +08:00
|
|
|
void
|
2007-09-11 01:50:12 +08:00
|
|
|
xprt_rdma_free_addresses(struct rpc_xprt *xprt)
|
|
|
|
{
|
2008-01-15 01:32:20 +08:00
|
|
|
unsigned int i;
|
|
|
|
|
|
|
|
for (i = 0; i < RPC_DISPLAY_MAX; i++)
|
|
|
|
switch (i) {
|
|
|
|
case RPC_DISPLAY_PROTO:
|
|
|
|
case RPC_DISPLAY_NETID:
|
|
|
|
continue;
|
|
|
|
default:
|
|
|
|
kfree(xprt->address_strings[i]);
|
|
|
|
}
|
2007-09-11 01:50:12 +08:00
|
|
|
}
|
|
|
|
|
2016-11-29 23:53:37 +08:00
|
|
|
void
|
|
|
|
rpcrdma_conn_func(struct rpcrdma_ep *ep)
|
|
|
|
{
|
|
|
|
schedule_delayed_work(&ep->rep_connect_worker, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
rpcrdma_connect_worker(struct work_struct *work)
|
|
|
|
{
|
|
|
|
struct rpcrdma_ep *ep =
|
|
|
|
container_of(work, struct rpcrdma_ep, rep_connect_worker.work);
|
|
|
|
struct rpcrdma_xprt *r_xprt =
|
|
|
|
container_of(ep, struct rpcrdma_xprt, rx_ep);
|
|
|
|
struct rpc_xprt *xprt = &r_xprt->rx_xprt;
|
|
|
|
|
|
|
|
spin_lock_bh(&xprt->transport_lock);
|
|
|
|
if (++xprt->connect_cookie == 0) /* maintain a reserved value */
|
|
|
|
++xprt->connect_cookie;
|
|
|
|
if (ep->rep_connected > 0) {
|
|
|
|
if (!xprt_test_and_set_connected(xprt))
|
|
|
|
xprt_wake_pending_tasks(xprt, 0);
|
|
|
|
} else {
|
|
|
|
if (xprt_test_and_clear_connected(xprt))
|
|
|
|
xprt_wake_pending_tasks(xprt, -ENOTCONN);
|
|
|
|
}
|
|
|
|
spin_unlock_bh(&xprt->transport_lock);
|
|
|
|
}
|
|
|
|
|
2007-09-11 01:50:12 +08:00
|
|
|
static void
|
|
|
|
xprt_rdma_connect_worker(struct work_struct *work)
|
|
|
|
{
|
2015-01-22 00:02:37 +08:00
|
|
|
struct rpcrdma_xprt *r_xprt = container_of(work, struct rpcrdma_xprt,
|
|
|
|
rx_connect_worker.work);
|
|
|
|
struct rpc_xprt *xprt = &r_xprt->rx_xprt;
|
2007-09-11 01:50:12 +08:00
|
|
|
int rc = 0;
|
|
|
|
|
2012-09-12 05:21:25 +08:00
|
|
|
xprt_clear_connected(xprt);
|
|
|
|
|
|
|
|
dprintk("RPC: %s: %sconnect\n", __func__,
|
|
|
|
r_xprt->rx_ep.rep_connected != 0 ? "re" : "");
|
|
|
|
rc = rpcrdma_ep_connect(&r_xprt->rx_ep, &r_xprt->rx_ia);
|
|
|
|
if (rc)
|
|
|
|
xprt_wake_pending_tasks(xprt, rc);
|
|
|
|
|
2007-09-11 01:50:12 +08:00
|
|
|
dprintk("RPC: %s: exit\n", __func__);
|
|
|
|
xprt_clear_connecting(xprt);
|
|
|
|
}
|
|
|
|
|
2015-05-12 02:02:25 +08:00
|
|
|
static void
|
|
|
|
xprt_rdma_inject_disconnect(struct rpc_xprt *xprt)
|
|
|
|
{
|
|
|
|
struct rpcrdma_xprt *r_xprt = container_of(xprt, struct rpcrdma_xprt,
|
|
|
|
rx_xprt);
|
|
|
|
|
|
|
|
pr_info("rpcrdma: injecting transport disconnect on xprt=%p\n", xprt);
|
|
|
|
rdma_disconnect(r_xprt->rx_ia.ri_id);
|
|
|
|
}
|
|
|
|
|
2007-09-11 01:50:12 +08:00
|
|
|
/*
|
|
|
|
* xprt_rdma_destroy
|
|
|
|
*
|
|
|
|
* Destroy the xprt.
|
|
|
|
* Free all memory associated with the object, including its own.
|
|
|
|
* NOTE: none of the *destroy methods free memory for their top-level
|
|
|
|
* objects, even though they may have allocated it (they do free
|
|
|
|
* private memory). It's up to the caller to handle it. In this
|
|
|
|
* case (RDMA transport), all structure memory is inlined with the
|
|
|
|
* struct rpcrdma_xprt.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
xprt_rdma_destroy(struct rpc_xprt *xprt)
|
|
|
|
{
|
|
|
|
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
|
|
|
|
|
|
|
|
dprintk("RPC: %s: called\n", __func__);
|
|
|
|
|
2015-01-22 00:02:37 +08:00
|
|
|
cancel_delayed_work_sync(&r_xprt->rx_connect_worker);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
|
|
|
xprt_clear_connected(xprt);
|
|
|
|
|
2014-05-28 22:33:16 +08:00
|
|
|
rpcrdma_ep_destroy(&r_xprt->rx_ep, &r_xprt->rx_ia);
|
2015-09-22 01:24:23 +08:00
|
|
|
rpcrdma_buffer_destroy(&r_xprt->rx_buf);
|
2007-09-11 01:50:12 +08:00
|
|
|
rpcrdma_ia_close(&r_xprt->rx_ia);
|
|
|
|
|
|
|
|
xprt_rdma_free_addresses(xprt);
|
|
|
|
|
2010-09-29 20:03:13 +08:00
|
|
|
xprt_free(xprt);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
|
|
|
dprintk("RPC: %s: returning\n", __func__);
|
|
|
|
|
|
|
|
module_put(THIS_MODULE);
|
|
|
|
}
|
|
|
|
|
2007-12-21 05:03:54 +08:00
|
|
|
static const struct rpc_timeout xprt_rdma_default_timeout = {
|
|
|
|
.to_initval = 60 * HZ,
|
|
|
|
.to_maxval = 60 * HZ,
|
|
|
|
};
|
|
|
|
|
2007-09-11 01:50:12 +08:00
|
|
|
/**
|
|
|
|
* xprt_setup_rdma - Set up transport to use RDMA
|
|
|
|
*
|
|
|
|
* @args: rpc transport arguments
|
|
|
|
*/
|
|
|
|
static struct rpc_xprt *
|
|
|
|
xprt_setup_rdma(struct xprt_create *args)
|
|
|
|
{
|
|
|
|
struct rpcrdma_create_data_internal cdata;
|
|
|
|
struct rpc_xprt *xprt;
|
|
|
|
struct rpcrdma_xprt *new_xprt;
|
|
|
|
struct rpcrdma_ep *new_ep;
|
2015-08-04 01:02:41 +08:00
|
|
|
struct sockaddr *sap;
|
2007-09-11 01:50:12 +08:00
|
|
|
int rc;
|
|
|
|
|
|
|
|
if (args->addrlen > sizeof(xprt->addr)) {
|
|
|
|
dprintk("RPC: %s: address too large\n", __func__);
|
|
|
|
return ERR_PTR(-EBADF);
|
|
|
|
}
|
|
|
|
|
2010-09-29 20:05:43 +08:00
|
|
|
xprt = xprt_alloc(args->net, sizeof(struct rpcrdma_xprt),
|
2011-07-18 06:11:30 +08:00
|
|
|
xprt_rdma_slot_table_entries,
|
2010-09-29 20:02:43 +08:00
|
|
|
xprt_rdma_slot_table_entries);
|
2007-09-11 01:50:12 +08:00
|
|
|
if (xprt == NULL) {
|
|
|
|
dprintk("RPC: %s: couldn't allocate rpcrdma_xprt\n",
|
|
|
|
__func__);
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* 60 second timeout, no retries */
|
2007-12-21 05:03:55 +08:00
|
|
|
xprt->timeout = &xprt_rdma_default_timeout;
|
2014-05-28 22:34:32 +08:00
|
|
|
xprt->bind_timeout = RPCRDMA_BIND_TO;
|
|
|
|
xprt->reestablish_timeout = RPCRDMA_INIT_REEST_TO;
|
|
|
|
xprt->idle_timeout = RPCRDMA_IDLE_DISC_TO;
|
2007-09-11 01:50:12 +08:00
|
|
|
|
|
|
|
xprt->resvport = 0; /* privileged port not needed */
|
|
|
|
xprt->tsh_size = 0; /* RPC-RDMA handles framing */
|
|
|
|
xprt->ops = &xprt_rdma_procs;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set up RDMA-specific connect data.
|
|
|
|
*/
|
|
|
|
|
2015-08-04 01:02:41 +08:00
|
|
|
sap = (struct sockaddr *)&cdata.addr;
|
|
|
|
memcpy(sap, args->dstaddr, args->addrlen);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
|
|
|
/* Ensure xprt->addr holds valid server TCP (not RDMA)
|
|
|
|
* address, for any side protocols which peek at it */
|
|
|
|
xprt->prot = IPPROTO_TCP;
|
|
|
|
xprt->addrlen = args->addrlen;
|
2015-08-04 01:02:41 +08:00
|
|
|
memcpy(&xprt->addr, sap, xprt->addrlen);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2015-08-04 01:02:41 +08:00
|
|
|
if (rpc_get_port(sap))
|
2007-09-11 01:50:12 +08:00
|
|
|
xprt_set_bound(xprt);
|
|
|
|
|
|
|
|
cdata.max_requests = xprt->max_reqs;
|
|
|
|
|
|
|
|
cdata.rsize = RPCRDMA_MAX_SEGS * PAGE_SIZE; /* RDMA write max */
|
|
|
|
cdata.wsize = RPCRDMA_MAX_SEGS * PAGE_SIZE; /* RDMA read max */
|
|
|
|
|
|
|
|
cdata.inline_wsize = xprt_rdma_max_inline_write;
|
|
|
|
if (cdata.inline_wsize > cdata.wsize)
|
|
|
|
cdata.inline_wsize = cdata.wsize;
|
|
|
|
|
|
|
|
cdata.inline_rsize = xprt_rdma_max_inline_read;
|
|
|
|
if (cdata.inline_rsize > cdata.rsize)
|
|
|
|
cdata.inline_rsize = cdata.rsize;
|
|
|
|
|
|
|
|
cdata.padding = xprt_rdma_inline_write_padding;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create new transport instance, which includes initialized
|
|
|
|
* o ia
|
|
|
|
* o endpoint
|
|
|
|
* o buffers
|
|
|
|
*/
|
|
|
|
|
|
|
|
new_xprt = rpcx_to_rdmax(xprt);
|
|
|
|
|
2017-04-12 01:22:54 +08:00
|
|
|
rc = rpcrdma_ia_open(new_xprt, sap);
|
2007-09-11 01:50:12 +08:00
|
|
|
if (rc)
|
|
|
|
goto out1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* initialize and create ep
|
|
|
|
*/
|
|
|
|
new_xprt->rx_data = cdata;
|
|
|
|
new_ep = &new_xprt->rx_ep;
|
|
|
|
new_ep->rep_remote_addr = cdata.addr;
|
|
|
|
|
|
|
|
rc = rpcrdma_ep_create(&new_xprt->rx_ep,
|
|
|
|
&new_xprt->rx_ia, &new_xprt->rx_data);
|
|
|
|
if (rc)
|
|
|
|
goto out2;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate pre-registered send and receive buffers for headers and
|
|
|
|
* any inline data. Also specify any padding which will be provided
|
|
|
|
* from a preregistered zero buffer.
|
|
|
|
*/
|
2015-01-22 00:03:44 +08:00
|
|
|
rc = rpcrdma_buffer_create(new_xprt);
|
2007-09-11 01:50:12 +08:00
|
|
|
if (rc)
|
|
|
|
goto out3;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Register a callback for connection events. This is necessary because
|
|
|
|
* connection loss notification is async. We also catch connection loss
|
|
|
|
* when reaping receives.
|
|
|
|
*/
|
2015-01-22 00:02:37 +08:00
|
|
|
INIT_DELAYED_WORK(&new_xprt->rx_connect_worker,
|
|
|
|
xprt_rdma_connect_worker);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2015-08-04 01:02:41 +08:00
|
|
|
xprt_rdma_format_addresses(xprt, sap);
|
2015-03-31 02:34:30 +08:00
|
|
|
xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
|
|
|
|
if (xprt->max_payload == 0)
|
|
|
|
goto out4;
|
|
|
|
xprt->max_payload <<= PAGE_SHIFT;
|
2014-07-30 05:23:34 +08:00
|
|
|
dprintk("RPC: %s: transport data payload maximum: %zu bytes\n",
|
|
|
|
__func__, xprt->max_payload);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
|
|
|
if (!try_module_get(THIS_MODULE))
|
|
|
|
goto out4;
|
|
|
|
|
2015-08-04 01:02:41 +08:00
|
|
|
dprintk("RPC: %s: %s:%s\n", __func__,
|
|
|
|
xprt->address_strings[RPC_DISPLAY_ADDR],
|
|
|
|
xprt->address_strings[RPC_DISPLAY_PORT]);
|
2007-09-11 01:50:12 +08:00
|
|
|
return xprt;
|
|
|
|
|
|
|
|
out4:
|
|
|
|
xprt_rdma_free_addresses(xprt);
|
|
|
|
rc = -EINVAL;
|
|
|
|
out3:
|
2014-05-28 22:33:16 +08:00
|
|
|
rpcrdma_ep_destroy(new_ep, &new_xprt->rx_ia);
|
2007-09-11 01:50:12 +08:00
|
|
|
out2:
|
|
|
|
rpcrdma_ia_close(&new_xprt->rx_ia);
|
|
|
|
out1:
|
2010-09-29 20:03:13 +08:00
|
|
|
xprt_free(xprt);
|
2007-09-11 01:50:12 +08:00
|
|
|
return ERR_PTR(rc);
|
|
|
|
}
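The out1 to out4 labels in xprt_setup_rdma follow the common kernel idiom of unwinding in reverse order of setup. The stand-alone sketch below shows only the shape of that idiom with three dependent steps. All names (sketch_live, sketch_step, sketch_undo, sketch_setup) are hypothetical, and the sketch makes no claim about which resources the real function releases at each label.

```c
#include <assert.h>

/* Hypothetical bookkeeping: one bit per step that is currently set up. */
static unsigned int sketch_live;

/* Perform one setup step, or fail without side effects. */
static int sketch_step(unsigned int bit, int should_fail)
{
	assert(bit != 0);
	if (should_fail)
		return -1;
	sketch_live |= bit;
	return 0;
}

/* Undo one previously completed step. */
static void sketch_undo(unsigned int bit)
{
	sketch_live &= ~bit;
}

/* Set up three dependent steps. fail_at (1..3) forces that step to
 * fail; 0 lets all steps succeed. On failure, every step that had
 * already completed is undone in reverse order. */
static int sketch_setup(int fail_at)
{
	int rc;

	rc = sketch_step(1u << 0, fail_at == 1);
	if (rc)
		goto out1;
	rc = sketch_step(1u << 1, fail_at == 2);
	if (rc)
		goto out2;
	rc = sketch_step(1u << 2, fail_at == 3);
	if (rc)
		goto out3;
	return 0;

out3:
	sketch_undo(1u << 1);
out2:
	sketch_undo(1u << 0);
out1:
	return rc;
}
```

Each label undoes exactly the steps that completed before the failing one, so a single return path serves every failure case.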
|
|
|
|
|
2017-04-12 01:23:10 +08:00
|
|
|
/**
|
|
|
|
* xprt_rdma_close - Close down RDMA connection
|
|
|
|
* @xprt: generic transport to be closed
|
|
|
|
*
|
|
|
|
* Called during transport shutdown, reconnect, or device
|
|
|
|
* removal. Caller holds the transport's write lock.
|
2007-09-11 01:50:12 +08:00
|
|
|
*/
|
|
|
|
static void
|
|
|
|
xprt_rdma_close(struct rpc_xprt *xprt)
|
|
|
|
{
|
|
|
|
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
|
2017-04-12 01:23:10 +08:00
|
|
|
struct rpcrdma_ep *ep = &r_xprt->rx_ep;
|
|
|
|
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
|
|
|
|
|
|
|
|
dprintk("RPC: %s: closing xprt %p\n", __func__, xprt);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2017-04-12 01:23:10 +08:00
|
|
|
if (test_and_clear_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags)) {
|
|
|
|
xprt_clear_connected(xprt);
|
|
|
|
rpcrdma_ia_remove(ia);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (ep->rep_connected == -ENODEV)
|
|
|
|
return;
|
|
|
|
if (ep->rep_connected > 0)
|
2008-10-10 23:32:34 +08:00
|
|
|
xprt->reestablish_timeout = 0;
|
2007-11-07 07:44:20 +08:00
|
|
|
xprt_disconnect_done(xprt);
|
2017-04-12 01:23:10 +08:00
|
|
|
rpcrdma_ep_disconnect(ep, ia);
|
2007-09-11 01:50:12 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
xprt_rdma_set_port(struct rpc_xprt *xprt, u16 port)
|
|
|
|
{
|
|
|
|
struct sockaddr_in *sap;
|
|
|
|
|
|
|
|
sap = (struct sockaddr_in *)&xprt->addr;
|
|
|
|
sap->sin_port = htons(port);
|
|
|
|
sap = (struct sockaddr_in *)&rpcx_to_rdmad(xprt).addr;
|
|
|
|
sap->sin_port = htons(port);
|
|
|
|
dprintk("RPC: %s: %u\n", __func__, port);
|
|
|
|
}
|
|
|
|
|
xprtrdma: Detect unreachable NFS/RDMA servers more reliably
Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.
When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.
However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.
In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.
To address this, forcibly disconnect a transport when an RPC times
out. This prevents moribund connections from stopping the
detection of failover or other configuration changes on the server.
Note that even if the connection is still good, retransmitting
any RPC will trigger a disconnect thanks to this logic in
xprt_rdma_send_request:
/* Must suppress retransmit to maintain credits */
if (req->rl_connect_cookie == xprt->connect_cookie)
goto drop_connection;
req->rl_connect_cookie = xprt->connect_cookie;
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-12 01:22:46 +08:00
|
|
|
/**
|
|
|
|
* xprt_rdma_timer - invoked when an RPC times out
|
|
|
|
* @xprt: controlling RPC transport
|
|
|
|
* @task: RPC task that timed out
|
|
|
|
*
|
|
|
|
* Invoked when the transport is still connected, but an RPC
|
|
|
|
* retransmit timeout occurs.
|
|
|
|
*
|
|
|
|
* Since RDMA connections don't have a keep-alive, forcibly
|
|
|
|
* disconnect and try to reconnect. This drives full
|
|
|
|
* detection of the network path, and retransmissions of
|
|
|
|
* all pending RPCs.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task)
|
|
|
|
{
|
|
|
|
dprintk("RPC: %5u %s: xprt = %p\n", task->tk_pid, __func__, xprt);
|
|
|
|
|
|
|
|
xprt_force_disconnect(xprt);
|
|
|
|
}
|
|
|
|
|
2007-09-11 01:50:12 +08:00
|
|
|
static void
|
2013-01-08 22:26:49 +08:00
|
|
|
xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
|
2007-09-11 01:50:12 +08:00
|
|
|
{
|
|
|
|
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
|
|
|
|
|
2010-04-17 04:41:57 +08:00
|
|
|
if (r_xprt->rx_ep.rep_connected != 0) {
|
|
|
|
/* Reconnect */
|
2015-01-22 00:02:37 +08:00
|
|
|
schedule_delayed_work(&r_xprt->rx_connect_worker,
|
|
|
|
xprt->reestablish_timeout);
|
2010-04-17 04:41:57 +08:00
|
|
|
xprt->reestablish_timeout <<= 1;
|
2014-05-28 22:34:32 +08:00
|
|
|
if (xprt->reestablish_timeout > RPCRDMA_MAX_REEST_TO)
|
|
|
|
xprt->reestablish_timeout = RPCRDMA_MAX_REEST_TO;
|
|
|
|
else if (xprt->reestablish_timeout < RPCRDMA_INIT_REEST_TO)
|
|
|
|
xprt->reestablish_timeout = RPCRDMA_INIT_REEST_TO;
|
2010-04-17 04:41:57 +08:00
|
|
|
} else {
|
2015-01-22 00:02:37 +08:00
|
|
|
schedule_delayed_work(&r_xprt->rx_connect_worker, 0);
|
2010-04-17 04:41:57 +08:00
|
|
|
if (!RPC_IS_ASYNC(task))
|
2015-01-22 00:02:37 +08:00
|
|
|
flush_delayed_work(&r_xprt->rx_connect_worker);
|
2007-09-11 01:50:12 +08:00
|
|
|
}
|
|
|
|
}
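The reconnect branch of xprt_rdma_connect doubles the delay after each attempt and clamps it to a fixed range. A stand-alone sketch of that arithmetic follows. The two limits are hypothetical stand-ins for RPCRDMA_INIT_REEST_TO and RPCRDMA_MAX_REEST_TO, whose real values are defined in xprt_rdma.h and are not shown in this file.

```c
#include <assert.h>

/* Hypothetical limits, chosen only for illustration. */
#define SKETCH_INIT_REEST_TO 5UL
#define SKETCH_MAX_REEST_TO  30UL

/* Given the delay just used, return the delay for the next reconnect
 * attempt: double it, then clamp to [INIT, MAX]. */
static unsigned long sketch_next_reestablish_timeout(unsigned long cur)
{
	unsigned long next = cur << 1;

	assert(SKETCH_INIT_REEST_TO <= SKETCH_MAX_REEST_TO);
	if (next > SKETCH_MAX_REEST_TO)
		next = SKETCH_MAX_REEST_TO;
	else if (next < SKETCH_INIT_REEST_TO)
		next = SKETCH_INIT_REEST_TO;
	return next;
}
```

The lower clamp matters because xprt_rdma_close resets reestablish_timeout to 0 after a clean disconnect; doubling 0 would otherwise leave the delay stuck at 0.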
|
|
|
|
|
xprtrdma: Initialize separate RPC call and reply buffers
RPC-over-RDMA needs to separate its RPC call and reply buffers.
o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
Send operation using DMA_TO_DEVICE
o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
as part of a Reply chunk using DMA_FROM_DEVICE
The two mappings are for data movement in opposite directions.
DMA-API.txt suggests that if these mappings share a DMA cacheline,
bad things can happen. This could occur in the final bytes of
rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
happen to share a DMA cacheline.
On x86_64 the cacheline size is typically 8 bytes, and RPC call
messages are usually much smaller than the send buffer, so this
hasn't been a noticeable problem. But the DMA cacheline size can be
larger on other platforms.
Also, often rq_rcv_buf starts most of the way into a page, thus
an additional RDMA segment is needed to map and register the end of
that buffer. Try to avoid that scenario to reduce the cost of
registering and invalidating Reply chunks.
Instead of carrying a single regbuf that covers both rq_snd_buf and
rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
rq_snd_buf and one regbuf for rq_rcv_buf.
Some incidental changes worth noting:
- To clear out some spaghetti, refactor xprt_rdma_allocate.
- The value stored in rg_size is the same as the value stored in
the iov.length field, so eliminate rg_size
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
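The hazard described in the commit message above, where the tail of one buffer and the head of the next fall in the same DMA cacheline, can be checked with simple address arithmetic. The sketch below is a user-space illustration only: sketch_share_cacheline is a hypothetical name, addresses are passed as plain integers, and the cacheline size is a parameter assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the last byte of buffer A, which occupies
 * [a, a + a_len), lies in the same cacheline as the first byte of
 * buffer B. "line" is the cacheline size in bytes and must be a
 * power of two. */
static bool sketch_share_cacheline(uintptr_t a, size_t a_len,
				   uintptr_t b, size_t line)
{
	assert(a_len > 0 && line > 0 && (line & (line - 1)) == 0);

	uintptr_t mask = ~(uintptr_t)(line - 1);

	return ((a + a_len - 1) & mask) == (b & mask);
}
```

With separate allocations for the send and receive buffers, as the commit introduces, the two regions no longer sit back to back in one allocation, which is the case this check is meant to illustrate.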
2016-09-15 22:55:53 +08:00
|
|
|
/* Allocate a fixed-size buffer in which to construct and send the
|
|
|
|
* RPC-over-RDMA header for this request.
|
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
rpcrdma_get_rdmabuf(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
|
|
|
|
gfp_t flags)
|
|
|
|
{
|
2016-09-15 22:56:02 +08:00
|
|
|
size_t size = RPCRDMA_HDRBUF_SIZE;
|
2016-09-15 22:55:53 +08:00
|
|
|
struct rpcrdma_regbuf *rb;
|
|
|
|
|
|
|
|
if (req->rl_rdmabuf)
|
|
|
|
return true;
|
|
|
|
|
2016-09-15 22:56:26 +08:00
|
|
|
rb = rpcrdma_alloc_regbuf(size, DMA_TO_DEVICE, flags);
|
2016-09-15 22:55:53 +08:00
|
|
|
if (IS_ERR(rb))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
r_xprt->rx_stats.hardway_register_count += size;
|
|
|
|
req->rl_rdmabuf = rb;
|
2017-08-11 00:47:28 +08:00
|
|
|
xdr_buf_init(&req->rl_hdrbuf, rb->rg_base, rdmab_length(rb));
|
2016-09-15 22:55:53 +08:00
|
|
|
return true;
|
|
|
|
}
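The commit message above warns that a DMA_TO_DEVICE mapping and a DMA_FROM_DEVICE mapping should not share a DMA cacheline. The overlap condition it describes can be sketched in user space as follows. `sketch_shares_cacheline` and its parameters are hypothetical names chosen for illustration; they are not part of xprtrdma or the kernel DMA API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: report whether the last byte of
 * buffer A and the first byte of buffer B fall in the same cacheline.
 * Assumes len_a > 0 and that cacheline is a nonzero line size in bytes.
 */
int sketch_shares_cacheline(uintptr_t start_a, size_t len_a,
			    uintptr_t start_b, size_t cacheline)
{
	uintptr_t last_a = start_a + len_a - 1;

	return (last_a / cacheline) == (start_b / cacheline);
}
```

When one region is split into a call area followed directly by a reply area, the check is true unless the split happens to land on a line boundary, which is the situation the commit moves away from by giving each buffer its own regbuf.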
|
|
|
|
|
|
|
|
static bool
|
|
|
|
rpcrdma_get_sendbuf(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
|
|
|
|
size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
struct rpcrdma_regbuf *rb;
|
|
|
|
|
|
|
|
if (req->rl_sendbuf && rdmab_length(req->rl_sendbuf) >= size)
|
|
|
|
return true;
|
|
|
|
|
xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(i.e., one or more items in rq_snd_buf's page list) must be "pulled
up":
- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent
As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.
The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.
Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.
This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.
This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-15 22:57:24 +08:00
|
|
|
rb = rpcrdma_alloc_regbuf(size, DMA_TO_DEVICE, flags);
|
2016-09-15 22:55:53 +08:00
|
|
|
if (IS_ERR(rb))
|
|
|
|
return false;
|
|
|
|
|
2016-09-15 22:56:26 +08:00
|
|
|
rpcrdma_free_regbuf(req->rl_sendbuf);
|
2016-09-15 22:57:24 +08:00
|
|
|
r_xprt->rx_stats.hardway_register_count += size;
|
2016-09-15 22:55:53 +08:00
|
|
|
req->rl_sendbuf = rb;
|
|
|
|
return true;
|
|
|
|
}
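The helper above keeps an existing send buffer when it is already large enough, and otherwise allocates a replacement before freeing the old one. A minimal user-space sketch of that pattern follows. `struct sketch_buf` and `sketch_get_buf` are hypothetical stand-ins, and `malloc`/`free` take the place of the kernel's regbuf allocator, so this illustrates the control flow only.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a regbuf: it only remembers its size. */
struct sketch_buf {
	size_t len;
	char *base;
};

/*
 * Make sure *slot holds a buffer of at least `size` bytes.
 * Returns 1 on success, 0 if allocation failed (the old buffer is kept).
 */
int sketch_get_buf(struct sketch_buf **slot, size_t size)
{
	struct sketch_buf *nb;

	/* Fast path: the buffer from an earlier call is already big enough. */
	if (*slot && (*slot)->len >= size)
		return 1;

	/* Allocate the replacement first so a failure leaves *slot usable. */
	nb = malloc(sizeof(*nb));
	if (!nb)
		return 0;
	nb->base = malloc(size ? size : 1);
	if (!nb->base) {
		free(nb);
		return 0;
	}
	nb->len = size;

	/* Only now release the old, too-small buffer. */
	if (*slot) {
		free((*slot)->base);
		free(*slot);
	}
	*slot = nb;
	return 1;
}
```

Because buffers only ever grow, a request that is reused for many RPCs of similar size should settle on the fast path after its first few calls.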
|
|
|
|
|
|
|
|
/* The rq_rcv_buf is used only if a Reply chunk is necessary.
|
|
|
|
* The decision to use a Reply chunk is made later in
|
|
|
|
* rpcrdma_marshal_req. This buffer is registered at that time.
|
|
|
|
*
|
|
|
|
* Otherwise, the associated RPC Reply arrives in a separate
|
|
|
|
* Receive buffer, arbitrarily chosen by the HCA. The buffer
|
|
|
|
* allocated here for the RPC Reply is not utilized in that
|
|
|
|
* case. See rpcrdma_inline_fixup.
|
|
|
|
*
|
|
|
|
* A regbuf is used here to remember the buffer size.
|
|
|
|
*/
|
|
|
|
static bool
|
|
|
|
rpcrdma_get_recvbuf(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
|
|
|
|
size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
struct rpcrdma_regbuf *rb;
|
|
|
|
|
|
|
|
if (req->rl_recvbuf && rdmab_length(req->rl_recvbuf) >= size)
|
|
|
|
return true;
|
|
|
|
|
2016-09-15 22:56:26 +08:00
|
|
|
rb = rpcrdma_alloc_regbuf(size, DMA_NONE, flags);
|
2016-09-15 22:55:53 +08:00
|
|
|
if (IS_ERR(rb))
|
|
|
|
return false;
|
|
|
|
|
2016-09-15 22:56:26 +08:00
|
|
|
rpcrdma_free_regbuf(req->rl_recvbuf);
|
2016-09-15 22:55:53 +08:00
|
|
|
r_xprt->rx_stats.hardway_register_count += size;
|
|
|
|
req->rl_recvbuf = rb;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2016-09-15 22:55:20 +08:00
|
|
|
/**
|
|
|
|
* xprt_rdma_allocate - allocate transport resources for an RPC
|
|
|
|
* @task: RPC task
|
|
|
|
*
|
|
|
|
* Return values:
|
|
|
|
* 0: Success; rq_buffer points to RPC buffer to use
|
|
|
|
* ENOMEM: Out of memory, call again later
|
|
|
|
* EIO: A permanent error occurred, do not retry
|
|
|
|
*
|
2007-09-11 01:50:12 +08:00
|
|
|
* The RDMA allocate/free functions need the task structure as a place
|
2016-09-15 22:55:53 +08:00
|
|
|
* to hide the struct rpcrdma_req, which is necessary for the actual
|
|
|
|
* send/recv sequence.
|
2015-01-22 00:04:08 +08:00
|
|
|
*
|
2016-09-15 22:55:53 +08:00
|
|
|
* xprt_rdma_allocate provides buffers that are already mapped for
|
|
|
|
* DMA, and a local DMA lkey is provided for each.
|
2007-09-11 01:50:12 +08:00
|
|
|
*/
|
2016-09-15 22:55:20 +08:00
|
|
|
static int
|
|
|
|
xprt_rdma_allocate(struct rpc_task *task)
|
2007-09-11 01:50:12 +08:00
|
|
|
{
|
2016-09-15 22:55:20 +08:00
|
|
|
struct rpc_rqst *rqst = task->tk_rqstp;
|
|
|
|
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt);
|
2015-01-22 00:04:08 +08:00
|
|
|
struct rpcrdma_req *req;
|
2015-01-27 06:11:47 +08:00
|
|
|
gfp_t flags;
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2015-01-22 00:04:08 +08:00
|
|
|
req = rpcrdma_buffer_get(&r_xprt->rx_buf);
|
2014-05-28 22:35:06 +08:00
|
|
|
if (req == NULL)
|
2016-09-15 22:55:20 +08:00
|
|
|
return -ENOMEM;
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2016-01-08 03:50:10 +08:00
|
|
|
flags = RPCRDMA_DEF_GFP;
|
2015-01-27 06:11:47 +08:00
|
|
|
if (RPC_IS_SWAPPER(task))
|
|
|
|
flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
|
|
|
|
|
2016-09-15 22:55:53 +08:00
|
|
|
if (!rpcrdma_get_rdmabuf(r_xprt, req, flags))
|
|
|
|
goto out_fail;
|
|
|
|
if (!rpcrdma_get_sendbuf(r_xprt, req, rqst->rq_callsize, flags))
|
|
|
|
goto out_fail;
|
|
|
|
if (!rpcrdma_get_recvbuf(r_xprt, req, rqst->rq_rcvsize, flags))
|
|
|
|
goto out_fail;
|
|
|
|
|
|
|
|
dprintk("RPC: %5u %s: send size = %zu, recv size = %zu, req = %p\n",
|
|
|
|
task->tk_pid, __func__, rqst->rq_callsize,
|
|
|
|
rqst->rq_rcvsize, req);
|
2015-01-22 00:04:08 +08:00
|
|
|
|
2008-10-10 03:00:40 +08:00
|
|
|
req->rl_connect_cookie = 0; /* our reserved value */
|
2016-09-15 22:55:45 +08:00
|
|
|
rpcrdma_set_xprtdata(rqst, req);
|
2016-09-15 22:55:20 +08:00
|
|
|
rqst->rq_buffer = req->rl_sendbuf->rg_base;
|
2016-09-15 22:55:53 +08:00
|
|
|
rqst->rq_rbuffer = req->rl_recvbuf->rg_base;
|
2016-09-15 22:55:20 +08:00
|
|
|
return 0;
|
2015-01-22 00:04:08 +08:00
|
|
|
|
|
|
|
out_fail:
|
2007-09-11 01:50:12 +08:00
|
|
|
rpcrdma_buffer_put(req);
|
2016-09-15 22:55:20 +08:00
|
|
|
return -ENOMEM;
|
2007-09-11 01:50:12 +08:00
|
|
|
}
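xprt_rdma_allocate takes a request from the pool, tries to obtain three buffers in turn, and funnels every failure through a single `out_fail` label that returns the request to the pool. The sketch below mimics that shape in user space. All names (`sketch_req`, `sketch_allocate`, the allocator hooks) and the sizes are hypothetical. It also assumes, as the early-return checks in the helpers above suggest, that buffers already obtained stay attached to the request for the next attempt.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical request object; the flags stand in for the three regbufs. */
struct sketch_req {
	int have_rdmabuf;
	int have_sendbuf;
	int have_recvbuf;
	int in_pool;	/* 1 while the request sits in the free pool */
};

/* Hypothetical allocator hook: returns 0 to simulate memory pressure. */
typedef int (*sketch_alloc_fn)(size_t size);

int sketch_alloc_ok(size_t size)
{
	(void)size;
	return 1;
}

int sketch_alloc_small_only(size_t size)
{
	return size <= 1024;
}

int sketch_allocate(struct sketch_req *req, sketch_alloc_fn alloc,
		    size_t callsize, size_t rcvsize)
{
	req->in_pool = 0;	/* taken from the pool */

	if (!alloc(256))	/* header buffer; 256 is an arbitrary size */
		goto out_fail;
	req->have_rdmabuf = 1;
	if (!alloc(callsize))
		goto out_fail;
	req->have_sendbuf = 1;
	if (!alloc(rcvsize))
		goto out_fail;
	req->have_recvbuf = 1;
	return 0;

out_fail:
	/* Give the request back; the caller is expected to retry later. */
	req->in_pool = 1;
	return -ENOMEM;
}
```

A single failure label keeps the cleanup in one place, so adding a fourth buffer would not require another unwind path.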
|
|
|
|
|
2016-09-15 22:55:29 +08:00
|
|
|
/**
|
|
|
|
* xprt_rdma_free - release resources allocated by xprt_rdma_allocate
|
|
|
|
* @task: RPC task
|
|
|
|
*
|
|
|
|
* Caller guarantees rqst->rq_buffer is non-NULL.
|
2007-09-11 01:50:12 +08:00
|
|
|
*/
|
|
|
|
static void
|
2016-09-15 22:55:29 +08:00
|
|
|
xprt_rdma_free(struct rpc_task *task)
|
2007-09-11 01:50:12 +08:00
|
|
|
{
|
2016-09-15 22:55:29 +08:00
|
|
|
struct rpc_rqst *rqst = task->tk_rqstp;
|
|
|
|
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt);
|
|
|
|
struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2017-10-20 22:48:20 +08:00
|
|
|
if (test_bit(RPCRDMA_REQ_F_BACKCHANNEL, &req->rl_flags))
|
2015-12-17 06:22:14 +08:00
|
|
|
return;
|
|
|
|
|
2015-01-22 00:04:08 +08:00
|
|
|
dprintk("RPC: %s: called on 0x%p\n", __func__, req->rl_reply);
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2017-10-20 22:48:28 +08:00
|
|
|
if (test_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags))
|
|
|
|
rpcrdma_release_rqst(r_xprt, req);
|
2007-09-11 01:50:12 +08:00
|
|
|
rpcrdma_buffer_put(req);
|
|
|
|
}
|
|
|
|
|
2016-06-30 01:53:43 +08:00
|
|
|
/**
|
|
|
|
* xprt_rdma_send_request - marshal and send an RPC request
|
|
|
|
* @task: RPC task with an RPC message in rq_snd_buf
|
|
|
|
*
|
2017-04-12 01:23:10 +08:00
|
|
|
* Caller holds the transport's write lock.
|
|
|
|
*
|
2016-06-30 01:53:43 +08:00
|
|
|
* Return values:
|
|
|
|
* 0: The request has been sent
|
|
|
|
* ENOTCONN: Caller needs to invoke connect logic then call again
|
|
|
|
* ENOBUFS: Call again later to send the request
|
|
|
|
* EIO: A permanent error occurred. The request was not sent,
|
|
|
|
* and don't try it again
|
|
|
|
*
|
2007-09-11 01:50:12 +08:00
|
|
|
* send_request invokes the meat of RPC RDMA. It must do the following:
|
2016-06-30 01:53:43 +08:00
|
|
|
*
|
2007-09-11 01:50:12 +08:00
|
|
|
* 1. Marshal the RPC request into an RPC RDMA request, which means
|
|
|
|
* putting a header in front of data, and creating IOVs for RDMA
|
|
|
|
* from those in the request.
|
|
|
|
* 2. In marshaling, detect opportunities for RDMA, and use them.
|
|
|
|
* 3. Post a recv message to set up asynch completion, then send
|
|
|
|
* the request (rpcrdma_ep_post).
|
|
|
|
* 4. No partial sends are possible in the RPC-RDMA protocol (as in UDP).
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
xprt_rdma_send_request(struct rpc_task *task)
|
|
|
|
{
|
|
|
|
struct rpc_rqst *rqst = task->tk_rqstp;
|
2013-01-08 22:10:21 +08:00
|
|
|
struct rpc_xprt *xprt = rqst->rq_xprt;
|
2007-09-11 01:50:12 +08:00
|
|
|
struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
|
|
|
|
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
|
2014-07-30 05:23:43 +08:00
|
|
|
int rc = 0;
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2017-04-12 01:23:10 +08:00
|
|
|
if (!xprt_connected(xprt))
|
|
|
|
goto drop_connection;
|
|
|
|
|
2016-06-30 01:54:16 +08:00
|
|
|
/* On retransmit, remove any previously registered chunks */
|
2016-11-29 23:52:48 +08:00
|
|
|
if (unlikely(!list_empty(&req->rl_registered)))
|
2017-10-10 00:03:34 +08:00
|
|
|
r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt,
|
|
|
|
&req->rl_registered);
|
2016-06-30 01:54:16 +08:00
|
|
|
|
2017-08-11 00:47:12 +08:00
|
|
|
rc = rpcrdma_marshal_req(r_xprt, rqst);
|
2014-07-30 05:23:43 +08:00
|
|
|
if (rc < 0)
|
|
|
|
goto failed_marshal;
|
2007-09-11 01:50:12 +08:00
|
|
|
|
|
|
|
if (req->rl_reply == NULL) /* e.g. reconnection */
|
|
|
|
rpcrdma_recv_buffer_get(req);
|
|
|
|
|
2008-10-10 03:00:40 +08:00
|
|
|
/* Must suppress retransmit to maintain credits */
|
|
|
|
if (req->rl_connect_cookie == xprt->connect_cookie)
|
|
|
|
goto drop_connection;
|
|
|
|
req->rl_connect_cookie = xprt->connect_cookie;
|
|
|
|
|
2017-10-20 22:48:28 +08:00
|
|
|
set_bit(RPCRDMA_REQ_F_PENDING, &req->rl_flags);
|
2008-10-10 03:00:40 +08:00
|
|
|
if (rpcrdma_ep_post(&r_xprt->rx_ia, &r_xprt->rx_ep, req))
|
|
|
|
goto drop_connection;
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2010-05-14 00:51:49 +08:00
|
|
|
rqst->rq_xmit_bytes_sent += rqst->rq_snd_buf.len;
|
2007-09-11 01:50:12 +08:00
|
|
|
rqst->rq_bytes_sent = 0;
|
|
|
|
return 0;
|
2008-10-10 03:00:40 +08:00
|
|
|
|
2014-05-28 22:35:14 +08:00
|
|
|
failed_marshal:
|
2016-06-30 01:53:43 +08:00
|
|
|
if (rc != -ENOTCONN)
|
|
|
|
return rc;
|
2008-10-10 03:00:40 +08:00
|
|
|
drop_connection:
|
|
|
|
xprt_disconnect_done(xprt);
|
|
|
|
return -ENOTCONN; /* implies disconnect */
|
2007-09-11 01:50:12 +08:00
|
|
|
}
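The connect-cookie comparison in xprt_rdma_send_request stops a request from being posted twice on the same connection instance. A stripped-down sketch of that check follows. The structures and `sketch_may_post` are hypothetical. The sketch assumes the transport's cookie changes on every (re)connect and never equals 0, the value the allocation path above reserves for a request that has not been sent yet.

```c
/* Hypothetical, simplified state; the field name echoes the code above. */
struct sketch_xprt {
	unsigned int connect_cookie;
};

struct sketch_rqst {
	unsigned int connect_cookie;
};

/*
 * Returns 1 if the request may be posted on the current connection,
 * 0 if it was already posted on this connection instance.
 */
int sketch_may_post(struct sketch_rqst *req, const struct sketch_xprt *xprt)
{
	if (req->connect_cookie == xprt->connect_cookie)
		return 0;
	/* Remember which connection instance carried this request. */
	req->connect_cookie = xprt->connect_cookie;
	return 1;
}
```

In the function above, the "already posted" case is handled by dropping the connection, so a later attempt runs against a fresh cookie.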
|
|
|
|
|
2016-01-08 03:50:10 +08:00
|
|
|
void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
|
2007-09-11 01:50:12 +08:00
|
|
|
{
|
|
|
|
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
|
|
|
|
long idle_time = 0;
|
|
|
|
|
|
|
|
if (xprt_connected(xprt))
|
|
|
|
idle_time = (long)(jiffies - xprt->last_used) / HZ;
|
|
|
|
|
2015-08-04 01:04:36 +08:00
|
|
|
seq_puts(seq, "\txprt:\trdma ");
|
|
|
|
seq_printf(seq, "%u %lu %lu %lu %ld %lu %lu %lu %llu %llu ",
|
|
|
|
0, /* need a local port? */
|
|
|
|
xprt->stat.bind_count,
|
|
|
|
xprt->stat.connect_count,
|
|
|
|
xprt->stat.connect_time,
|
|
|
|
idle_time,
|
|
|
|
xprt->stat.sends,
|
|
|
|
xprt->stat.recvs,
|
|
|
|
xprt->stat.bad_xids,
|
|
|
|
xprt->stat.req_u,
|
|
|
|
xprt->stat.bklog_u);
|
2016-06-30 01:52:54 +08:00
|
|
|
seq_printf(seq, "%lu %lu %lu %llu %llu %llu %llu %lu %lu %lu %lu ",
|
2015-08-04 01:04:36 +08:00
|
|
|
r_xprt->rx_stats.read_chunk_count,
|
|
|
|
r_xprt->rx_stats.write_chunk_count,
|
|
|
|
r_xprt->rx_stats.reply_chunk_count,
|
|
|
|
r_xprt->rx_stats.total_rdma_request,
|
|
|
|
r_xprt->rx_stats.total_rdma_reply,
|
|
|
|
r_xprt->rx_stats.pullup_copy_count,
|
|
|
|
r_xprt->rx_stats.fixup_copy_count,
|
|
|
|
r_xprt->rx_stats.hardway_register_count,
|
|
|
|
r_xprt->rx_stats.failed_marshal_count,
|
2015-08-04 01:04:45 +08:00
|
|
|
r_xprt->rx_stats.bad_reply_count,
|
|
|
|
r_xprt->rx_stats.nomsg_call_count);
|
2017-10-20 22:48:36 +08:00
|
|
|
seq_printf(seq, "%lu %lu %lu %lu %lu %lu\n",
|
2016-06-30 01:52:54 +08:00
|
|
|
r_xprt->rx_stats.mrs_recovered,
|
2016-06-30 01:54:00 +08:00
|
|
|
r_xprt->rx_stats.mrs_orphaned,
|
2016-09-15 22:57:16 +08:00
|
|
|
r_xprt->rx_stats.mrs_allocated,
|
xprtrdma: Add data structure to manage RDMA Send arguments
Problem statement:
Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
enabled storage initiators don't handle delayed Send completion
correctly. If Send completion is delayed beyond the end of a ULP
transaction, the ULP may release resources that are still being used
by the HCA to complete a long-running Send operation.
This is a common design trait amongst our initiators. Most Send
operations are faster than the ULP transaction they are part of.
Waiting for a completion for these is typically unnecessary.
Infrequently, a network partition or some other problem crops up
where an ordering problem can occur. In NFS parlance, the RPC Reply
arrives and completes the RPC, but the HCA is still retrying the
Send WR that conveyed the RPC Call. In this case, the HCA can try
to use memory that has been invalidated or DMA unmapped, and the
connection is lost. If that memory has been re-used for something
else (possibly not related to NFS), the Send retransmission
exposes that data on the wire.
Thus we cannot assume that it is safe to release Send-related
resources just because a ULP reply has arrived.
After some analysis, we have determined that the completion
housekeeping will not be difficult for xprtrdma:
- Inline Send buffers are registered via the local DMA key, and
are already left DMA mapped for the lifetime of a transport
connection, thus no additional handling is necessary for those
- Gathered Sends involving page cache pages _will_ need to
DMA unmap those pages after the Send completes. But like
inline send buffers, they are registered via the local DMA key,
and thus will not need to be invalidated
In addition, RPC completion will need to wait for Send completion
in the latter case. However, nearly always, the Send that conveys
the RPC Call will have completed long before the RPC Reply
arrives, and thus no additional latency will be accrued.
Design notes:
In this patch, the rpcrdma_sendctx object is introduced, and a
lock-free circular queue is added to manage a set of them per
transport.
The RPC client's send path already prevents sending more than one
RPC Call at the same time. This allows us to treat the consumer
side of the queue (rpcrdma_sendctx_get_locked) as if there is a
single consumer thread.
The producer side of the queue (rpcrdma_sendctx_put_locked) is
invoked only from the Send completion handler, which is a single
thread of execution (soft IRQ).
The only care that needs to be taken is with the tail index, which
is shared between the producer and consumer. Only the producer
updates the tail index. The consumer compares the head with the
tail to ensure that a sendctx that is in use is never handed
out again (or, expressed more conventionally, the queue is empty).
When the sendctx queue empties completely, there are enough Sends
outstanding that posting more Send operations can result in a Send
Queue overflow. In this case, the ULP is told to wait and try again.
This introduces strong Send Queue accounting to xprtrdma.
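The queue discipline described above can be illustrated with a small user-space sketch. It is not the xprtrdma code: the names (sendctx_queue, sendctx_get, sendctx_put, SENDCTX_COUNT) are hypothetical, C11 atomics stand in for the kernel's memory-ordering primitives, and clearing unmap_count stands in for DMA unmapping. It assumes exactly one consumer (the send path) and one producer (the Send completion handler).

```c
#include <stdatomic.h>
#include <stddef.h>

#define SENDCTX_COUNT 4		/* hypothetical queue size */

struct sendctx {
	unsigned int unmap_count;	/* stand-in for SGEs awaiting DMA unmap */
};

struct sendctx_queue {
	struct sendctx ctxs[SENDCTX_COUNT];
	unsigned long head;		/* touched only by the consumer */
	_Atomic unsigned long tail;	/* written only by the producer */
};

/* Advance an index around the ring. */
static unsigned long sendctx_next(unsigned long item)
{
	return item < SENDCTX_COUNT - 1 ? item + 1 : 0;
}

static void sendctx_queue_init(struct sendctx_queue *q)
{
	size_t i;

	q->head = 0;
	atomic_init(&q->tail, 0);
	for (i = 0; i < SENDCTX_COUNT; i++)
		q->ctxs[i].unmap_count = 0;
}

/*
 * Consumer side (send path). Returns NULL when the queue is empty,
 * meaning every sendctx is still in use; the caller is expected to
 * wait and try again.
 */
static struct sendctx *sendctx_get(struct sendctx_queue *q)
{
	unsigned long next_head = sendctx_next(q->head);

	if (next_head == atomic_load_explicit(&q->tail, memory_order_acquire))
		return NULL;
	q->head = next_head;
	return &q->ctxs[next_head];
}

/*
 * Producer side (Send completion). Releases every sendctx between
 * the current tail and sc, inclusive, then publishes the new tail.
 */
static void sendctx_put(struct sendctx_queue *q, struct sendctx *sc)
{
	unsigned long next_tail =
		atomic_load_explicit(&q->tail, memory_order_relaxed);

	do {
		next_tail = sendctx_next(next_tail);
		q->ctxs[next_tail].unmap_count = 0;	/* "unmap" */
	} while (&q->ctxs[next_tail] != sc);

	atomic_store_explicit(&q->tail, next_tail, memory_order_release);
}
```

Because only the producer stores the tail and only the consumer stores the head, no lock is needed in this sketch; the release/acquire pair on the tail is what makes the producer's cleanup visible before a sendctx is handed out again.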
As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
suggested a mechanism that does not require signaling every Send.
We signal once every N Sends, and perform SGE unmapping of N Send
operations during that one completion.
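A minimal sketch of the "signal once every N Sends" idea follows. The structure and function names (send_batcher, send_needs_signal) are hypothetical and the sketch only models the counting; in a real transport the result would decide whether the posted Send WR requests a completion. Whether certain Sends must always be signaled is left out here.

```c
#include <stdbool.h>

struct send_batcher {
	unsigned int batch;	/* N: one signaled Send per N Sends */
	unsigned int count;	/* Sends remaining before the next signal */
};

static void send_batcher_init(struct send_batcher *b, unsigned int batch)
{
	b->batch = batch ? batch : 1;	/* guard against N == 0 */
	b->count = b->batch;
}

/* Returns true when the Send about to be posted should be signaled. */
static bool send_needs_signal(struct send_batcher *b)
{
	if (--b->count == 0) {
		b->count = b->batch;	/* start the next batch */
		return true;
	}
	return false;
}
```

With N = 3, every third call returns true; the completion for that Send is then the point at which resources for the whole batch can be released.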
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-20 22:48:12 +08:00
|
|
|
r_xprt->rx_stats.local_inv_needed,
|
2017-10-20 22:48:36 +08:00
|
|
|
r_xprt->rx_stats.empty_sendctx_q,
|
|
|
|
r_xprt->rx_stats.reply_waits_for_send);
|
2007-09-11 01:50:12 +08:00
|
|
|
}
|
|
|
|
|
2015-06-04 04:14:29 +08:00
|
|
|
static int
|
|
|
|
xprt_rdma_enable_swap(struct rpc_xprt *xprt)
|
|
|
|
{
|
2015-10-25 05:26:29 +08:00
|
|
|
return 0;
|
2015-06-04 04:14:29 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
xprt_rdma_disable_swap(struct rpc_xprt *xprt)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
2007-09-11 01:50:12 +08:00
|
|
|
/*
|
|
|
|
* Plumbing for rpc transport switch and kernel module
|
|
|
|
*/
|
|
|
|
|
2017-08-02 00:00:39 +08:00
|
|
|
static const struct rpc_xprt_ops xprt_rdma_procs = {
|
2014-05-28 22:34:57 +08:00
|
|
|
.reserve_xprt = xprt_reserve_xprt_cong,
|
2007-09-11 01:50:12 +08:00
|
|
|
.release_xprt = xprt_release_xprt_cong, /* sunrpc/xprt.c */
|
2012-09-07 23:08:50 +08:00
|
|
|
.alloc_slot = xprt_alloc_slot,
|
2007-09-11 01:50:12 +08:00
|
|
|
.release_request = xprt_release_rqst_cong, /* ditto */
|
|
|
|
.set_retrans_timeout = xprt_set_retrans_timeout_def, /* ditto */
|
xprtrdma: Detect unreachable NFS/RDMA servers more reliably
Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.
When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.
However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.
In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.
To address this, forcibly disconnect a transport when an RPC times
out. This prevents moribund connections from stopping the
detection of failover or other configuration changes on the server.
Note that even if the connection is still good, retransmitting
any RPC will trigger a disconnect thanks to this logic in
xprt_rdma_send_request:
/* Must suppress retransmit to maintain credits */
if (req->rl_connect_cookie == xprt->connect_cookie)
goto drop_connection;
req->rl_connect_cookie = xprt->connect_cookie;
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
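The retransmit check quoted in the commit message above can be modelled in isolation. This is a sketch with hypothetical types (xprt_sketch, req_sketch) and a hypothetical helper name; it assumes that a freshly allocated request carries a cookie that differs from the transport's current connect cookie, and that the transport's cookie changes on every reconnect.

```c
#include <stdbool.h>

struct xprt_sketch {
	unsigned int connect_cookie;	/* changes on each (re)connect */
};

struct req_sketch {
	unsigned int connect_cookie;	/* cookie seen at the last send */
};

/*
 * Returns true when the request was already sent on this connection
 * instance, so sending it again would be a retransmit and the
 * connection should be dropped instead. Otherwise records the
 * current cookie and lets the send proceed.
 */
static bool must_drop_connection(struct req_sketch *req,
				 const struct xprt_sketch *xprt)
{
	if (req->connect_cookie == xprt->connect_cookie)
		return true;
	req->connect_cookie = xprt->connect_cookie;
	return false;
}
```

The first send on a connection passes, a second send of the same request on that connection asks for a disconnect, and after a reconnect the request may be sent once more.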
2017-04-12 01:22:46 +08:00
|
|
|
.timer = xprt_rdma_timer,
|
2007-09-11 01:50:12 +08:00
|
|
|
.rpcbind = rpcb_getport_async, /* sunrpc/rpcb_clnt.c */
|
|
|
|
.set_port = xprt_rdma_set_port,
|
|
|
|
.connect = xprt_rdma_connect,
|
|
|
|
.buf_alloc = xprt_rdma_allocate,
|
|
|
|
.buf_free = xprt_rdma_free,
|
|
|
|
.send_request = xprt_rdma_send_request,
|
|
|
|
.close = xprt_rdma_close,
|
|
|
|
.destroy = xprt_rdma_destroy,
|
2015-06-04 04:14:29 +08:00
|
|
|
.print_stats = xprt_rdma_print_stats,
|
|
|
|
.enable_swap = xprt_rdma_enable_swap,
|
|
|
|
.disable_swap = xprt_rdma_disable_swap,
|
2015-10-25 05:27:43 +08:00
|
|
|
.inject_disconnect = xprt_rdma_inject_disconnect,
|
|
|
|
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
|
|
|
|
.bc_setup = xprt_rdma_bc_setup,
|
2015-10-25 05:28:32 +08:00
|
|
|
.bc_up = xprt_rdma_bc_up,
|
2016-05-03 02:40:40 +08:00
|
|
|
.bc_maxpayload = xprt_rdma_bc_maxpayload,
|
2015-10-25 05:27:43 +08:00
|
|
|
.bc_free_rqst = xprt_rdma_bc_free_rqst,
|
|
|
|
.bc_destroy = xprt_rdma_bc_destroy,
|
|
|
|
#endif
|
2007-09-11 01:50:12 +08:00
|
|
|
};
|
|
|
|
|
|
|
|
static struct xprt_class xprt_rdma = {
|
|
|
|
.list = LIST_HEAD_INIT(xprt_rdma.list),
|
|
|
|
.name = "rdma",
|
|
|
|
.owner = THIS_MODULE,
|
|
|
|
.ident = XPRT_TRANSPORT_RDMA,
|
|
|
|
.setup = xprt_setup_rdma,
|
|
|
|
};
|
|
|
|
|
2015-06-04 23:21:42 +08:00
|
|
|
void xprt_rdma_cleanup(void)
|
2007-09-11 01:50:12 +08:00
|
|
|
{
|
|
|
|
int rc;
|
|
|
|
|
2014-03-13 00:51:39 +08:00
|
|
|
dprintk("RPCRDMA Module Removed, deregister RPC RDMA transport\n");
|
2014-11-18 05:58:04 +08:00
|
|
|
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
|
2007-09-11 01:50:12 +08:00
|
|
|
if (sunrpc_table_header) {
|
|
|
|
unregister_sysctl_table(sunrpc_table_header);
|
|
|
|
sunrpc_table_header = NULL;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
rc = xprt_unregister_transport(&xprt_rdma);
|
|
|
|
if (rc)
|
|
|
|
dprintk("RPC: %s: xprt_unregister returned %i\n",
|
|
|
|
__func__, rc);
|
2015-05-26 23:52:25 +08:00
|
|
|
|
2015-10-25 05:27:10 +08:00
|
|
|
rpcrdma_destroy_wq();
|
2016-01-08 03:50:10 +08:00
|
|
|
|
|
|
|
rc = xprt_unregister_transport(&xprt_rdma_bc);
|
|
|
|
if (rc)
|
|
|
|
dprintk("RPC: %s: xprt_unregister(bc) returned %i\n",
|
|
|
|
__func__, rc);
|
2007-09-11 01:50:12 +08:00
|
|
|
}
|
|
|
|
|
2015-06-04 23:21:42 +08:00
|
|
|
int xprt_rdma_init(void)
|
2007-09-11 01:50:12 +08:00
|
|
|
{
|
|
|
|
int rc;
|
|
|
|
|
2015-10-25 05:27:10 +08:00
|
|
|
rc = rpcrdma_alloc_wq();
|
2016-06-30 01:52:54 +08:00
|
|
|
if (rc)
|
2015-10-25 05:27:10 +08:00
|
|
|
return rc;
|
|
|
|
|
2015-05-26 23:52:25 +08:00
|
|
|
rc = xprt_register_transport(&xprt_rdma);
|
|
|
|
if (rc) {
|
2015-10-25 05:27:10 +08:00
|
|
|
rpcrdma_destroy_wq();
|
2015-05-26 23:52:25 +08:00
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2016-01-08 03:50:10 +08:00
|
|
|
rc = xprt_register_transport(&xprt_rdma_bc);
|
|
|
|
if (rc) {
|
|
|
|
xprt_unregister_transport(&xprt_rdma);
|
|
|
|
rpcrdma_destroy_wq();
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2014-03-13 00:51:39 +08:00
|
|
|
dprintk("RPCRDMA Module Init, register RPC RDMA transport\n");
|
2007-09-11 01:50:12 +08:00
|
|
|
|
2014-03-13 00:51:39 +08:00
|
|
|
dprintk("Defaults:\n");
|
|
|
|
dprintk("\tSlots %d\n"
|
2007-09-11 01:50:12 +08:00
|
|
|
"\tMaxInlineRead %d\n\tMaxInlineWrite %d\n",
|
|
|
|
xprt_rdma_slot_table_entries,
|
|
|
|
xprt_rdma_max_inline_read, xprt_rdma_max_inline_write);
|
2014-03-13 00:51:39 +08:00
|
|
|
dprintk("\tPadding %d\n\tMemreg %d\n",
|
2007-09-11 01:50:12 +08:00
|
|
|
xprt_rdma_inline_write_padding, xprt_rdma_memreg_strategy);
|
|
|
|
|
2014-11-18 05:58:04 +08:00
|
|
|
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
|
2007-09-11 01:50:12 +08:00
|
|
|
if (!sunrpc_table_header)
|
|
|
|
sunrpc_table_header = register_sysctl_table(sunrpc_table);
|
|
|
|
#endif
|
|
|
|
return 0;
|
|
|
|
}
|