OpenCloudOS-Kernel/net/ceph/messenger.c

2219 lines
55 KiB
C
Raw Normal View History

License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
#include <linux/ceph/ceph_debug.h>
#include <linux/crc32c.h>
#include <linux/ctype.h>
#include <linux/highmem.h>
#include <linux/inet.h>
#include <linux/kthread.h>
#include <linux/net.h>
#include <linux/nsproxy.h>
libceph: force GFP_NOIO for socket allocations sock_alloc_inode() allocates socket+inode and socket_wq with GFP_KERNEL, which is not allowed on the writeback path: Workqueue: ceph-msgr con_work [libceph] ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00 ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148 Call Trace: [<ffffffff816dd629>] schedule+0x29/0x70 [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200 [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120 [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70 [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180 [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390 [<ffffffff81086335>] flush_work+0x165/0x250 [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0 [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs] [<ffffffff816d6b42>] ? __slab_free+0xee/0x234 [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs] [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30 [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs] [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs] [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40 [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs] [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs] [<ffffffff811c0c18>] super_cache_scan+0x178/0x180 [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340 [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450 [<ffffffff8115af70>] shrink_slab+0x100/0x140 [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490 [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0 [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40 [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110 [<ffffffff811a0ac5>] new_slab+0x2c5/0x390 [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0 [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0 [<ffffffff811d8566>] alloc_inode+0x26/0xa0 [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70 [<ffffffff815b933e>] sock_alloc+0x1e/0x80 [<ffffffff815ba855>] __sock_create+0x95/0x220 [<ffffffff815baa04>] sock_create_kern+0x24/0x30 [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph] [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd] [<ffffffff81084c19>] process_one_work+0x159/0x4f0 [<ffffffff8108561b>] worker_thread+0x11b/0x530 [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0 [<ffffffff8108b6f9>] kthread+0xc9/0xe0 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 [<ffffffff816e1b98>] ret_from_fork+0x58/0x90 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here. Cc: stable@vger.kernel.org # 3.10+, needs backporting Link: http://tracker.ceph.com/issues/19309 Reported-by: Sergey Jerusalimov <wintchester@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-03-21 20:44:28 +08:00
#include <linux/sched/mm.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
#include <linux/slab.h>
#include <linux/socket.h>
#include <linux/string.h>
#ifdef CONFIG_BLOCK
#include <linux/bio.h>
#endif /* CONFIG_BLOCK */
#include <linux/dns_resolver.h>
#include <net/tcp.h>
#include <trace/events/sock.h>
#include <linux/ceph/ceph_features.h>
#include <linux/ceph/libceph.h>
#include <linux/ceph/messenger.h>
#include <linux/ceph/decode.h>
#include <linux/ceph/pagelist.h>
#include <linux/export.h>
/*
* Ceph uses the messenger to exchange ceph_msg messages with other
* hosts in the system. The messenger provides ordered and reliable
* delivery. We tolerate TCP disconnects by reconnecting (with
* exponential backoff) in the case of a fault (disconnection, bad
* crc, protocol error). Acks allow sent messages to be discarded by
* the sender.
*/
/*
* We track the state of the socket on a given connection using
* values defined below. The transition to a new socket state is
* handled by a function which verifies we aren't coming from an
* unexpected state.
*
* --------
* | NEW* | transient initial state
* --------
* | con_sock_state_init()
* v
* ----------
* | CLOSED | initialized, but no socket (and no
* ---------- TCP connection)
* ^ \
* | \ con_sock_state_connecting()
* | ----------------------
* | \
* + con_sock_state_closed() \
* |+--------------------------- \
* | \ \ \
* | ----------- \ \
* | | CLOSING | socket event; \ \
* | ----------- await close \ \
* | ^ \ |
* | | \ |
* | + con_sock_state_closing() \ |
* | / \ | |
* | / --------------- | |
* | / \ v v
* | / --------------
* | / -----------------| CONNECTING | socket created, TCP
* | | / -------------- connect initiated
* | | | con_sock_state_connected()
* | | v
* -------------
* | CONNECTED | TCP connection established
* -------------
*
* State values for ceph_connection->sock_state; NEW is assumed to be 0.
*/
#define CON_SOCK_STATE_NEW 0 /* -> CLOSED */
#define CON_SOCK_STATE_CLOSED 1 /* -> CONNECTING */
#define CON_SOCK_STATE_CONNECTING 2 /* -> CONNECTED or -> CLOSING */
#define CON_SOCK_STATE_CONNECTED 3 /* -> CLOSING or -> CLOSED */
#define CON_SOCK_STATE_CLOSING 4 /* -> CLOSED */
static bool con_flag_valid(unsigned long con_flag)
{
switch (con_flag) {
case CEPH_CON_F_LOSSYTX:
case CEPH_CON_F_KEEPALIVE_PENDING:
case CEPH_CON_F_WRITE_PENDING:
case CEPH_CON_F_SOCK_CLOSED:
case CEPH_CON_F_BACKOFF:
return true;
default:
return false;
}
}
void ceph_con_flag_clear(struct ceph_connection *con, unsigned long con_flag)
{
BUG_ON(!con_flag_valid(con_flag));
clear_bit(con_flag, &con->flags);
}
void ceph_con_flag_set(struct ceph_connection *con, unsigned long con_flag)
{
BUG_ON(!con_flag_valid(con_flag));
set_bit(con_flag, &con->flags);
}
bool ceph_con_flag_test(struct ceph_connection *con, unsigned long con_flag)
{
BUG_ON(!con_flag_valid(con_flag));
return test_bit(con_flag, &con->flags);
}
bool ceph_con_flag_test_and_clear(struct ceph_connection *con,
unsigned long con_flag)
{
BUG_ON(!con_flag_valid(con_flag));
return test_and_clear_bit(con_flag, &con->flags);
}
bool ceph_con_flag_test_and_set(struct ceph_connection *con,
unsigned long con_flag)
{
BUG_ON(!con_flag_valid(con_flag));
return test_and_set_bit(con_flag, &con->flags);
}
/* Slab caches for frequently-allocated structures */
static struct kmem_cache *ceph_msg_cache;
#ifdef CONFIG_LOCKDEP
static struct lock_class_key socket_class;
#endif
static void queue_con(struct ceph_connection *con);
static void cancel_con(struct ceph_connection *con);
static void ceph_con_workfn(struct work_struct *);
static void con_fault(struct ceph_connection *con);
/*
* Nicely render a sockaddr as a string. An array of formatted
* strings is used, to approximate reentrancy.
*/
#define ADDR_STR_COUNT_LOG 5 /* log2(# address strings in array) */
#define ADDR_STR_COUNT (1 << ADDR_STR_COUNT_LOG)
#define ADDR_STR_COUNT_MASK (ADDR_STR_COUNT - 1)
#define MAX_ADDR_STR_LEN 64 /* 54 is enough */
static char addr_str[ADDR_STR_COUNT][MAX_ADDR_STR_LEN];
static atomic_t addr_str_seq = ATOMIC_INIT(0);
struct page *ceph_zero_page; /* used in certain error cases */
const char *ceph_pr_addr(const struct ceph_entity_addr *addr)
{
int i;
char *s;
struct sockaddr_storage ss = addr->in_addr; /* align */
struct sockaddr_in *in4 = (struct sockaddr_in *)&ss;
struct sockaddr_in6 *in6 = (struct sockaddr_in6 *)&ss;
i = atomic_inc_return(&addr_str_seq) & ADDR_STR_COUNT_MASK;
s = addr_str[i];
switch (ss.ss_family) {
case AF_INET:
snprintf(s, MAX_ADDR_STR_LEN, "(%d)%pI4:%hu",
le32_to_cpu(addr->type), &in4->sin_addr,
ntohs(in4->sin_port));
break;
case AF_INET6:
snprintf(s, MAX_ADDR_STR_LEN, "(%d)[%pI6c]:%hu",
le32_to_cpu(addr->type), &in6->sin6_addr,
ntohs(in6->sin6_port));
break;
default:
snprintf(s, MAX_ADDR_STR_LEN, "(unknown sockaddr family %hu)",
ss.ss_family);
}
return s;
}
EXPORT_SYMBOL(ceph_pr_addr);
void ceph_encode_my_addr(struct ceph_messenger *msgr)
{
if (!ceph_msgr2(from_msgr(msgr))) {
memcpy(&msgr->my_enc_addr, &msgr->inst.addr,
sizeof(msgr->my_enc_addr));
ceph_encode_banner_addr(&msgr->my_enc_addr);
}
}
/*
* work queue for all reading and writing to/from the socket.
*/
static struct workqueue_struct *ceph_msgr_wq;
static int ceph_msgr_slab_init(void)
{
BUG_ON(ceph_msg_cache);
ceph_msg_cache = KMEM_CACHE(ceph_msg, 0);
if (!ceph_msg_cache)
return -ENOMEM;
return 0;
}
static void ceph_msgr_slab_exit(void)
{
BUG_ON(!ceph_msg_cache);
kmem_cache_destroy(ceph_msg_cache);
ceph_msg_cache = NULL;
}
static void _ceph_msgr_exit(void)
{
if (ceph_msgr_wq) {
destroy_workqueue(ceph_msgr_wq);
ceph_msgr_wq = NULL;
}
BUG_ON(!ceph_zero_page);
put_page(ceph_zero_page);
ceph_zero_page = NULL;
ceph_msgr_slab_exit();
}
int __init ceph_msgr_init(void)
{
if (ceph_msgr_slab_init())
return -ENOMEM;
BUG_ON(ceph_zero_page);
ceph_zero_page = ZERO_PAGE(0);
get_page(ceph_zero_page);
/*
* The number of active work items is limited by the number of
* connections, so leave @max_active at default.
*/
ceph_msgr_wq = alloc_workqueue("ceph-msgr", WQ_MEM_RECLAIM, 0);
if (ceph_msgr_wq)
return 0;
pr_err("msgr_init failed to create workqueue\n");
_ceph_msgr_exit();
return -ENOMEM;
}
void ceph_msgr_exit(void)
{
BUG_ON(ceph_msgr_wq == NULL);
_ceph_msgr_exit();
}
void ceph_msgr_flush(void)
{
flush_workqueue(ceph_msgr_wq);
}
EXPORT_SYMBOL(ceph_msgr_flush);
/* Connection socket state transition functions */
static void con_sock_state_init(struct ceph_connection *con)
{
int old_state;
old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CLOSED);
if (WARN_ON(old_state != CON_SOCK_STATE_NEW))
printk("%s: unexpected old state %d\n", __func__, old_state);
dout("%s con %p sock %d -> %d\n", __func__, con, old_state,
CON_SOCK_STATE_CLOSED);
}
static void con_sock_state_connecting(struct ceph_connection *con)
{
int old_state;
old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CONNECTING);
if (WARN_ON(old_state != CON_SOCK_STATE_CLOSED))
printk("%s: unexpected old state %d\n", __func__, old_state);
dout("%s con %p sock %d -> %d\n", __func__, con, old_state,
CON_SOCK_STATE_CONNECTING);
}
static void con_sock_state_connected(struct ceph_connection *con)
{
int old_state;
old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CONNECTED);
if (WARN_ON(old_state != CON_SOCK_STATE_CONNECTING))
printk("%s: unexpected old state %d\n", __func__, old_state);
dout("%s con %p sock %d -> %d\n", __func__, con, old_state,
CON_SOCK_STATE_CONNECTED);
}
static void con_sock_state_closing(struct ceph_connection *con)
{
int old_state;
old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CLOSING);
if (WARN_ON(old_state != CON_SOCK_STATE_CONNECTING &&
old_state != CON_SOCK_STATE_CONNECTED &&
old_state != CON_SOCK_STATE_CLOSING))
printk("%s: unexpected old state %d\n", __func__, old_state);
dout("%s con %p sock %d -> %d\n", __func__, con, old_state,
CON_SOCK_STATE_CLOSING);
}
static void con_sock_state_closed(struct ceph_connection *con)
{
int old_state;
old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CLOSED);
if (WARN_ON(old_state != CON_SOCK_STATE_CONNECTED &&
old_state != CON_SOCK_STATE_CLOSING &&
old_state != CON_SOCK_STATE_CONNECTING &&
old_state != CON_SOCK_STATE_CLOSED))
printk("%s: unexpected old state %d\n", __func__, old_state);
dout("%s con %p sock %d -> %d\n", __func__, con, old_state,
CON_SOCK_STATE_CLOSED);
}
/*
* socket callback functions
*/
/* data available on socket, or listen socket received a connect */
static void ceph_sock_data_ready(struct sock *sk)
{
struct ceph_connection *con = sk->sk_user_data;
trace_sk_data_ready(sk);
if (atomic_read(&con->msgr->stopping)) {
return;
}
if (sk->sk_state != TCP_CLOSE_WAIT) {
dout("%s %p state = %d, queueing work\n", __func__,
con, con->state);
queue_con(con);
}
}
/* socket has buffer space for writing */
static void ceph_sock_write_space(struct sock *sk)
{
struct ceph_connection *con = sk->sk_user_data;
/* only queue to workqueue if there is data we want to write,
* and there is sufficient space in the socket buffer to accept
* more data. clear SOCK_NOSPACE so that ceph_sock_write_space()
* doesn't get called again until try_write() fills the socket
* buffer. See net/ipv4/tcp_input.c:tcp_check_space()
* and net/core/stream.c:sk_stream_write_space().
*/
if (ceph_con_flag_test(con, CEPH_CON_F_WRITE_PENDING)) {
if (sk_stream_is_writeable(sk)) {
dout("%s %p queueing write work\n", __func__, con);
clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
queue_con(con);
}
} else {
dout("%s %p nothing to write\n", __func__, con);
}
}
/* socket's state has changed */
static void ceph_sock_state_change(struct sock *sk)
{
struct ceph_connection *con = sk->sk_user_data;
dout("%s %p state = %d sk_state = %u\n", __func__,
con, con->state, sk->sk_state);
switch (sk->sk_state) {
case TCP_CLOSE:
dout("%s TCP_CLOSE\n", __func__);
fallthrough;
case TCP_CLOSE_WAIT:
dout("%s TCP_CLOSE_WAIT\n", __func__);
con_sock_state_closing(con);
ceph_con_flag_set(con, CEPH_CON_F_SOCK_CLOSED);
queue_con(con);
break;
case TCP_ESTABLISHED:
dout("%s TCP_ESTABLISHED\n", __func__);
con_sock_state_connected(con);
queue_con(con);
break;
default: /* Everything else is uninteresting */
break;
}
}
/*
* set up socket callbacks
*/
static void set_sock_callbacks(struct socket *sock,
struct ceph_connection *con)
{
struct sock *sk = sock->sk;
sk->sk_user_data = con;
sk->sk_data_ready = ceph_sock_data_ready;
sk->sk_write_space = ceph_sock_write_space;
sk->sk_state_change = ceph_sock_state_change;
}
/*
* socket helpers
*/
/*
* initiate connection to a remote socket.
*/
int ceph_tcp_connect(struct ceph_connection *con)
{
struct sockaddr_storage ss = con->peer_addr.in_addr; /* align */
struct socket *sock;
libceph: force GFP_NOIO for socket allocations sock_alloc_inode() allocates socket+inode and socket_wq with GFP_KERNEL, which is not allowed on the writeback path: Workqueue: ceph-msgr con_work [libceph] ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00 ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148 Call Trace: [<ffffffff816dd629>] schedule+0x29/0x70 [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200 [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120 [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70 [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180 [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390 [<ffffffff81086335>] flush_work+0x165/0x250 [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0 [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs] [<ffffffff816d6b42>] ? __slab_free+0xee/0x234 [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs] [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30 [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs] [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs] [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40 [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs] [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs] [<ffffffff811c0c18>] super_cache_scan+0x178/0x180 [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340 [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450 [<ffffffff8115af70>] shrink_slab+0x100/0x140 [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490 [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0 [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40 [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110 [<ffffffff811a0ac5>] new_slab+0x2c5/0x390 [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0 [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0 [<ffffffff811d8566>] alloc_inode+0x26/0xa0 [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70 [<ffffffff815b933e>] sock_alloc+0x1e/0x80 [<ffffffff815ba855>] __sock_create+0x95/0x220 [<ffffffff815baa04>] sock_create_kern+0x24/0x30 [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph] [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd] [<ffffffff81084c19>] process_one_work+0x159/0x4f0 [<ffffffff8108561b>] worker_thread+0x11b/0x530 [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0 [<ffffffff8108b6f9>] kthread+0xc9/0xe0 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 [<ffffffff816e1b98>] ret_from_fork+0x58/0x90 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here. Cc: stable@vger.kernel.org # 3.10+, needs backporting Link: http://tracker.ceph.com/issues/19309 Reported-by: Sergey Jerusalimov <wintchester@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-03-21 20:44:28 +08:00
unsigned int noio_flag;
int ret;
dout("%s con %p peer_addr %s\n", __func__, con,
ceph_pr_addr(&con->peer_addr));
BUG_ON(con->sock);
libceph: force GFP_NOIO for socket allocations sock_alloc_inode() allocates socket+inode and socket_wq with GFP_KERNEL, which is not allowed on the writeback path: Workqueue: ceph-msgr con_work [libceph] ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00 ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148 Call Trace: [<ffffffff816dd629>] schedule+0x29/0x70 [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200 [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120 [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70 [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180 [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390 [<ffffffff81086335>] flush_work+0x165/0x250 [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0 [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs] [<ffffffff816d6b42>] ? __slab_free+0xee/0x234 [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs] [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30 [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs] [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs] [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40 [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs] [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs] [<ffffffff811c0c18>] super_cache_scan+0x178/0x180 [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340 [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450 [<ffffffff8115af70>] shrink_slab+0x100/0x140 [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490 [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0 [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40 [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110 [<ffffffff811a0ac5>] new_slab+0x2c5/0x390 [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0 [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0 [<ffffffff811d8566>] alloc_inode+0x26/0xa0 [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70 [<ffffffff815b933e>] sock_alloc+0x1e/0x80 [<ffffffff815ba855>] __sock_create+0x95/0x220 [<ffffffff815baa04>] sock_create_kern+0x24/0x30 [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph] [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd] [<ffffffff81084c19>] process_one_work+0x159/0x4f0 [<ffffffff8108561b>] worker_thread+0x11b/0x530 [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0 [<ffffffff8108b6f9>] kthread+0xc9/0xe0 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 [<ffffffff816e1b98>] ret_from_fork+0x58/0x90 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here. Cc: stable@vger.kernel.org # 3.10+, needs backporting Link: http://tracker.ceph.com/issues/19309 Reported-by: Sergey Jerusalimov <wintchester@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-03-21 20:44:28 +08:00
/* sock_create_kern() allocates with GFP_KERNEL */
noio_flag = memalloc_noio_save();
ret = sock_create_kern(read_pnet(&con->msgr->net), ss.ss_family,
SOCK_STREAM, IPPROTO_TCP, &sock);
libceph: force GFP_NOIO for socket allocations sock_alloc_inode() allocates socket+inode and socket_wq with GFP_KERNEL, which is not allowed on the writeback path: Workqueue: ceph-msgr con_work [libceph] ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000 0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00 ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148 Call Trace: [<ffffffff816dd629>] schedule+0x29/0x70 [<ffffffff816e066d>] schedule_timeout+0x1bd/0x200 [<ffffffff81093ffc>] ? ttwu_do_wakeup+0x2c/0x120 [<ffffffff81094266>] ? ttwu_do_activate.constprop.135+0x66/0x70 [<ffffffff816deb5f>] wait_for_completion+0xbf/0x180 [<ffffffff81097cd0>] ? try_to_wake_up+0x390/0x390 [<ffffffff81086335>] flush_work+0x165/0x250 [<ffffffff81082940>] ? worker_detach_from_pool+0xd0/0xd0 [<ffffffffa03b65b1>] xlog_cil_force_lsn+0x81/0x200 [xfs] [<ffffffff816d6b42>] ? __slab_free+0xee/0x234 [<ffffffffa03b4b1d>] _xfs_log_force_lsn+0x4d/0x2c0 [xfs] [<ffffffff811adc1e>] ? lookup_page_cgroup_used+0xe/0x30 [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03b4dcf>] xfs_log_force_lsn+0x3f/0xf0 [xfs] [<ffffffffa039a723>] ? xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa03a62c6>] xfs_iunpin_wait+0xc6/0x1a0 [xfs] [<ffffffff810aa250>] ? wake_atomic_t_function+0x40/0x40 [<ffffffffa039a723>] xfs_reclaim_inode+0xa3/0x330 [xfs] [<ffffffffa039ac07>] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs] [<ffffffffa039bb13>] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] [<ffffffffa03ab745>] xfs_fs_free_cached_objects+0x15/0x20 [xfs] [<ffffffff811c0c18>] super_cache_scan+0x178/0x180 [<ffffffff8115912e>] shrink_slab_node+0x14e/0x340 [<ffffffff811afc3b>] ? mem_cgroup_iter+0x16b/0x450 [<ffffffff8115af70>] shrink_slab+0x100/0x140 [<ffffffff8115e425>] do_try_to_free_pages+0x335/0x490 [<ffffffff8115e7f9>] try_to_free_pages+0xb9/0x1f0 [<ffffffff816d56e4>] ? __alloc_pages_direct_compact+0x69/0x1be [<ffffffff81150cba>] __alloc_pages_nodemask+0x69a/0xb40 [<ffffffff8119743e>] alloc_pages_current+0x9e/0x110 [<ffffffff811a0ac5>] new_slab+0x2c5/0x390 [<ffffffff816d71c4>] __slab_alloc+0x33b/0x459 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff8164bda1>] ? inet_sendmsg+0x71/0xc0 [<ffffffff815b906d>] ? sock_alloc_inode+0x2d/0xd0 [<ffffffff811a21f2>] kmem_cache_alloc+0x1a2/0x1b0 [<ffffffff815b906d>] sock_alloc_inode+0x2d/0xd0 [<ffffffff811d8566>] alloc_inode+0x26/0xa0 [<ffffffff811da04a>] new_inode_pseudo+0x1a/0x70 [<ffffffff815b933e>] sock_alloc+0x1e/0x80 [<ffffffff815ba855>] __sock_create+0x95/0x220 [<ffffffff815baa04>] sock_create_kern+0x24/0x30 [<ffffffffa04794d9>] con_work+0xef9/0x2050 [libceph] [<ffffffffa04aa9ec>] ? rbd_img_request_submit+0x4c/0x60 [rbd] [<ffffffff81084c19>] process_one_work+0x159/0x4f0 [<ffffffff8108561b>] worker_thread+0x11b/0x530 [<ffffffff81085500>] ? create_worker+0x1d0/0x1d0 [<ffffffff8108b6f9>] kthread+0xc9/0xe0 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 [<ffffffff816e1b98>] ret_from_fork+0x58/0x90 [<ffffffff8108b630>] ? flush_kthread_worker+0x90/0x90 Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here. Cc: stable@vger.kernel.org # 3.10+, needs backporting Link: http://tracker.ceph.com/issues/19309 Reported-by: Sergey Jerusalimov <wintchester@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-03-21 20:44:28 +08:00
memalloc_noio_restore(noio_flag);
if (ret)
return ret;
Revert "libceph: use memalloc flags for net IO" This reverts commit 89baaa570ab0b476db09408d209578cfed700e9f. Dirty page throttling should be sufficient for us in the general case so there is no need to use __GFP_MEMALLOC - it would be needed only in the swap-over-rbd case, which we currently don't support. (It would probably take approximately the commit that is being reverted to add that support, but we would also need the "swap" option to distinguish from the general case and make sure swap ceph_client-s aren't shared with anything else.) See ceph-devel threads [1] and [2] for the details of why enabling pfmemalloc reserves for all cases is a bad thing. On top of potential system lockups related to drained emergency reserves, this turned out to cause ceph lockups in case peers are on the same host and communicating via loopback due to sk_filter() dropping pfmemalloc skbs on the receiving side because the receiving loopback socket is not tagged with SOCK_MEMALLOC. [1] "SOCK_MEMALLOC vs loopback" http://www.spinics.net/lists/ceph-devel/msg22998.html [2] "[PATCH] libceph: don't set memalloc flags in loopback case" http://www.spinics.net/lists/ceph-devel/msg23392.html Conflicts: net/ceph/messenger.c [ context: tcp_nodelay option ] Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Mel Gorman <mgorman@suse.de> Cc: Sage Weil <sage@redhat.com> Cc: stable@vger.kernel.org # 3.18+, needs backporting Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Acked-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Mel Gorman <mgorman@suse.de>
2015-04-02 19:40:58 +08:00
sock->sk->sk_allocation = GFP_NOFS;
Treewide: Stop corrupting socket's task_frag Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the GFP_NOIO flag on sk_allocation which the networking system uses to decide when it is safe to use current->task_frag. The results of this are unexpected corruption in task_frag when SUNRPC is involved in memory reclaim. The corruption can be seen in crashes, but the root cause is often difficult to ascertain as a crashing machine's stack trace will have no evidence of being near NFS or SUNRPC code. I believe this problem to be much more pervasive than reports to the community may indicate. Fix this by having kernel users of sockets that may corrupt task_frag due to reclaim set sk_use_task_frag = false. Preemptively correcting this situation for users that still set sk_allocation allows them to convert to memalloc_nofs_save/restore without the same unexpected corruptions that are sure to follow, unlikely to show up in testing, and difficult to bisect. CC: Philipp Reisner <philipp.reisner@linbit.com> CC: Lars Ellenberg <lars.ellenberg@linbit.com> CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com> CC: Jens Axboe <axboe@kernel.dk> CC: Josef Bacik <josef@toxicpanda.com> CC: Keith Busch <kbusch@kernel.org> CC: Christoph Hellwig <hch@lst.de> CC: Sagi Grimberg <sagi@grimberg.me> CC: Lee Duncan <lduncan@suse.com> CC: Chris Leech <cleech@redhat.com> CC: Mike Christie <michael.christie@oracle.com> CC: "James E.J. Bottomley" <jejb@linux.ibm.com> CC: "Martin K. Petersen" <martin.petersen@oracle.com> CC: Valentina Manea <valentina.manea.m@gmail.com> CC: Shuah Khan <shuah@kernel.org> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: David Howells <dhowells@redhat.com> CC: Marc Dionne <marc.dionne@auristor.com> CC: Steve French <sfrench@samba.org> CC: Christine Caulfield <ccaulfie@redhat.com> CC: David Teigland <teigland@redhat.com> CC: Mark Fasheh <mark@fasheh.com> CC: Joel Becker <jlbec@evilplan.org> CC: Joseph Qi <joseph.qi@linux.alibaba.com> CC: Eric Van Hensbergen <ericvh@gmail.com> CC: Latchesar Ionkov <lucho@ionkov.net> CC: Dominique Martinet <asmadeus@codewreck.org> CC: Ilya Dryomov <idryomov@gmail.com> CC: Xiubo Li <xiubli@redhat.com> CC: Chuck Lever <chuck.lever@oracle.com> CC: Jeff Layton <jlayton@kernel.org> CC: Trond Myklebust <trond.myklebust@hammerspace.com> CC: Anna Schumaker <anna@kernel.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au> Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-16 20:45:27 +08:00
sock->sk->sk_use_task_frag = false;
#ifdef CONFIG_LOCKDEP
lockdep_set_class(&sock->sk->sk_lock, &socket_class);
#endif
set_sock_callbacks(sock, con);
con_sock_state_connecting(con);
ret = kernel_connect(sock, (struct sockaddr *)&ss, sizeof(ss),
O_NONBLOCK);
if (ret == -EINPROGRESS) {
dout("connect %s EINPROGRESS sk_state = %u\n",
ceph_pr_addr(&con->peer_addr),
sock->sk->sk_state);
} else if (ret < 0) {
pr_err("connect %s error %d\n",
ceph_pr_addr(&con->peer_addr), ret);
sock_release(sock);
return ret;
}
if (ceph_test_opt(from_msgr(con->msgr), TCP_NODELAY))
tcp_sock_set_nodelay(sock->sk);
con->sock = sock;
return 0;
}
/*
* Shutdown/close the socket for the given connection.
*/
int ceph_con_close_socket(struct ceph_connection *con)
{
int rc = 0;
dout("%s con %p sock %p\n", __func__, con, con->sock);
if (con->sock) {
rc = con->sock->ops->shutdown(con->sock, SHUT_RDWR);
sock_release(con->sock);
con->sock = NULL;
}
/*
* Forcibly clear the SOCK_CLOSED flag. It gets set
* independent of the connection mutex, and we could have
* received a socket close event before we had the chance to
* shut the socket down.
*/
ceph_con_flag_clear(con, CEPH_CON_F_SOCK_CLOSED);
con_sock_state_closed(con);
return rc;
}
static void ceph_con_reset_protocol(struct ceph_connection *con)
{
dout("%s con %p\n", __func__, con);
ceph_con_close_socket(con);
if (con->in_msg) {
WARN_ON(con->in_msg->con != con);
ceph_msg_put(con->in_msg);
con->in_msg = NULL;
}
if (con->out_msg) {
WARN_ON(con->out_msg->con != con);
ceph_msg_put(con->out_msg);
con->out_msg = NULL;
}
if (con->bounce_page) {
__free_page(con->bounce_page);
con->bounce_page = NULL;
}
if (ceph_msgr2(from_msgr(con->msgr)))
ceph_con_v2_reset_protocol(con);
else
ceph_con_v1_reset_protocol(con);
}
/*
* Reset a connection. Discard all incoming and outgoing messages
* and clear *_seq state.
*/
static void ceph_msg_remove(struct ceph_msg *msg)
{
list_del_init(&msg->list_head);
ceph_msg_put(msg);
}
static void ceph_msg_remove_list(struct list_head *head)
{
while (!list_empty(head)) {
struct ceph_msg *msg = list_first_entry(head, struct ceph_msg,
list_head);
ceph_msg_remove(msg);
}
}
void ceph_con_reset_session(struct ceph_connection *con)
{
dout("%s con %p\n", __func__, con);
WARN_ON(con->in_msg);
WARN_ON(con->out_msg);
ceph_msg_remove_list(&con->out_queue);
ceph_msg_remove_list(&con->out_sent);
con->out_seq = 0;
con->in_seq = 0;
con->in_seq_acked = 0;
if (ceph_msgr2(from_msgr(con->msgr)))
ceph_con_v2_reset_session(con);
else
ceph_con_v1_reset_session(con);
}
/*
* mark a peer down. drop any open connections.
*/
void ceph_con_close(struct ceph_connection *con)
{
mutex_lock(&con->mutex);
dout("con_close %p peer %s\n", con, ceph_pr_addr(&con->peer_addr));
con->state = CEPH_CON_S_CLOSED;
ceph_con_flag_clear(con, CEPH_CON_F_LOSSYTX); /* so we retry next
connect */
ceph_con_flag_clear(con, CEPH_CON_F_KEEPALIVE_PENDING);
ceph_con_flag_clear(con, CEPH_CON_F_WRITE_PENDING);
ceph_con_flag_clear(con, CEPH_CON_F_BACKOFF);
ceph_con_reset_protocol(con);
ceph_con_reset_session(con);
cancel_con(con);
mutex_unlock(&con->mutex);
}
EXPORT_SYMBOL(ceph_con_close);
/*
* Reopen a closed connection, with a new peer address.
*/
void ceph_con_open(struct ceph_connection *con,
__u8 entity_type, __u64 entity_num,
struct ceph_entity_addr *addr)
{
mutex_lock(&con->mutex);
dout("con_open %p %s\n", con, ceph_pr_addr(addr));
WARN_ON(con->state != CEPH_CON_S_CLOSED);
con->state = CEPH_CON_S_PREOPEN;
con->peer_name.type = (__u8) entity_type;
con->peer_name.num = cpu_to_le64(entity_num);
memcpy(&con->peer_addr, addr, sizeof(*addr));
con->delay = 0; /* reset backoff memory */
mutex_unlock(&con->mutex);
queue_con(con);
}
EXPORT_SYMBOL(ceph_con_open);
/*
* return true if this connection ever successfully opened
*/
bool ceph_con_opened(struct ceph_connection *con)
{
if (ceph_msgr2(from_msgr(con->msgr)))
return ceph_con_v2_opened(con);
return ceph_con_v1_opened(con);
}
/*
* initialize a new connection.
*/
void ceph_con_init(struct ceph_connection *con, void *private,
const struct ceph_connection_operations *ops,
struct ceph_messenger *msgr)
{
dout("con_init %p\n", con);
memset(con, 0, sizeof(*con));
con->private = private;
con->ops = ops;
con->msgr = msgr;
con_sock_state_init(con);
mutex_init(&con->mutex);
INIT_LIST_HEAD(&con->out_queue);
INIT_LIST_HEAD(&con->out_sent);
INIT_DELAYED_WORK(&con->work, ceph_con_workfn);
con->state = CEPH_CON_S_CLOSED;
}
EXPORT_SYMBOL(ceph_con_init);
/*
* We maintain a global counter to order connection attempts. Get
* a unique seq greater than @gt.
*/
u32 ceph_get_global_seq(struct ceph_messenger *msgr, u32 gt)
{
u32 ret;
spin_lock(&msgr->global_seq_lock);
if (msgr->global_seq < gt)
msgr->global_seq = gt;
ret = ++msgr->global_seq;
spin_unlock(&msgr->global_seq_lock);
return ret;
}
/*
* Discard messages that have been acked by the server.
*/
void ceph_con_discard_sent(struct ceph_connection *con, u64 ack_seq)
{
struct ceph_msg *msg;
u64 seq;
dout("%s con %p ack_seq %llu\n", __func__, con, ack_seq);
while (!list_empty(&con->out_sent)) {
msg = list_first_entry(&con->out_sent, struct ceph_msg,
list_head);
WARN_ON(msg->needs_out_seq);
seq = le64_to_cpu(msg->hdr.seq);
if (seq > ack_seq)
break;
dout("%s con %p discarding msg %p seq %llu\n", __func__, con,
msg, seq);
ceph_msg_remove(msg);
}
}
/*
* Discard messages that have been requeued in con_fault(), up to
* reconnect_seq. This avoids gratuitously resending messages that
* the server had received and handled prior to reconnect.
*/
void ceph_con_discard_requeued(struct ceph_connection *con, u64 reconnect_seq)
{
struct ceph_msg *msg;
u64 seq;
dout("%s con %p reconnect_seq %llu\n", __func__, con, reconnect_seq);
while (!list_empty(&con->out_queue)) {
msg = list_first_entry(&con->out_queue, struct ceph_msg,
list_head);
if (msg->needs_out_seq)
break;
seq = le64_to_cpu(msg->hdr.seq);
if (seq > reconnect_seq)
break;
dout("%s con %p discarding msg %p seq %llu\n", __func__, con,
msg, seq);
ceph_msg_remove(msg);
}
}
#ifdef CONFIG_BLOCK
/*
* For a bio data item, a piece is whatever remains of the next
* entry in the current bio iovec, or the first entry in the next
* bio in the list.
*/
static void ceph_msg_data_bio_cursor_init(struct ceph_msg_data_cursor *cursor,
size_t length)
{
struct ceph_msg_data *data = cursor->data;
struct ceph_bio_iter *it = &cursor->bio_iter;
cursor->resid = min_t(size_t, length, data->bio_length);
*it = data->bio_pos;
if (cursor->resid < it->iter.bi_size)
it->iter.bi_size = cursor->resid;
BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter));
}
static struct page *ceph_msg_data_bio_next(struct ceph_msg_data_cursor *cursor,
size_t *page_offset,
size_t *length)
{
struct bio_vec bv = bio_iter_iovec(cursor->bio_iter.bio,
cursor->bio_iter.iter);
*page_offset = bv.bv_offset;
*length = bv.bv_len;
return bv.bv_page;
}
static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
size_t bytes)
{
struct ceph_bio_iter *it = &cursor->bio_iter;
struct page *page = bio_iter_page(it->bio, it->iter);
BUG_ON(bytes > cursor->resid);
BUG_ON(bytes > bio_iter_len(it->bio, it->iter));
cursor->resid -= bytes;
bio_advance_iter(it->bio, &it->iter, bytes);
if (!cursor->resid)
return false; /* no more data */
if (!bytes || (it->iter.bi_size && it->iter.bi_bvec_done &&
page == bio_iter_page(it->bio, it->iter)))
return false; /* more bytes to process in this segment */
if (!it->iter.bi_size) {
it->bio = it->bio->bi_next;
it->iter = it->bio->bi_iter;
if (cursor->resid < it->iter.bi_size)
it->iter.bi_size = cursor->resid;
}
BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter));
return true;
}
#endif /* CONFIG_BLOCK */
static void ceph_msg_data_bvecs_cursor_init(struct ceph_msg_data_cursor *cursor,
size_t length)
{
struct ceph_msg_data *data = cursor->data;
struct bio_vec *bvecs = data->bvec_pos.bvecs;
cursor->resid = min_t(size_t, length, data->bvec_pos.iter.bi_size);
cursor->bvec_iter = data->bvec_pos.iter;
cursor->bvec_iter.bi_size = cursor->resid;
BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter));
}
static struct page *ceph_msg_data_bvecs_next(struct ceph_msg_data_cursor *cursor,
size_t *page_offset,
size_t *length)
{
struct bio_vec bv = bvec_iter_bvec(cursor->data->bvec_pos.bvecs,
cursor->bvec_iter);
*page_offset = bv.bv_offset;
*length = bv.bv_len;
return bv.bv_page;
}
static bool ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
size_t bytes)
{
struct bio_vec *bvecs = cursor->data->bvec_pos.bvecs;
struct page *page = bvec_iter_page(bvecs, cursor->bvec_iter);
BUG_ON(bytes > cursor->resid);
BUG_ON(bytes > bvec_iter_len(bvecs, cursor->bvec_iter));
cursor->resid -= bytes;
bvec_iter_advance(bvecs, &cursor->bvec_iter, bytes);
if (!cursor->resid)
return false; /* no more data */
if (!bytes || (cursor->bvec_iter.bi_bvec_done &&
page == bvec_iter_page(bvecs, cursor->bvec_iter)))
return false; /* more bytes to process in this segment */
BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter));
return true;
}
/*
* For a page array, a piece comes from the first page in the array
* that has not already been fully consumed.
*/
static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor,
size_t length)
{
struct ceph_msg_data *data = cursor->data;
int page_count;
BUG_ON(data->type != CEPH_MSG_DATA_PAGES);
BUG_ON(!data->pages);
BUG_ON(!data->length);
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
cursor->resid = min(length, data->length);
page_count = calc_pages_for(data->alignment, (u64)data->length);
cursor->page_offset = data->alignment & ~PAGE_MASK;
cursor->page_index = 0;
BUG_ON(page_count > (int)USHRT_MAX);
cursor->page_count = (unsigned short)page_count;
BUG_ON(length > SIZE_MAX - cursor->page_offset);
}
static struct page *
ceph_msg_data_pages_next(struct ceph_msg_data_cursor *cursor,
size_t *page_offset, size_t *length)
{
struct ceph_msg_data *data = cursor->data;
BUG_ON(data->type != CEPH_MSG_DATA_PAGES);
BUG_ON(cursor->page_index >= cursor->page_count);
BUG_ON(cursor->page_offset >= PAGE_SIZE);
*page_offset = cursor->page_offset;
*length = min_t(size_t, cursor->resid, PAGE_SIZE - *page_offset);
return data->pages[cursor->page_index];
}
static bool ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
size_t bytes)
{
BUG_ON(cursor->data->type != CEPH_MSG_DATA_PAGES);
BUG_ON(cursor->page_offset + bytes > PAGE_SIZE);
/* Advance the cursor page offset */
cursor->resid -= bytes;
cursor->page_offset = (cursor->page_offset + bytes) & ~PAGE_MASK;
if (!bytes || cursor->page_offset)
return false; /* more bytes to process in the current page */
if (!cursor->resid)
return false; /* no more data */
/* Move on to the next page; offset is already at 0 */
BUG_ON(cursor->page_index >= cursor->page_count);
cursor->page_index++;
return true;
}
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
/*
* For a pagelist, a piece is whatever remains to be consumed in the
* first page in the list, or the front of the next page.
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
*/
static void
ceph_msg_data_pagelist_cursor_init(struct ceph_msg_data_cursor *cursor,
size_t length)
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
{
struct ceph_msg_data *data = cursor->data;
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
struct ceph_pagelist *pagelist;
struct page *page;
BUG_ON(data->type != CEPH_MSG_DATA_PAGELIST);
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
pagelist = data->pagelist;
BUG_ON(!pagelist);
if (!length)
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
return; /* pagelist can be assigned but empty */
BUG_ON(list_empty(&pagelist->head));
page = list_first_entry(&pagelist->head, struct page, lru);
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
cursor->resid = min(length, pagelist->length);
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
cursor->page = page;
cursor->offset = 0;
}
static struct page *
ceph_msg_data_pagelist_next(struct ceph_msg_data_cursor *cursor,
size_t *page_offset, size_t *length)
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
{
struct ceph_msg_data *data = cursor->data;
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
struct ceph_pagelist *pagelist;
BUG_ON(data->type != CEPH_MSG_DATA_PAGELIST);
pagelist = data->pagelist;
BUG_ON(!pagelist);
BUG_ON(!cursor->page);
BUG_ON(cursor->offset + cursor->resid != pagelist->length);
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
/* offset of first page in pagelist is always 0 */
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
*page_offset = cursor->offset & ~PAGE_MASK;
*length = min_t(size_t, cursor->resid, PAGE_SIZE - *page_offset);
return cursor->page;
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
}
static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
size_t bytes)
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
{
struct ceph_msg_data *data = cursor->data;
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
struct ceph_pagelist *pagelist;
BUG_ON(data->type != CEPH_MSG_DATA_PAGELIST);
pagelist = data->pagelist;
BUG_ON(!pagelist);
BUG_ON(cursor->offset + cursor->resid != pagelist->length);
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
BUG_ON((cursor->offset & ~PAGE_MASK) + bytes > PAGE_SIZE);
/* Advance the cursor offset */
cursor->resid -= bytes;
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
cursor->offset += bytes;
/* offset of first page in pagelist is always 0 */
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
if (!bytes || cursor->offset & ~PAGE_MASK)
return false; /* more bytes to process in the current page */
if (!cursor->resid)
return false; /* no more data */
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
/* Move on to the next page */
BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
cursor->page = list_next_entry(cursor->page, lru);
libceph: start defining message data cursor This patch lays out the foundation for using generic routines to manage processing items of message data. For simplicity, we'll start with just the trail portion of a message, because it stands alone and is only present for outgoing data. First some basic concepts. We'll use the term "data item" to represent one of the ceph_msg_data structures associated with a message. There are currently four of those, with single-letter field names p, l, b, and t. A data item is further broken into "pieces" which always lie in a single page. A data item will include a "cursor" that will track state as the memory defined by the item is consumed by sending data from or receiving data into it. We define three routines to manipulate a data item's cursor: the "init" routine; the "next" routine; and the "advance" routine. The "init" routine initializes the cursor so it points at the beginning of the first piece in the item. The "next" routine returns the page, page offset, and length (limited by both the page and item size) of the next unconsumed piece in the item. It also indicates to the caller whether the piece being returned is the last one in the data item. The "advance" routine consumes the requested number of bytes in the item (advancing the cursor). This is used to record the number of bytes from the current piece that were actually sent or received by the network code. It returns an indication of whether the result means the current piece has been fully consumed. This is used by the message send code to determine whether it should calculate the CRC for the next piece processed. The trail of a message is implemented as a ceph pagelist. The routines defined for it will be usable for non-trail pagelist data as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-03-07 13:39:39 +08:00
return true;
}
static void ceph_msg_data_iter_cursor_init(struct ceph_msg_data_cursor *cursor,
size_t length)
{
struct ceph_msg_data *data = cursor->data;
cursor->iov_iter = data->iter;
cursor->lastlen = 0;
iov_iter_truncate(&cursor->iov_iter, length);
cursor->resid = iov_iter_count(&cursor->iov_iter);
}
static struct page *ceph_msg_data_iter_next(struct ceph_msg_data_cursor *cursor,
size_t *page_offset, size_t *length)
{
struct page *page;
ssize_t len;
if (cursor->lastlen)
iov_iter_revert(&cursor->iov_iter, cursor->lastlen);
len = iov_iter_get_pages2(&cursor->iov_iter, &page, PAGE_SIZE,
1, page_offset);
BUG_ON(len < 0);
cursor->lastlen = len;
/*
* FIXME: The assumption is that the pages represented by the iov_iter
* are pinned, with the references held by the upper-level
* callers, or by virtue of being under writeback. Eventually,
* we'll get an iov_iter_get_pages2 variant that doesn't take
* page refs. Until then, just put the page ref.
*/
VM_BUG_ON_PAGE(!PageWriteback(page) && page_count(page) < 2, page);
put_page(page);
*length = min_t(size_t, len, cursor->resid);
return page;
}
static bool ceph_msg_data_iter_advance(struct ceph_msg_data_cursor *cursor,
size_t bytes)
{
BUG_ON(bytes > cursor->resid);
cursor->resid -= bytes;
if (bytes < cursor->lastlen) {
cursor->lastlen -= bytes;
} else {
iov_iter_advance(&cursor->iov_iter, bytes - cursor->lastlen);
cursor->lastlen = 0;
}
return cursor->resid;
}
/*
* Message data is handled (sent or received) in pieces, where each
* piece resides on a single page. The network layer might not
* consume an entire piece at once. A data item's cursor keeps
* track of which piece is next to process and how much remains to
* be processed in that piece. It also tracks whether the current
* piece is the last one in the data item.
*/
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
static void __ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor)
{
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
size_t length = cursor->total_resid;
switch (cursor->data->type) {
case CEPH_MSG_DATA_PAGELIST:
ceph_msg_data_pagelist_cursor_init(cursor, length);
break;
case CEPH_MSG_DATA_PAGES:
ceph_msg_data_pages_cursor_init(cursor, length);
break;
#ifdef CONFIG_BLOCK
case CEPH_MSG_DATA_BIO:
ceph_msg_data_bio_cursor_init(cursor, length);
break;
#endif /* CONFIG_BLOCK */
case CEPH_MSG_DATA_BVECS:
ceph_msg_data_bvecs_cursor_init(cursor, length);
break;
case CEPH_MSG_DATA_ITER:
ceph_msg_data_iter_cursor_init(cursor, length);
break;
case CEPH_MSG_DATA_NONE:
default:
/* BUG(); */
break;
}
cursor->need_crc = true;
}
void ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor,
struct ceph_msg *msg, size_t length)
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
{
BUG_ON(!length);
BUG_ON(length > msg->data_length);
BUG_ON(!msg->num_data_items);
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
cursor->total_resid = length;
cursor->data = msg->data;
cursor->sr_resid = 0;
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
__ceph_msg_data_cursor_init(cursor);
}
/*
* Return the page containing the next piece to process for a given
* data item, and supply the page offset and length of that piece.
* Indicate whether this is the last piece in this data item.
*/
struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor,
size_t *page_offset, size_t *length)
{
struct page *page;
switch (cursor->data->type) {
case CEPH_MSG_DATA_PAGELIST:
page = ceph_msg_data_pagelist_next(cursor, page_offset, length);
break;
case CEPH_MSG_DATA_PAGES:
page = ceph_msg_data_pages_next(cursor, page_offset, length);
break;
#ifdef CONFIG_BLOCK
case CEPH_MSG_DATA_BIO:
page = ceph_msg_data_bio_next(cursor, page_offset, length);
break;
#endif /* CONFIG_BLOCK */
case CEPH_MSG_DATA_BVECS:
page = ceph_msg_data_bvecs_next(cursor, page_offset, length);
break;
case CEPH_MSG_DATA_ITER:
page = ceph_msg_data_iter_next(cursor, page_offset, length);
break;
case CEPH_MSG_DATA_NONE:
default:
page = NULL;
break;
}
BUG_ON(!page);
BUG_ON(*page_offset + *length > PAGE_SIZE);
BUG_ON(!*length);
BUG_ON(*length > cursor->resid);
return page;
}
/*
* Returns true if the result moves the cursor on to the next piece
* of the data item.
*/
void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor, size_t bytes)
{
bool new_piece;
BUG_ON(bytes > cursor->resid);
switch (cursor->data->type) {
case CEPH_MSG_DATA_PAGELIST:
new_piece = ceph_msg_data_pagelist_advance(cursor, bytes);
break;
case CEPH_MSG_DATA_PAGES:
new_piece = ceph_msg_data_pages_advance(cursor, bytes);
break;
#ifdef CONFIG_BLOCK
case CEPH_MSG_DATA_BIO:
new_piece = ceph_msg_data_bio_advance(cursor, bytes);
break;
#endif /* CONFIG_BLOCK */
case CEPH_MSG_DATA_BVECS:
new_piece = ceph_msg_data_bvecs_advance(cursor, bytes);
break;
case CEPH_MSG_DATA_ITER:
new_piece = ceph_msg_data_iter_advance(cursor, bytes);
break;
case CEPH_MSG_DATA_NONE:
default:
BUG();
break;
}
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
cursor->total_resid -= bytes;
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
if (!cursor->resid && cursor->total_resid) {
cursor->data++;
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
__ceph_msg_data_cursor_init(cursor);
libceph: fix two messenger bugs This patch makes four small changes in the ceph messenger. While getting copyup functionality working I found two bugs in the messenger. Existing paths through the code did not trigger these problems, but they're fixed here: - In ceph_msg_data_pagelist_cursor_init(), the cursor's last_piece field was being checked against the length supplied. This was OK until this commit: ccba6d98 libceph: implement multiple data items in a message That commit changed the cursor init routines to allow lengths to be supplied that exceeded the size of the current data item. Because of this, we have to use the assigned cursor resid field rather than the provided length in determining whether the cursor points to the last piece of a data item. - In ceph_msg_data_add_pages(), a BUG_ON() was erroneously catching attempts to add page data to a message if the message already had data assigned to it. That was OK until that same commit, at which point it was fine for messages to have multiple data items. It slipped through because that BUG_ON() call was present twice in that function. (You can never be too careful.) In addition two other minor things are changed: - In ceph_msg_data_cursor_init(), the local variable "data" was getting assigned twice. - In ceph_msg_data_advance(), it was assumed that the type-specific advance routine would set new_piece to true after it advanced past the last piece. That may have been fine, but since we check for that case we might as well set it explicitly in ceph_msg_data_advance(). This resolves: http://tracker.ceph.com/issues/4762 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-20 04:34:49 +08:00
new_piece = true;
libceph: implement multiple data items in a message This patch adds support to the messenger for more than one data item in its data list. A message data cursor has two more fields to support this: - a count of the number of bytes left to be consumed across all data items in the list, "total_resid" - a pointer to the head of the list (for validation only) The cursor initialization routine has been split into two parts: the outer one, which initializes the cursor for traversing the entire list of data items; and the inner one, which initializes the cursor to start processing a single data item. When a message cursor is first initialized, the outer initialization routine sets total_resid to the length provided. The data pointer is initialized to the first data item on the list. From there, the inner initialization routine finishes by setting up to process the data item the cursor points to. Advancing the cursor consumes bytes in total_resid. If the resid field reaches zero, it means the current data item is fully consumed. If total_resid indicates there is more data, the cursor is advanced to point to the next data item, and then the inner initialization routine prepares for using that. (A check is made at this point to make sure we don't wrap around the front of the list.) The type-specific init routines are modified so they can be given a length that's larger than what the data item can support. The resid field is initialized to the smaller of the provided length and the length of the entire data item. When total_resid reaches zero, we're done. This resolves: http://tracker.ceph.com/issues/3761 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-06 03:46:01 +08:00
}
libceph: fix two messenger bugs This patch makes four small changes in the ceph messenger. While getting copyup functionality working I found two bugs in the messenger. Existing paths through the code did not trigger these problems, but they're fixed here: - In ceph_msg_data_pagelist_cursor_init(), the cursor's last_piece field was being checked against the length supplied. This was OK until this commit: ccba6d98 libceph: implement multiple data items in a message That commit changed the cursor init routines to allow lengths to be supplied that exceeded the size of the current data item. Because of this, we have to use the assigned cursor resid field rather than the provided length in determining whether the cursor points to the last piece of a data item. - In ceph_msg_data_add_pages(), a BUG_ON() was erroneously catching attempts to add page data to a message if the message already had data assigned to it. That was OK until that same commit, at which point it was fine for messages to have multiple data items. It slipped through because that BUG_ON() call was present twice in that function. (You can never be too careful.) In addition two other minor things are changed: - In ceph_msg_data_cursor_init(), the local variable "data" was getting assigned twice. - In ceph_msg_data_advance(), it was assumed that the type-specific advance routine would set new_piece to true after it advanced past the last piece. That may have been fine, but since we check for that case we might as well set it explicitly in ceph_msg_data_advance(). This resolves: http://tracker.ceph.com/issues/4762 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-04-20 04:34:49 +08:00
cursor->need_crc = new_piece;
}
u32 ceph_crc32c_page(u32 crc, struct page *page, unsigned int page_offset,
unsigned int length)
{
char *kaddr;
kaddr = kmap(page);
BUG_ON(kaddr == NULL);
crc = crc32c(crc, kaddr + page_offset, length);
kunmap(page);
return crc;
}
bool ceph_addr_is_blank(const struct ceph_entity_addr *addr)
{
struct sockaddr_storage ss = addr->in_addr; /* align */
struct in_addr *addr4 = &((struct sockaddr_in *)&ss)->sin_addr;
struct in6_addr *addr6 = &((struct sockaddr_in6 *)&ss)->sin6_addr;
switch (ss.ss_family) {
case AF_INET:
return addr4->s_addr == htonl(INADDR_ANY);
case AF_INET6:
return ipv6_addr_any(addr6);
default:
return true;
}
}
EXPORT_SYMBOL(ceph_addr_is_blank);
int ceph_addr_port(const struct ceph_entity_addr *addr)
{
switch (get_unaligned(&addr->in_addr.ss_family)) {
case AF_INET:
return ntohs(get_unaligned(&((struct sockaddr_in *)&addr->in_addr)->sin_port));
case AF_INET6:
return ntohs(get_unaligned(&((struct sockaddr_in6 *)&addr->in_addr)->sin6_port));
}
return 0;
}
void ceph_addr_set_port(struct ceph_entity_addr *addr, int p)
{
switch (get_unaligned(&addr->in_addr.ss_family)) {
case AF_INET:
put_unaligned(htons(p), &((struct sockaddr_in *)&addr->in_addr)->sin_port);
break;
case AF_INET6:
put_unaligned(htons(p), &((struct sockaddr_in6 *)&addr->in_addr)->sin6_port);
break;
}
}
/*
* Unlike other *_pton function semantics, zero indicates success.
*/
static int ceph_pton(const char *str, size_t len, struct ceph_entity_addr *addr,
char delim, const char **ipend)
{
memset(&addr->in_addr, 0, sizeof(addr->in_addr));
if (in4_pton(str, len, (u8 *)&((struct sockaddr_in *)&addr->in_addr)->sin_addr.s_addr, delim, ipend)) {
put_unaligned(AF_INET, &addr->in_addr.ss_family);
return 0;
}
if (in6_pton(str, len, (u8 *)&((struct sockaddr_in6 *)&addr->in_addr)->sin6_addr.s6_addr, delim, ipend)) {
put_unaligned(AF_INET6, &addr->in_addr.ss_family);
return 0;
}
return -EINVAL;
}
/*
* Extract hostname string and resolve using kernel DNS facility.
*/
#ifdef CONFIG_CEPH_LIB_USE_DNS_RESOLVER
static int ceph_dns_resolve_name(const char *name, size_t namelen,
struct ceph_entity_addr *addr, char delim, const char **ipend)
{
const char *end, *delim_p;
char *colon_p, *ip_addr = NULL;
int ip_len, ret;
/*
* The end of the hostname occurs immediately preceding the delimiter or
* the port marker (':') where the delimiter takes precedence.
*/
delim_p = memchr(name, delim, namelen);
colon_p = memchr(name, ':', namelen);
if (delim_p && colon_p)
end = delim_p < colon_p ? delim_p : colon_p;
else if (!delim_p && colon_p)
end = colon_p;
else {
end = delim_p;
if (!end) /* case: hostname:/ */
end = name + namelen;
}
if (end <= name)
return -EINVAL;
/* do dns_resolve upcall */
ip_len = dns_query(current->nsproxy->net_ns,
NULL, name, end - name, NULL, &ip_addr, NULL, false);
if (ip_len > 0)
ret = ceph_pton(ip_addr, ip_len, addr, -1, NULL);
else
ret = -ESRCH;
kfree(ip_addr);
*ipend = end;
pr_info("resolve '%.*s' (ret=%d): %s\n", (int)(end - name), name,
ret, ret ? "failed" : ceph_pr_addr(addr));
return ret;
}
#else
static inline int ceph_dns_resolve_name(const char *name, size_t namelen,
struct ceph_entity_addr *addr, char delim, const char **ipend)
{
return -EINVAL;
}
#endif
/*
* Parse a server name (IP or hostname). If a valid IP address is not found
* then try to extract a hostname to resolve using userspace DNS upcall.
*/
static int ceph_parse_server_name(const char *name, size_t namelen,
struct ceph_entity_addr *addr, char delim, const char **ipend)
{
int ret;
ret = ceph_pton(name, namelen, addr, delim, ipend);
if (ret)
ret = ceph_dns_resolve_name(name, namelen, addr, delim, ipend);
return ret;
}
/*
* Parse an ip[:port] list into an addr array. Use the default
* monitor port if a port isn't specified.
*/
int ceph_parse_ips(const char *c, const char *end,
struct ceph_entity_addr *addr,
int max_count, int *count, char delim)
{
int i, ret = -EINVAL;
const char *p = c;
dout("parse_ips on '%.*s'\n", (int)(end-c), c);
for (i = 0; i < max_count; i++) {
char cur_delim = delim;
const char *ipend;
int port;
if (*p == '[') {
cur_delim = ']';
p++;
}
ret = ceph_parse_server_name(p, end - p, &addr[i], cur_delim,
&ipend);
if (ret)
goto bad;
ret = -EINVAL;
p = ipend;
if (cur_delim == ']') {
if (*p != ']') {
dout("missing matching ']'\n");
goto bad;
}
p++;
}
/* port? */
if (p < end && *p == ':') {
port = 0;
p++;
while (p < end && *p >= '0' && *p <= '9') {
port = (port * 10) + (*p - '0');
p++;
}
if (port == 0)
port = CEPH_MON_PORT;
else if (port > 65535)
goto bad;
} else {
port = CEPH_MON_PORT;
}
ceph_addr_set_port(&addr[i], port);
/*
* We want the type to be set according to ms_mode
* option, but options are normally parsed after mon
* addresses. Rather than complicating parsing, set
* to LEGACY and override in build_initial_monmap()
* for mon addresses and ceph_messenger_init() for
* ip option.
*/
addr[i].type = CEPH_ENTITY_ADDR_TYPE_LEGACY;
addr[i].nonce = 0;
dout("%s got %s\n", __func__, ceph_pr_addr(&addr[i]));
if (p == end)
break;
if (*p != delim)
goto bad;
p++;
}
if (p != end)
goto bad;
if (count)
*count = i + 1;
return 0;
bad:
return ret;
}
/*
* Process message. This happens in the worker thread. The callback should
* be careful not to do anything that waits on other incoming messages or it
* may deadlock.
*/
void ceph_con_process_message(struct ceph_connection *con)
{
libceph: clear msg->con in ceph_msg_release() only The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-03 00:13:58 +08:00
struct ceph_msg *msg = con->in_msg;
BUG_ON(con->in_msg->con != con);
con->in_msg = NULL;
/* if first message, set peer_name */
if (con->peer_name.type == 0)
con->peer_name = msg->hdr.src;
con->in_seq++;
mutex_unlock(&con->mutex);
dout("===== %p %llu from %s%lld %d=%s len %d+%d+%d (%u %u %u) =====\n",
msg, le64_to_cpu(msg->hdr.seq),
ENTITY_NAME(msg->hdr.src),
le16_to_cpu(msg->hdr.type),
ceph_msg_type_name(le16_to_cpu(msg->hdr.type)),
le32_to_cpu(msg->hdr.front_len),
le32_to_cpu(msg->hdr.middle_len),
le32_to_cpu(msg->hdr.data_len),
con->in_front_crc, con->in_middle_crc, con->in_data_crc);
con->ops->dispatch(con, msg);
mutex_lock(&con->mutex);
}
/*
* Atomically queue work on a connection after the specified delay.
* Bump @con reference to avoid races with connection teardown.
* Returns 0 if work was queued, or an error code otherwise.
*/
static int queue_con_delay(struct ceph_connection *con, unsigned long delay)
{
if (!con->ops->get(con)) {
dout("%s %p ref count 0\n", __func__, con);
return -ENOENT;
}
if (delay >= HZ)
delay = round_jiffies_relative(delay);
dout("%s %p %lu\n", __func__, con, delay);
if (!queue_delayed_work(ceph_msgr_wq, &con->work, delay)) {
dout("%s %p - already queued\n", __func__, con);
con->ops->put(con);
return -EBUSY;
}
return 0;
}
static void queue_con(struct ceph_connection *con)
{
(void) queue_con_delay(con, 0);
}
static void cancel_con(struct ceph_connection *con)
{
if (cancel_delayed_work(&con->work)) {
dout("%s %p\n", __func__, con);
con->ops->put(con);
}
}
static bool con_sock_closed(struct ceph_connection *con)
{
if (!ceph_con_flag_test_and_clear(con, CEPH_CON_F_SOCK_CLOSED))
return false;
#define CASE(x) \
case CEPH_CON_S_ ## x: \
con->error_msg = "socket closed (con state " #x ")"; \
break;
switch (con->state) {
CASE(CLOSED);
CASE(PREOPEN);
CASE(V1_BANNER);
CASE(V1_CONNECT_MSG);
CASE(V2_BANNER_PREFIX);
CASE(V2_BANNER_PAYLOAD);
CASE(V2_HELLO);
CASE(V2_AUTH);
CASE(V2_AUTH_SIGNATURE);
CASE(V2_SESSION_CONNECT);
CASE(V2_SESSION_RECONNECT);
CASE(OPEN);
CASE(STANDBY);
default:
BUG();
}
#undef CASE
return true;
}
static bool con_backoff(struct ceph_connection *con)
{
int ret;
if (!ceph_con_flag_test_and_clear(con, CEPH_CON_F_BACKOFF))
return false;
ret = queue_con_delay(con, con->delay);
if (ret) {
dout("%s: con %p FAILED to back off %lu\n", __func__,
con, con->delay);
BUG_ON(ret == -ENOENT);
ceph_con_flag_set(con, CEPH_CON_F_BACKOFF);
}
return true;
}
/* Finish fault handling; con->mutex must *not* be held here */
static void con_fault_finish(struct ceph_connection *con)
{
dout("%s %p\n", __func__, con);
/*
* in case we faulted due to authentication, invalidate our
* current tickets so that we can get new ones.
*/
if (con->v1.auth_retry) {
dout("auth_retry %d, invalidating\n", con->v1.auth_retry);
if (con->ops->invalidate_authorizer)
con->ops->invalidate_authorizer(con);
con->v1.auth_retry = 0;
}
if (con->ops->fault)
con->ops->fault(con);
}
/*
* Do some work on a connection. Drop a connection ref when we're done.
*/
static void ceph_con_workfn(struct work_struct *work)
{
struct ceph_connection *con = container_of(work, struct ceph_connection,
work.work);
bool fault;
mutex_lock(&con->mutex);
while (true) {
int ret;
if ((fault = con_sock_closed(con))) {
dout("%s: con %p SOCK_CLOSED\n", __func__, con);
break;
}
if (con_backoff(con)) {
dout("%s: con %p BACKOFF\n", __func__, con);
break;
}
if (con->state == CEPH_CON_S_STANDBY) {
dout("%s: con %p STANDBY\n", __func__, con);
break;
}
if (con->state == CEPH_CON_S_CLOSED) {
dout("%s: con %p CLOSED\n", __func__, con);
BUG_ON(con->sock);
break;
}
if (con->state == CEPH_CON_S_PREOPEN) {
dout("%s: con %p PREOPEN\n", __func__, con);
BUG_ON(con->sock);
}
if (ceph_msgr2(from_msgr(con->msgr)))
ret = ceph_con_v2_try_read(con);
else
ret = ceph_con_v1_try_read(con);
if (ret < 0) {
if (ret == -EAGAIN)
continue;
if (!con->error_msg)
con->error_msg = "socket error on read";
fault = true;
break;
}
if (ceph_msgr2(from_msgr(con->msgr)))
ret = ceph_con_v2_try_write(con);
else
ret = ceph_con_v1_try_write(con);
if (ret < 0) {
if (ret == -EAGAIN)
continue;
if (!con->error_msg)
con->error_msg = "socket error on write";
fault = true;
}
break; /* If we make it to here, we're done */
}
if (fault)
con_fault(con);
mutex_unlock(&con->mutex);
if (fault)
con_fault_finish(con);
con->ops->put(con);
}
/*
* Generic error/fault handler. A retry mechanism is used with
* exponential backoff
*/
static void con_fault(struct ceph_connection *con)
{
dout("fault %p state %d to peer %s\n",
con, con->state, ceph_pr_addr(&con->peer_addr));
pr_warn("%s%lld %s %s\n", ENTITY_NAME(con->peer_name),
ceph_pr_addr(&con->peer_addr), con->error_msg);
con->error_msg = NULL;
WARN_ON(con->state == CEPH_CON_S_STANDBY ||
con->state == CEPH_CON_S_CLOSED);
ceph_con_reset_protocol(con);
if (ceph_con_flag_test(con, CEPH_CON_F_LOSSYTX)) {
dout("fault on LOSSYTX channel, marking CLOSED\n");
con->state = CEPH_CON_S_CLOSED;
return;
}
/* Requeue anything that hasn't been acked */
list_splice_init(&con->out_sent, &con->out_queue);
/* If there are no messages queued or keepalive pending, place
* the connection in a STANDBY state */
if (list_empty(&con->out_queue) &&
!ceph_con_flag_test(con, CEPH_CON_F_KEEPALIVE_PENDING)) {
dout("fault %p setting STANDBY clearing WRITE_PENDING\n", con);
ceph_con_flag_clear(con, CEPH_CON_F_WRITE_PENDING);
con->state = CEPH_CON_S_STANDBY;
} else {
/* retry after a delay. */
con->state = CEPH_CON_S_PREOPEN;
if (!con->delay) {
con->delay = BASE_DELAY_INTERVAL;
} else if (con->delay < MAX_DELAY_INTERVAL) {
con->delay *= 2;
if (con->delay > MAX_DELAY_INTERVAL)
con->delay = MAX_DELAY_INTERVAL;
}
ceph_con_flag_set(con, CEPH_CON_F_BACKOFF);
queue_con(con);
}
}
void ceph_messenger_reset_nonce(struct ceph_messenger *msgr)
{
u32 nonce = le32_to_cpu(msgr->inst.addr.nonce) + 1000000;
msgr->inst.addr.nonce = cpu_to_le32(nonce);
ceph_encode_my_addr(msgr);
}
/*
* initialize a new messenger instance
*/
void ceph_messenger_init(struct ceph_messenger *msgr,
struct ceph_entity_addr *myaddr)
{
spin_lock_init(&msgr->global_seq_lock);
if (myaddr) {
memcpy(&msgr->inst.addr.in_addr, &myaddr->in_addr,
sizeof(msgr->inst.addr.in_addr));
ceph_addr_set_port(&msgr->inst.addr, 0);
}
/*
* Since nautilus, clients are identified using type ANY.
* For msgr1, ceph_encode_banner_addr() munges it to NONE.
*/
msgr->inst.addr.type = CEPH_ENTITY_ADDR_TYPE_ANY;
/* generate a random non-zero nonce */
do {
get_random_bytes(&msgr->inst.addr.nonce,
sizeof(msgr->inst.addr.nonce));
} while (!msgr->inst.addr.nonce);
ceph_encode_my_addr(msgr);
atomic_set(&msgr->stopping, 0);
write_pnet(&msgr->net, get_net(current->nsproxy->net_ns));
dout("%s %p\n", __func__, msgr);
}
void ceph_messenger_fini(struct ceph_messenger *msgr)
{
put_net(read_pnet(&msgr->net));
}
libceph: clear msg->con in ceph_msg_release() only The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-03 00:13:58 +08:00
static void msg_con_set(struct ceph_msg *msg, struct ceph_connection *con)
{
if (msg->con)
msg->con->ops->put(msg->con);
msg->con = con ? con->ops->get(con) : NULL;
BUG_ON(msg->con != con);
}
static void clear_standby(struct ceph_connection *con)
{
/* come back from STANDBY? */
if (con->state == CEPH_CON_S_STANDBY) {
dout("clear_standby %p and ++connect_seq\n", con);
con->state = CEPH_CON_S_PREOPEN;
con->v1.connect_seq++;
WARN_ON(ceph_con_flag_test(con, CEPH_CON_F_WRITE_PENDING));
WARN_ON(ceph_con_flag_test(con, CEPH_CON_F_KEEPALIVE_PENDING));
}
}
/*
* Queue up an outgoing message on the given connection.
*
* Consumes a ref on @msg.
*/
void ceph_con_send(struct ceph_connection *con, struct ceph_msg *msg)
{
/* set src+dst */
msg->hdr.src = con->msgr->inst.name;
BUG_ON(msg->front.iov_len != le32_to_cpu(msg->hdr.front_len));
msg->needs_out_seq = true;
mutex_lock(&con->mutex);
libceph: have messages take a connection reference There are essentially two types of ceph messages: incoming and outgoing. Outgoing messages are always allocated via ceph_msg_new(), and at the time of their allocation they are not associated with any particular connection. Incoming messages are always allocated via ceph_con_in_msg_alloc(), and they are initially associated with the connection from which incoming data will be placed into the message. When an outgoing message gets sent, it becomes associated with a connection and remains that way until the message is successfully sent. The association of an incoming message goes away at the point it is sent to an upper layer via a con->ops->dispatch method. This patch implements reference counting for all ceph messages, such that every message holds a reference (and a pointer) to a connection if and only if it is associated with that connection (as described above). For background, here is an explanation of the ceph message lifecycle, emphasizing when an association exists between a message and a connection. Outgoing Messages An outgoing message is "owned" by its allocator, from the time it is allocated in ceph_msg_new() up to the point it gets queued for sending in ceph_con_send(). Prior to that point the message's msg->con pointer is null; at the point it is queued for sending its message pointer is assigned to refer to the connection. At that time the message is inserted into a connection's out_queue list. When a message on the out_queue list has been sent to the socket layer to be put on the wire, it is transferred out of that list and into the connection's out_sent list. At that point it is still owned by the connection, and will remain so until an acknowledgement is received from the recipient that indicates the message was successfully transferred. When such an acknowledgement is received (in process_ack()), the message is removed from its list (in ceph_msg_remove()), at which point it is no longer associated with the connection. So basically, any time a message is on one of a connection's lists, it is associated with that connection. Reference counting outgoing messages can thus be done at the points a message is added to the out_queue (in ceph_con_send()) and the point it is removed from either its two lists (in ceph_msg_remove())--at which point its connection pointer becomes null. Incoming Messages When an incoming message on a connection is getting read (in read_partial_message()) and there is no message in con->in_msg, a new one is allocated using ceph_con_in_msg_alloc(). At that point the message is associated with the connection. Once that message has been completely and successfully read, it is passed to upper layer code using the connection's con->ops->dispatch method. At that point the association between the message and the connection no longer exists. Reference counting of connections for incoming messages can be done by taking a reference to the connection when the message gets allocated, and releasing that reference when it gets handed off using the dispatch method. We should never fail to get a connection reference for a message--the since the caller should already hold one. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-05 03:43:33 +08:00
if (con->state == CEPH_CON_S_CLOSED) {
dout("con_send %p closed, dropping %p\n", con, msg);
ceph_msg_put(msg);
mutex_unlock(&con->mutex);
return;
}
libceph: clear msg->con in ceph_msg_release() only The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-03 00:13:58 +08:00
msg_con_set(msg, con);
libceph: have messages take a connection reference There are essentially two types of ceph messages: incoming and outgoing. Outgoing messages are always allocated via ceph_msg_new(), and at the time of their allocation they are not associated with any particular connection. Incoming messages are always allocated via ceph_con_in_msg_alloc(), and they are initially associated with the connection from which incoming data will be placed into the message. When an outgoing message gets sent, it becomes associated with a connection and remains that way until the message is successfully sent. The association of an incoming message goes away at the point it is sent to an upper layer via a con->ops->dispatch method. This patch implements reference counting for all ceph messages, such that every message holds a reference (and a pointer) to a connection if and only if it is associated with that connection (as described above). For background, here is an explanation of the ceph message lifecycle, emphasizing when an association exists between a message and a connection. Outgoing Messages An outgoing message is "owned" by its allocator, from the time it is allocated in ceph_msg_new() up to the point it gets queued for sending in ceph_con_send(). Prior to that point the message's msg->con pointer is null; at the point it is queued for sending its message pointer is assigned to refer to the connection. At that time the message is inserted into a connection's out_queue list. When a message on the out_queue list has been sent to the socket layer to be put on the wire, it is transferred out of that list and into the connection's out_sent list. At that point it is still owned by the connection, and will remain so until an acknowledgement is received from the recipient that indicates the message was successfully transferred. When such an acknowledgement is received (in process_ack()), the message is removed from its list (in ceph_msg_remove()), at which point it is no longer associated with the connection. So basically, any time a message is on one of a connection's lists, it is associated with that connection. Reference counting outgoing messages can thus be done at the points a message is added to the out_queue (in ceph_con_send()) and the point it is removed from either its two lists (in ceph_msg_remove())--at which point its connection pointer becomes null. Incoming Messages When an incoming message on a connection is getting read (in read_partial_message()) and there is no message in con->in_msg, a new one is allocated using ceph_con_in_msg_alloc(). At that point the message is associated with the connection. Once that message has been completely and successfully read, it is passed to upper layer code using the connection's con->ops->dispatch method. At that point the association between the message and the connection no longer exists. Reference counting of connections for incoming messages can be done by taking a reference to the connection when the message gets allocated, and releasing that reference when it gets handed off using the dispatch method. We should never fail to get a connection reference for a message--the since the caller should already hold one. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2012-06-05 03:43:33 +08:00
BUG_ON(!list_empty(&msg->list_head));
list_add_tail(&msg->list_head, &con->out_queue);
dout("----- %p to %s%lld %d=%s len %d+%d+%d -----\n", msg,
ENTITY_NAME(con->peer_name), le16_to_cpu(msg->hdr.type),
ceph_msg_type_name(le16_to_cpu(msg->hdr.type)),
le32_to_cpu(msg->hdr.front_len),
le32_to_cpu(msg->hdr.middle_len),
le32_to_cpu(msg->hdr.data_len));
clear_standby(con);
mutex_unlock(&con->mutex);
/* if there wasn't anything waiting to send before, queue
* new work */
if (!ceph_con_flag_test_and_set(con, CEPH_CON_F_WRITE_PENDING))
queue_con(con);
}
EXPORT_SYMBOL(ceph_con_send);
/*
* Revoke a message that was previously queued for send
*/
void ceph_msg_revoke(struct ceph_msg *msg)
{
struct ceph_connection *con = msg->con;
libceph: clear msg->con in ceph_msg_release() only The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-03 00:13:58 +08:00
if (!con) {
dout("%s msg %p null con\n", __func__, msg);
return; /* Message not in our possession */
libceph: clear msg->con in ceph_msg_release() only The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-03 00:13:58 +08:00
}
mutex_lock(&con->mutex);
if (list_empty(&msg->list_head)) {
WARN_ON(con->out_msg == msg);
dout("%s con %p msg %p not linked\n", __func__, con, msg);
mutex_unlock(&con->mutex);
return;
}
libceph: fix ceph_msg_revoke() There are a number of problems with revoking a "was sending" message: (1) We never make any attempt to revoke data - only kvecs contibute to con->out_skip. However, once the header (envelope) is written to the socket, our peer learns data_len and sets itself to expect at least data_len bytes to follow front or front+middle. If ceph_msg_revoke() is called while the messenger is sending message's data portion, anything we send after that call is counted by the OSD towards the now revoked message's data portion. The effects vary, the most common one is the eventual hang - higher layers get stuck waiting for the reply to the message that was sent out after ceph_msg_revoke() returned and treated by the OSD as a bunch of data bytes. This is what Matt ran into. (2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs is wrong. If ceph_msg_revoke() is called before the tag is sent out or while the messenger is sending the header, we will get a connection reset, either due to a bad tag (0 is not a valid tag) or a bad header CRC, which kind of defeats the purpose of revoke. Currently the kernel client refuses to work with header CRCs disabled, but that will likely change in the future, making this even worse. (3) con->out_skip is not reset on connection reset, leading to one or more spurious connection resets if we happen to get a real one between con->out_skip is set in ceph_msg_revoke() and before it's cleared in write_partial_skip(). Fixing (1) and (3) is trivial. The idea behind fixing (2) is to never zero the tag or the header, i.e. send out tag+header regardless of when ceph_msg_revoke() is called. That way the header is always correct, no unnecessary resets are induced and revoke stands ready for disabled CRCs. Since ceph_msg_revoke() rips out con->out_msg, introduce a new "message out temp" and copy the header into it before sending. Cc: stable@vger.kernel.org # 4.0+ Reported-by: Matt Conner <matt.conner@keepertech.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Matt Conner <matt.conner@keepertech.com> Reviewed-by: Sage Weil <sage@redhat.com>
2015-12-28 18:18:34 +08:00
dout("%s con %p msg %p was linked\n", __func__, con, msg);
msg->hdr.seq = 0;
ceph_msg_remove(msg);
if (con->out_msg == msg) {
WARN_ON(con->state != CEPH_CON_S_OPEN);
dout("%s con %p msg %p was sending\n", __func__, con, msg);
if (ceph_msgr2(from_msgr(con->msgr)))
ceph_con_v2_revoke(con);
else
ceph_con_v1_revoke(con);
ceph_msg_put(con->out_msg);
libceph: fix ceph_msg_revoke() There are a number of problems with revoking a "was sending" message: (1) We never make any attempt to revoke data - only kvecs contibute to con->out_skip. However, once the header (envelope) is written to the socket, our peer learns data_len and sets itself to expect at least data_len bytes to follow front or front+middle. If ceph_msg_revoke() is called while the messenger is sending message's data portion, anything we send after that call is counted by the OSD towards the now revoked message's data portion. The effects vary, the most common one is the eventual hang - higher layers get stuck waiting for the reply to the message that was sent out after ceph_msg_revoke() returned and treated by the OSD as a bunch of data bytes. This is what Matt ran into. (2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs is wrong. If ceph_msg_revoke() is called before the tag is sent out or while the messenger is sending the header, we will get a connection reset, either due to a bad tag (0 is not a valid tag) or a bad header CRC, which kind of defeats the purpose of revoke. Currently the kernel client refuses to work with header CRCs disabled, but that will likely change in the future, making this even worse. (3) con->out_skip is not reset on connection reset, leading to one or more spurious connection resets if we happen to get a real one between con->out_skip is set in ceph_msg_revoke() and before it's cleared in write_partial_skip(). Fixing (1) and (3) is trivial. The idea behind fixing (2) is to never zero the tag or the header, i.e. send out tag+header regardless of when ceph_msg_revoke() is called. That way the header is always correct, no unnecessary resets are induced and revoke stands ready for disabled CRCs. Since ceph_msg_revoke() rips out con->out_msg, introduce a new "message out temp" and copy the header into it before sending. Cc: stable@vger.kernel.org # 4.0+ Reported-by: Matt Conner <matt.conner@keepertech.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Matt Conner <matt.conner@keepertech.com> Reviewed-by: Sage Weil <sage@redhat.com>
2015-12-28 18:18:34 +08:00
con->out_msg = NULL;
} else {
dout("%s con %p msg %p not current, out_msg %p\n", __func__,
con, msg, con->out_msg);
}
mutex_unlock(&con->mutex);
}
/*
* Revoke a message that we may be reading data into
*/
void ceph_msg_revoke_incoming(struct ceph_msg *msg)
{
libceph: clear msg->con in ceph_msg_release() only The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-03 00:13:58 +08:00
struct ceph_connection *con = msg->con;
libceph: clear msg->con in ceph_msg_release() only The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-03 00:13:58 +08:00
if (!con) {
dout("%s msg %p null con\n", __func__, msg);
return; /* Message not in our possession */
}
mutex_lock(&con->mutex);
if (con->in_msg == msg) {
WARN_ON(con->state != CEPH_CON_S_OPEN);
dout("%s con %p msg %p was recving\n", __func__, con, msg);
if (ceph_msgr2(from_msgr(con->msgr)))
ceph_con_v2_revoke_incoming(con);
else
ceph_con_v1_revoke_incoming(con);
ceph_msg_put(con->in_msg);
con->in_msg = NULL;
} else {
dout("%s con %p msg %p not current, in_msg %p\n", __func__,
con, msg, con->in_msg);
}
mutex_unlock(&con->mutex);
}
/*
* Queue a keepalive byte to ensure the tcp connection is alive.
*/
void ceph_con_keepalive(struct ceph_connection *con)
{
dout("con_keepalive %p\n", con);
mutex_lock(&con->mutex);
clear_standby(con);
ceph_con_flag_set(con, CEPH_CON_F_KEEPALIVE_PENDING);
mutex_unlock(&con->mutex);
if (!ceph_con_flag_test_and_set(con, CEPH_CON_F_WRITE_PENDING))
queue_con(con);
}
EXPORT_SYMBOL(ceph_con_keepalive);
bool ceph_con_keepalive_expired(struct ceph_connection *con,
unsigned long interval)
{
if (interval > 0 &&
(con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2)) {
struct timespec64 now;
struct timespec64 ts;
ktime_get_real_ts64(&now);
jiffies_to_timespec64(interval, &ts);
ts = timespec64_add(con->last_keepalive_ack, ts);
return timespec64_compare(&now, &ts) >= 0;
}
return false;
}
static struct ceph_msg_data *ceph_msg_data_add(struct ceph_msg *msg)
{
BUG_ON(msg->num_data_items >= msg->max_data_items);
return &msg->data[msg->num_data_items++];
}
static void ceph_msg_data_destroy(struct ceph_msg_data *data)
{
if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
int num_pages = calc_pages_for(data->alignment, data->length);
ceph_release_page_vector(data->pages, num_pages);
} else if (data->type == CEPH_MSG_DATA_PAGELIST) {
ceph_pagelist_release(data->pagelist);
}
}
void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
size_t length, size_t alignment, bool own_pages)
{
struct ceph_msg_data *data;
BUG_ON(!pages);
BUG_ON(!length);
data = ceph_msg_data_add(msg);
data->type = CEPH_MSG_DATA_PAGES;
data->pages = pages;
data->length = length;
data->alignment = alignment & ~PAGE_MASK;
data->own_pages = own_pages;
msg->data_length += length;
}
EXPORT_SYMBOL(ceph_msg_data_add_pages);
void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
struct ceph_pagelist *pagelist)
{
struct ceph_msg_data *data;
BUG_ON(!pagelist);
BUG_ON(!pagelist->length);
data = ceph_msg_data_add(msg);
data->type = CEPH_MSG_DATA_PAGELIST;
refcount_inc(&pagelist->refcnt);
data->pagelist = pagelist;
msg->data_length += pagelist->length;
}
EXPORT_SYMBOL(ceph_msg_data_add_pagelist);
#ifdef CONFIG_BLOCK
void ceph_msg_data_add_bio(struct ceph_msg *msg, struct ceph_bio_iter *bio_pos,
u32 length)
{
struct ceph_msg_data *data;
data = ceph_msg_data_add(msg);
data->type = CEPH_MSG_DATA_BIO;
data->bio_pos = *bio_pos;
data->bio_length = length;
msg->data_length += length;
}
EXPORT_SYMBOL(ceph_msg_data_add_bio);
#endif /* CONFIG_BLOCK */
void ceph_msg_data_add_bvecs(struct ceph_msg *msg,
struct ceph_bvec_iter *bvec_pos)
{
struct ceph_msg_data *data;
data = ceph_msg_data_add(msg);
data->type = CEPH_MSG_DATA_BVECS;
data->bvec_pos = *bvec_pos;
msg->data_length += bvec_pos->iter.bi_size;
}
EXPORT_SYMBOL(ceph_msg_data_add_bvecs);
void ceph_msg_data_add_iter(struct ceph_msg *msg,
struct iov_iter *iter)
{
struct ceph_msg_data *data;
data = ceph_msg_data_add(msg);
data->type = CEPH_MSG_DATA_ITER;
data->iter = *iter;
msg->data_length += iov_iter_count(&data->iter);
}
/*
* construct a new message with given type, size
* the new msg has a ref count of 1.
*/
struct ceph_msg *ceph_msg_new2(int type, int front_len, int max_data_items,
gfp_t flags, bool can_fail)
{
struct ceph_msg *m;
m = kmem_cache_zalloc(ceph_msg_cache, flags);
if (m == NULL)
goto out;
m->hdr.type = cpu_to_le16(type);
m->hdr.priority = cpu_to_le16(CEPH_MSG_PRIO_DEFAULT);
m->hdr.front_len = cpu_to_le32(front_len);
INIT_LIST_HEAD(&m->list_head);
kref_init(&m->kref);
/* front */
if (front_len) {
m->front.iov_base = kvmalloc(front_len, flags);
if (m->front.iov_base == NULL) {
dout("ceph_msg_new can't allocate %d bytes\n",
front_len);
goto out2;
}
} else {
m->front.iov_base = NULL;
}
m->front_alloc_len = m->front.iov_len = front_len;
if (max_data_items) {
m->data = kmalloc_array(max_data_items, sizeof(*m->data),
flags);
if (!m->data)
goto out2;
m->max_data_items = max_data_items;
}
dout("ceph_msg_new %p front %d\n", m, front_len);
return m;
out2:
ceph_msg_put(m);
out:
if (!can_fail) {
pr_err("msg_new can't create type %d front %d\n", type,
front_len);
WARN_ON(1);
} else {
dout("msg_new can't create type %d front %d\n", type,
front_len);
}
return NULL;
}
EXPORT_SYMBOL(ceph_msg_new2);
struct ceph_msg *ceph_msg_new(int type, int front_len, gfp_t flags,
bool can_fail)
{
return ceph_msg_new2(type, front_len, 0, flags, can_fail);
}
EXPORT_SYMBOL(ceph_msg_new);
/*
* Allocate "middle" portion of a message, if it is needed and wasn't
* allocated by alloc_msg. This allows us to read a small fixed-size
* per-type header in the front and then gracefully fail (i.e.,
* propagate the error to the caller based on info in the front) when
* the middle is too large.
*/
static int ceph_alloc_middle(struct ceph_connection *con, struct ceph_msg *msg)
{
int type = le16_to_cpu(msg->hdr.type);
int middle_len = le32_to_cpu(msg->hdr.middle_len);
dout("alloc_middle %p type %d %s middle_len %d\n", msg, type,
ceph_msg_type_name(type), middle_len);
BUG_ON(!middle_len);
BUG_ON(msg->middle);
msg->middle = ceph_buffer_new(middle_len, GFP_NOFS);
if (!msg->middle)
return -ENOMEM;
return 0;
}
/*
* Allocate a message for receiving an incoming message on a
* connection, and save the result in con->in_msg. Uses the
* connection's private alloc_msg op if available.
*
* Returns 0 on success, or a negative error code.
*
* On success, if we set *skip = 1:
* - the next message should be skipped and ignored.
* - con->in_msg == NULL
* or if we set *skip = 0:
* - con->in_msg is non-null.
* On error (ENOMEM, EAGAIN, ...),
* - con->in_msg == NULL
*/
int ceph_con_in_msg_alloc(struct ceph_connection *con,
struct ceph_msg_header *hdr, int *skip)
{
int middle_len = le32_to_cpu(hdr->middle_len);
struct ceph_msg *msg;
int ret = 0;
BUG_ON(con->in_msg != NULL);
BUG_ON(!con->ops->alloc_msg);
mutex_unlock(&con->mutex);
msg = con->ops->alloc_msg(con, hdr, skip);
mutex_lock(&con->mutex);
if (con->state != CEPH_CON_S_OPEN) {
if (msg)
ceph_msg_put(msg);
return -EAGAIN;
}
if (msg) {
BUG_ON(*skip);
libceph: clear msg->con in ceph_msg_release() only The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-03 00:13:58 +08:00
msg_con_set(msg, con);
con->in_msg = msg;
} else {
/*
* Null message pointer means either we should skip
* this message or we couldn't allocate memory. The
* former is not an error.
*/
if (*skip)
return 0;
con->error_msg = "error allocating memory for incoming message";
return -ENOMEM;
}
memcpy(&con->in_msg->hdr, hdr, sizeof(*hdr));
if (middle_len && !con->in_msg->middle) {
ret = ceph_alloc_middle(con, con->in_msg);
if (ret < 0) {
ceph_msg_put(con->in_msg);
con->in_msg = NULL;
}
}
return ret;
}
void ceph_con_get_out_msg(struct ceph_connection *con)
{
struct ceph_msg *msg;
BUG_ON(list_empty(&con->out_queue));
msg = list_first_entry(&con->out_queue, struct ceph_msg, list_head);
WARN_ON(msg->con != con);
/*
* Put the message on "sent" list using a ref from ceph_con_send().
* It is put when the message is acked or revoked.
*/
list_move_tail(&msg->list_head, &con->out_sent);
/*
* Only assign outgoing seq # if we haven't sent this message
* yet. If it is requeued, resend with it's original seq.
*/
if (msg->needs_out_seq) {
msg->hdr.seq = cpu_to_le64(++con->out_seq);
msg->needs_out_seq = false;
if (con->ops->reencode_message)
con->ops->reencode_message(msg);
}
/*
* Get a ref for out_msg. It is put when we are done sending the
* message or in case of a fault.
*/
WARN_ON(con->out_msg);
con->out_msg = ceph_msg_get(msg);
}
/*
* Free a generically kmalloc'd message.
*/
static void ceph_msg_free(struct ceph_msg *m)
{
dout("%s %p\n", __func__, m);
kvfree(m->front.iov_base);
kfree(m->data);
kmem_cache_free(ceph_msg_cache, m);
}
static void ceph_msg_release(struct kref *kref)
{
struct ceph_msg *m = container_of(kref, struct ceph_msg, kref);
int i;
dout("%s %p\n", __func__, m);
WARN_ON(!list_empty(&m->list_head));
libceph: clear msg->con in ceph_msg_release() only The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-03 00:13:58 +08:00
msg_con_set(m, NULL);
/* drop middle, data, if any */
if (m->middle) {
ceph_buffer_put(m->middle);
m->middle = NULL;
}
for (i = 0; i < m->num_data_items; i++)
ceph_msg_data_destroy(&m->data[i]);
if (m->pool)
ceph_msgpool_put(m->pool, m);
else
ceph_msg_free(m);
}
struct ceph_msg *ceph_msg_get(struct ceph_msg *msg)
{
dout("%s %p (was %d)\n", __func__, msg,
kref_read(&msg->kref));
kref_get(&msg->kref);
return msg;
}
EXPORT_SYMBOL(ceph_msg_get);
void ceph_msg_put(struct ceph_msg *msg)
{
dout("%s %p (was %d)\n", __func__, msg,
kref_read(&msg->kref));
kref_put(&msg->kref, ceph_msg_release);
}
EXPORT_SYMBOL(ceph_msg_put);
void ceph_msg_dump(struct ceph_msg *msg)
{
pr_debug("msg_dump %p (front_alloc_len %d length %zd)\n", msg,
msg->front_alloc_len, msg->data_length);
print_hex_dump(KERN_DEBUG, "header: ",
DUMP_PREFIX_OFFSET, 16, 1,
&msg->hdr, sizeof(msg->hdr), true);
print_hex_dump(KERN_DEBUG, " front: ",
DUMP_PREFIX_OFFSET, 16, 1,
msg->front.iov_base, msg->front.iov_len, true);
if (msg->middle)
print_hex_dump(KERN_DEBUG, "middle: ",
DUMP_PREFIX_OFFSET, 16, 1,
msg->middle->vec.iov_base,
msg->middle->vec.iov_len, true);
print_hex_dump(KERN_DEBUG, "footer: ",
DUMP_PREFIX_OFFSET, 16, 1,
&msg->footer, sizeof(msg->footer), true);
}
EXPORT_SYMBOL(ceph_msg_dump);